Data Scientist Technical Test
Threat: Scraping
John Modin
Binary classifier, e.g. SVM (supervised learning)
- Machine learning is powerful when enough data is available
- Can leverage data from other servers
- Labeled data can be created using simulated attacks
- Drawback: Requires creative simulations
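A minimal sketch of the idea above, assuming scikit-learn is available: traffic labeled as normal is combined with traffic from simulated attacks, and an SVM is trained on the result. The two feature columns and every number below are made-up toy values chosen for illustration, not measurements from a real server.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy features per session: [requests per session, seconds per request].
# Both distributions are invented for this sketch.
normal = np.column_stack([rng.poisson(8, 200), rng.gamma(4.0, 3.0, 200)])
simulated_attack = np.column_stack([rng.poisson(150, 200), rng.gamma(2.0, 0.2, 200)])

X = np.vstack([normal, simulated_attack])
# Label 0 = normal traffic, label 1 = simulated scraper.
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

# Scale the features first, then fit an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

# Classify two unseen sessions: one fast and heavy, one slow and light.
prediction = clf.predict([[120, 0.3], [6, 15.0]])
```

The classifier can only be as good as the simulations: if real scrapers behave differently from the simulated ones, the labels will not transfer.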
Exploratory data analysis
- Invariants
- Correlations
- Labeled plots
- Anything suspicious?
- Any heuristics?
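The first two checks in this list could be sketched with pandas as follows. This reads the first bullet as "columns that never vary", which is an interpretation rather than something the slides state. The column names and the `is_scraper` label are hypothetical.

```python
import pandas as pd

def quick_eda(features: pd.DataFrame, label_col: str = "is_scraper"):
    """Return columns that never vary and each feature's correlation with the label."""
    numeric = features.select_dtypes("number")
    # A column with a single distinct value carries no information.
    constant_cols = [c for c in numeric.columns if numeric[c].nunique() <= 1]
    varying = numeric.drop(columns=constant_cols)
    # Pearson correlation of every remaining feature with the label.
    corr_with_label = (
        varying.corr()[label_col].drop(label_col).sort_values(ascending=False)
    )
    return constant_cols, corr_with_label

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "requests_per_session": [3, 5, 200, 180],
    "protocol_version": [1, 1, 1, 1],
    "is_scraper": [0, 0, 1, 1],
})
constant_cols, corr_with_label = quick_eda(toy)
```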
Create features
Pick model
- Try good candidates on the cross-validation set
Tune model
- e.g. weights, regularization
- Plot accuracy, precision, and recall
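The pick-and-tune steps above might look like this in scikit-learn. It is a sketch under assumptions: the data is synthetic (`make_classification`), logistic regression is just one example of a second candidate, and the parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for labeled traffic features.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Step 1: compare candidate models with 5-fold cross-validation.
candidates = {
    "svm": make_pipeline(StandardScaler(), SVC(class_weight="balanced")),
    "logreg": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
}
scoring = ["accuracy", "precision", "recall"]
cv_scores = {}
for name, model in candidates.items():
    result = cross_validate(model, X, y, cv=5, scoring=scoring)
    cv_scores[name] = {m: result[f"test_{m}"].mean() for m in scoring}

# Step 2: tune the chosen model (here the SVM) over regularization
# strength C and class weighting, optimizing for recall.
search = GridSearchCV(
    candidates["svm"],
    param_grid={"svc__C": [0.1, 1, 10], "svc__class_weight": [None, "balanced"]},
    scoring="recall",
    cv=5,
)
search.fit(X, y)
best_params = search.best_params_
```

The per-model values in `cv_scores` are the numbers one would plot to compare accuracy, precision, and recall side by side.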
Data processing:
- Extract features from the data; these features are tied to IP addresses.
- Normalize features (depending on model, for SVM this is a good idea)
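A short sketch of the normalization step, assuming scikit-learn's `StandardScaler`. SVMs rely on distances between points, so a feature with a large numeric range would otherwise dominate. The feature values are toy numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy per-IP features: [requests per session, mean seconds per request].
features = np.array([
    [3.0, 12.0],
    [5.0, 8.0],
    [200.0, 0.2],
    [180.0, 0.3],
])

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
```

In practice the scaler should be fitted on training data only and then reused on new data via `scaler.transform`.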
Possible features:
- # requests per session
- time per request
- var(time session start)
- var(#requests)
- session during request peak?
- periodicity to "similar" sessions
- hidden links/sites visited? (hidden with CSS)
- request features (e.g. GET/POST, any content/searches?)
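Several of the features listed above can be derived from a request log with pandas. The sketch below assumes a hypothetical log with columns `ip`, `session_id`, `timestamp`, and `method`; the real log schema is not given in the slides, and the rows are toy data.

```python
import pandas as pd

def per_ip_features(log: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a request log into one feature row per IP address."""
    df = log.copy()
    # Seconds since the first request in the log.
    df["t_s"] = (df["timestamp"] - df["timestamp"].min()).dt.total_seconds()
    df["is_post"] = (df["method"] == "POST").astype(float)

    # First aggregate per session...
    sessions = df.groupby(["ip", "session_id"]).agg(
        n_requests=("t_s", "count"),
        start_s=("t_s", "min"),
        end_s=("t_s", "max"),
        post_share=("is_post", "mean"),
    )
    sessions["time_per_request_s"] = (
        sessions["end_s"] - sessions["start_s"]
    ) / sessions["n_requests"]

    # ...then per IP. Variances are NaN for IPs with a single session.
    return sessions.groupby(level="ip").agg(
        requests_per_session=("n_requests", "mean"),
        var_requests=("n_requests", "var"),
        var_session_start=("start_s", "var"),
        time_per_request_s=("time_per_request_s", "mean"),
        post_share=("post_share", "mean"),
    )

# Toy log: one IP with two short sessions, one IP with a rapid burst.
log = pd.DataFrame({
    "ip": ["10.0.0.1"] * 6 + ["10.0.0.2"] * 6,
    "session_id": ["a", "a", "b", "b", "b", "b"] + ["c"] * 6,
    "timestamp": pd.to_datetime("2024-01-01") + pd.to_timedelta(
        [0, 30, 600, 640, 700, 760, 5, 6, 7, 8, 9, 10], unit="s"
    ),
    "method": ["GET", "POST", "GET", "GET", "GET", "POST"] + ["GET"] * 6,
})
features = per_ip_features(log)
```

Features such as periodicity or visits to hidden links need additional inputs (longer history, page markup) and are left out of this sketch.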
Tech stack
- Language: Python
- Data processing/ML: sklearn, numpy, pandas
- Task manager: luigi
- Storage: HDFS? (Depends on data size)
- Compute engine: Spark? (Depends on data size)