TrafficDefender

Data Scientist Technical Test

Threat: Scraping

John Modin

Approach:

Binary classifier, e.g. SVM (supervised learning)

  • Machine learning is powerful when enough data is available
  • Can leverage data from other servers
  • Labeled data can be created using simulated attacks
  • Drawback: Requires creative simulations
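A minimal sketch of this approach, assuming labeled data comes from simulated attacks. The two features, their distributions, and the class sizes are invented for illustration and are not measured traffic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical per-session features: [requests per session, mean seconds per request].
# Both distributions are assumptions made up for this sketch.
normal = np.column_stack([rng.poisson(20, 500), rng.normal(8.0, 2.0, 500)])
scraper = np.column_stack([rng.poisson(200, 500), rng.normal(0.5, 0.1, 500)])

X = np.vstack([normal, scraper]).astype(float)
# Label 1 marks sessions produced by the simulated attack, 0 marks normal traffic.
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale first, then fit an RBF-kernel SVM as the binary classifier.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
```

Real traffic will overlap far more than these synthetic clusters, which is where the quality of the simulations matters.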

Exploratory data analysis

  • Invariants (features that never vary)
  • Correlations
  • Labeled plots
  • Anything suspicious?
  • Any heuristics?
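The first checks in this list could be sketched with pandas as follows. The table, its column names, and its values are hypothetical and stand in for a per-IP feature table.

```python
import pandas as pd

# Hypothetical per-IP feature table; all values are invented for illustration.
df = pd.DataFrame({
    "requests": [12, 15, 300, 9, 280, 14],
    "mean_gap_s": [7.5, 6.0, 0.4, 9.1, 0.5, 8.0],
    "protocol_version": [1, 1, 1, 1, 1, 1],  # never varies
    "is_simulated_attack": [0, 0, 1, 0, 1, 0],
})

# Columns that never vary carry no information for a classifier.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
varying = df.drop(columns=constant_cols)

# Pairwise Pearson correlations between the remaining columns.
corr = varying.corr()

# Per-label means: a quick look for obvious separations or simple heuristics.
by_label = varying.groupby("is_simulated_attack").mean()
```

A feature that correlates strongly with the label, as `requests` does here by construction, is a candidate for a simple heuristic as well as a model input.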

Create features

Pick model

  • Try good candidates on the cross-validation set
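One way to compare candidates is k-fold cross-validation, sketched below. The three candidate models, the F1 scoring choice, and the synthetic data from `make_classification` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)

# Hypothetical shortlist of candidates; scale-sensitive models get a scaler.
candidates = {
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean F1 over 5 folds for each candidate.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)
```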

Tune model

  • e.g. weights, regularization
  • Plot accuracy, precision, and recall
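A hedged sketch of the tuning step, assuming "weights" means SVM class weights and "regularization" means the `C` parameter. The grid values and the imbalanced synthetic data are invented. The resulting metrics can then be plotted with any plotting library.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced synthetic data: roughly 10% of samples belong to the positive class.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Assumed grid: C controls regularization strength, class_weight reweights the classes.
param_grid = {"svm__C": [0.1, 1.0, 10.0], "svm__class_weight": [None, "balanced"]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out split.
y_pred = search.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
}
```

On imbalanced traffic, accuracy alone can look good while recall on the attack class is poor, which is why all three metrics are tracked.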

Data processing:

  • Extract features from the data; these features are tied to IP addresses
  • Normalize features (depends on the model; for an SVM this is a good idea)
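The two steps above can be sketched as a per-IP aggregation followed by standardization. The log schema (`ip`, `timestamp`, `method`) and the three derived features are assumptions, not a known log format.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical request log; the schema and values are invented for illustration.
log = pd.DataFrame({
    "ip": ["10.0.0.1"] * 3 + ["10.0.0.2"] * 4,
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:00:30", "2024-01-01 10:01:30",
        "2024-01-01 10:00:00", "2024-01-01 10:00:01",
        "2024-01-01 10:00:02", "2024-01-01 10:00:03",
    ]),
    "method": ["GET", "POST", "GET", "GET", "GET", "GET", "GET"],
})

# Seconds between consecutive requests from the same IP (NaN for the first request).
log = log.sort_values(["ip", "timestamp"])
log["gap_s"] = log.groupby("ip")["timestamp"].diff().dt.total_seconds()

# One feature row per IP address.
features = log.groupby("ip").agg(
    n_requests=("timestamp", "size"),
    mean_gap_s=("gap_s", "mean"),
    post_fraction=("method", lambda m: (m == "POST").mean()),
)

# Standardize to zero mean and unit variance, since SVMs are sensitive to feature scale.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    index=features.index,
    columns=features.columns,
)
```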

Possible features:

  • # requests per session
  • time per request
  • var(session start time)
  • var(#requests)
  • session during request peak?
  • periodicity to "similar" sessions
  • hidden links/sites visited? (hidden with CSS)
  • request features (e.g. GET / POST, any content/searches?)
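Several of these features could be computed as sketched below. The 30-minute inactivity timeout that defines a session, the `/trap` hidden path, and the log schema are all assumptions made for this example.

```python
import pandas as pd

SESSION_TIMEOUT = pd.Timedelta(minutes=30)  # assumed inactivity threshold
HIDDEN_PATHS = ["/trap"]                    # hypothetical CSS-hidden link

# Hypothetical request log; values are invented for illustration.
log = pd.DataFrame({
    "ip": ["10.0.0.1"] * 5 + ["10.0.0.2"] * 6,
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00:00", "2024-01-01 09:05:00",
        "2024-01-01 14:00:00", "2024-01-01 14:10:00", "2024-01-01 14:20:00",
        "2024-01-01 03:00:00", "2024-01-01 03:00:01", "2024-01-01 03:00:02",
        "2024-01-02 03:00:00", "2024-01-02 03:00:01", "2024-01-02 03:00:02",
    ]),
    "path": ["/", "/products", "/", "/about", "/contact",
             "/", "/trap", "/products", "/", "/trap", "/products"],
})

log = log.sort_values(["ip", "timestamp"]).reset_index(drop=True)
log["is_hidden"] = log["path"].isin(HIDDEN_PATHS)

# A new session starts at an IP's first request or after a long gap.
gap = log.groupby("ip")["timestamp"].diff()
log["session_id"] = (gap.isna() | (gap > SESSION_TIMEOUT)).cumsum()

# Per-session features.
sessions = log.groupby(["ip", "session_id"]).agg(
    n_requests=("path", "size"),
    start=("timestamp", "min"),
    end=("timestamp", "max"),
    hit_hidden=("is_hidden", "any"),
)
duration_s = (sessions["end"] - sessions["start"]).dt.total_seconds()
sessions["s_per_request"] = duration_s / sessions["n_requests"]
sessions["start_hour"] = sessions["start"].dt.hour + sessions["start"].dt.minute / 60

# Per-IP features: variance of request counts and of session start times.
per_ip = sessions.groupby(level="ip").agg(
    n_sessions=("n_requests", "size"),
    var_requests=("n_requests", "var"),
    var_start_hour=("start_hour", "var"),
    any_hidden=("hit_hidden", "any"),
)
```

In this toy log the second IP starts identical sessions at the same time each day and follows the hidden link, so both variances are zero and `any_hidden` is true.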

Tech stack

  • Language: Python
  • Data processing/ML: sklearn, numpy, pandas
  • Task manager: luigi
  • Storage: HDFS? (Depends on data size)
  • Compute engine: Spark? (Depends on data size)
