Explaining machineJS To My Parents Part 3:

Training, Tuning, Selecting, and Validating your machine learning algorithms

Today's Tour

  • Overfitting
  • Cross-Validation
    • Defines "right" for us
  • Tuning
    • Finding the right version of a given algorithm
  • Selecting
    • Finding the right algorithm
  • Ensembling
    • Using machine learning to combine all your machine learning results!
  • machineJS does all this for you!

Overfitting

  • Machine learning is really good at learning all the trends in a data set...
    • ... sometimes too good
  • If you're not careful, it will just memorize the data you gave it
    • "There are far more efficient ways to store data than inside a random forest"
      • ​-mlwave.com​​

Overfitting Example

  • Let's say we're trying to figure out which dots belong to the blue group, and which belong to the red group:
  • As humans, we can see pretty clearly that the black line is a better differentiator
  • But the computer, if we're not careful, is going to get excited and learn the training data really well, drawing the hyperspecific green line
  • Clearly, the green line will not generalize to other data well!

image credit: mlwave.com

Overfitting: The Problem

  • Let's say we're trying to figure out which dots belong to the blue group, and which belong to the red group:
  • Again, the point of machine learning is to make predictions on new data, so what we want is a general solution, not one that's hyperspecific to only this training data
  • And even though the black line will misclassify a couple of points in the training data, it will clearly make much more sense for new data than the green line

image credit: mlwave.com

Cross-Validation

  • The solution to overfitting!
  • Train the algorithm on one dataset, but then test it on a different dataset!
  • This prevents you from asking the machine "how well did you memorize the data I already gave you?"
    • hint: machines are really good at memorizing!
  • Instead, you're now asking how well it learned patterns that are broadly useful
  • "Ok, so let's see how well you can apply what you've learned when we ask you about some new data points"

Cross-Validation

  • Typically you'll split your incoming dataset into 80% to train the algorithm on, and 20% to test it on
  • This lets us define how accurate a model is: we measure how good the algorithm's predictions are on test data it has never seen before
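
As a rough sketch of what that split looks like in code (plain JavaScript; trainModel and model.predict are hypothetical stand-ins for whatever learning algorithm you're using, not machineJS's actual API):

    // A minimal sketch of an 80/20 split, assuming each row looks like
    // { features: [...], label: ... }
    function shuffle(rows) {
      // Fisher-Yates shuffle on a copy, so we don't reorder the caller's data
      const copy = [...rows];
      for (let i = copy.length - 1; i > 0; i--) {
        const j = Math.floor(Math.random() * (i + 1));
        [copy[i], copy[j]] = [copy[j], copy[i]];
      }
      return copy;
    }

    function trainTestSplit(rows, trainFraction = 0.8) {
      // Shuffle first so the split isn't biased by the file's ordering
      const shuffled = shuffle(rows);
      const cutoff = Math.floor(shuffled.length * trainFraction);
      return { train: shuffled.slice(0, cutoff), test: shuffled.slice(cutoff) };
    }

    function accuracy(model, rows) {
      // The fraction of rows where the model's prediction matches the true label
      const correct = rows.filter(row => model.predict(row.features) === row.label);
      return correct.length / rows.length;
    }

    // const { train, test } = trainTestSplit(allRows);
    // const model = trainModel(train);    // learn from the 80% only
    // console.log(accuracy(model, test)); // grade on the unseen 20%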

Cross-Validation

  • machineJS relies heavily on cross-validation to define what's "right"

Tuning

  • There are all kinds of shapes and sizes a particular machine learning algorithm can take on

image credit: mathworks.com

How Deep?

How Many Ways to Split?


Right now these are all splitting two different ways

How Many Trees?

Tuning

  • In order to pick the best possible "shape" for a given machine learning algorithm, we basically just try a bunch of different options
  • This is a tedious process: just try a bunch of different parameters, and see how each one impacts the accuracy we observe at the end (using cross-validation!)
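
Here's a minimal sketch of that trial-and-error loop, reusing the hypothetical trainModel and accuracy helpers from the cross-validation sketch (the parameter names are just examples for a tree-based model, not real machineJS settings):

    // Try every combination of a few hyperparameters, keep whichever scores best
    const grid = {
      maxDepth: [3, 5, 10, 20], // how deep?
      numTrees: [10, 50, 100],  // how many trees?
    };

    function tune(train, test) {
      let best = { score: -Infinity, params: null };
      for (const maxDepth of grid.maxDepth) {
        for (const numTrees of grid.numTrees) {
          const params = { maxDepth, numTrees };
          const model = trainModel(train, params); // assumed to accept params
          const score = accuracy(model, test);     // cross-validation defines "best"
          if (score > best.score) best = { score, params };
        }
      }
      return best; // e.g. { score: 0.87, params: { maxDepth: 10, numTrees: 100 } }
    }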

Tuning

  • machineJS does all this for you!

Selecting the Best Algorithm

  • Depending on what you're trying to do, there are a half dozen to a dozen algorithms that might be really useful for your problem
  • Which algorithm do you pick? And then, once you've selected which algorithm to use, which set of parameters do you choose to tune that algorithm with?

Selecting the Best Algorithm

  • This, again, is a tedious process
  • Just try a bunch of things and see which one works best
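
In sketch form, it's the same try-and-score loop as tuning, just one level up. Each candidate below is a hypothetical training function for a whole algorithm family (nothing machineJS actually exposes under these names):

    const candidates = {
      randomForest:       train => trainRandomForest(train),
      logisticRegression: train => trainLogisticRegression(train),
      gradientBoosting:   train => trainGradientBoosting(train),
    };

    function selectBest(train, test) {
      let best = { name: null, score: -Infinity, model: null };
      for (const [name, trainFn] of Object.entries(candidates)) {
        const model = trainFn(train);
        const score = accuracy(model, test); // cross-validation again!
        if (score > best.score) best = { name, score, model };
      }
      return best; // the winning algorithm, its model, and its accuracy
    }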

Selecting the Best Algorithm

  • machineJS does all this for you as well!

Ensembling

  • Ok, so in the previous stage, we trained up dozens or hundreds of machine learning algorithms to find the best ones
  • We've probably trained up quite a few that are useful. Now we need to figure out how to put all of these predictions together!

Ensembling

  • One basic thing we could do is to average together the predictions of the 5 best algorithms
  • Another thing we could do is to pick the two algorithms that are least alike, and average together their results
    • Or, we could pick the highest value of all the predicted results
    • Or the lowest
    • Or the average ignoring the extremes
    • Or the most extreme value
    • Or the...
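
Each of those simple combiners is only a line or two of code. A quick sketch, where preds is an array holding each model's prediction for the same data point:

    const mean    = preds => preds.reduce((a, b) => a + b, 0) / preds.length;
    const highest = preds => Math.max(...preds);
    const lowest  = preds => Math.min(...preds);

    // The average after throwing out the single highest and lowest predictions
    // (assumes at least three predictions)
    function trimmedMean(preds) {
      const sorted = [...preds].sort((a, b) => a - b);
      return mean(sorted.slice(1, -1));
    }

    // mean([0.88, 0.92, 0.95])              -> 0.9166...
    // trimmedMean([0.10, 0.90, 0.92, 0.95]) -> 0.91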

Ensembling

  • This is getting tough. 
  • But luckily, we have figured out some great machine learning algorithms that can take in a huge number of data points and make sense of them all
  • Sooo, let's feed the results of our earlier predictions into another round of machine learning to get our final results!

Ensembling

  • The logistics of doing this are a bit more complicated, but the idea is pretty simple:
    • We trained up a bunch of machine learning algorithms to find the best one
    • Instead of just taking the best one, we think we can use the results of several to be more effective!
    • The way we pick which results to use is... you guessed it, machine learning!
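
A minimal sketch of that last idea, often called "stacking" (baseModels and trainModel are the same hypothetical stand-ins as before, not machineJS's real internals):

    function stackFeatures(baseModels, rows) {
      // Each base model's prediction for a row becomes one feature of a new row
      return rows.map(row => ({
        features: baseModels.map(m => m.predict(row.features)),
        label: row.label,
      }));
    }

    // Train a second-stage model on the base models' predictions:
    // const metaModel = trainModel(stackFeatures(baseModels, trainRows));
    // To predict for a new row, run it through every base model first:
    // metaModel.predict(baseModels.map(m => m.predict(newRow.features)));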

Ensembling

  • As you might have guessed, machineJS does this for you too :)

machineJS

  • Automates the entire process
    • Uses cross-validation to:
      • Find the right "shape" for each algorithm
      • Find the right algorithm
      • Ensemble together the results from various algorithms
  • Freeing you, the machine learning engineer, to focus on the interesting parts (feature engineering, figuring out how to put this into practice and make it useful for the business, creative ensembling, etc.)

Thanks!

You can find machineJS at

https://github.com/ClimbsRocks/machineJS
