Explaining machineJS To My Parents Part 2:

Data Formatting

Today's Tour

  • Why do we need this at all?
  • Data Cleaning
    • Missing values
    • Long to wide
  • Feature Engineering
    • Dates into day of week
  • Feature selection

Why do we need this?

  • In short:
    • Machine learning algorithms expect data to be in a particular format, which is not the same format that we use for storing or transporting data
    • Data is messy and needs to be cleaned
    • We need to make sense of the raw data for the computer

Why do we need this?

  • Basic example
  • How our data is stored:

age    climbs    state
28     yes       CA
26     no        NC
23     yes       WA

  • How the computer expects to see our data:

[28, 1, 1, 0, 0]
[26, 0, 0, 1, 0]
[23, 1, 0, 0, 1]

In this example, the first position is for age, the next is a 0 or 1 value for whether this person rock climbs or not, and the remaining three positions are 0 or 1 values for state=CA, state=NC, and state=WA.
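For the curious, here is a minimal Python sketch of that conversion. It is only an illustration, not machineJS's actual code: the `rows` data and the `to_vectors` helper are made up for this example, and it assumes "climbs" is always "yes" or "no".

```python
# Hypothetical rows matching the small table in this example.
rows = [
    {"age": 28, "climbs": "yes", "state": "CA"},
    {"age": 26, "climbs": "no", "state": "NC"},
    {"age": 23, "climbs": "yes", "state": "WA"},
]


def to_vectors(rows):
    """Turn each row into [age, climbs, one 0/1 flag per state]."""
    # One 0/1 column per distinct state, in alphabetical order.
    states = sorted({row["state"] for row in rows})
    vectors = []
    for row in rows:
        climbs = 1 if row["climbs"] == "yes" else 0
        state_flags = [1 if row["state"] == s else 0 for s in states]
        vectors.append([row["age"], climbs] + state_flags)
    return vectors


vectors = to_vectors(rows)
for vector in vectors:
    print(vector)
```

Each state gets its own 0 or 1 column because the algorithm only understands numbers, and "CA" is not bigger or smaller than "NC".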

Data Cleaning- Missing Values

  • Data is messy: it is frequently filled with missing or wildly inaccurate values
  • Cleaning it up takes way more time for analysts and data scientists than you'd expect

Data Cleaning- Missing Values

  • Why do we have missing values?
  • When you sign up for a rewards card at a grocery store, they typically have plenty of optional fields, like work phone, or occupation, or favorite color of loofah
  • If I'm diligent and tell them that orange is my favorite loofah color, and you don't tell them anything, that shows up as a missing value!
  • Or maybe you totally would tell them your favorite loofah color, but they just introduced that question last week, and you signed up months ago
  • Or maybe we're taking data from a bunch of different grocery stores, including one that doesn't sell loofahs (the horror!)

Data Cleaning- Missing Values

  • So what can we do?
  • The obvious solution is to fill in each missing value with the average value for that column
    • If the average number of children is 2, then for any rows missing that value, we will assume it should be 2
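Filling in with the average can be sketched in a few lines of Python. This is a rough illustration under simple assumptions (missing values show up as `None`, and the column has at least one known value). The `fill_with_mean` helper is a made-up name, not part of machineJS.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the known values.

    Assumes at least one value in the column is known.
    """
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical "number of children" column with two missing answers.
children = [1, None, 3, 2, None]
filled = fill_with_mean(children)
print(filled)
```

Here the known values are 1, 3, and 2, so both missing entries are assumed to be 2.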

Data Cleaning- Missing Values

  • Because we don't know ahead of time what will matter, we'll also keep a tally of how many values each row was missing, as well as a copy of the original data
    • Think of a mortgage application: knowing how many fields were left incomplete seems like it would be useful
    • And we might not want to always assume that a borrower who didn't fill in their income has an average income
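Here is one way the tally and the untouched copy might look in Python. Treat it as a sketch under assumptions: the `applications` data, the `clean_rows` helper, and the `missing_count` column name are all invented for this example, and machineJS may organize this differently.

```python
import copy


def clean_rows(rows, columns):
    """Return (untouched copy of the rows, cleaned rows with a missing tally)."""
    original = copy.deepcopy(rows)  # keep the raw data exactly as it arrived

    # Average of the known values in each column (assumes at least one is known).
    means = {}
    for col in columns:
        known = [row[col] for row in rows if row[col] is not None]
        means[col] = sum(known) / len(known)

    cleaned = []
    for row in rows:
        new_row = {
            col: means[col] if row[col] is None else row[col] for col in columns
        }
        # Tally how many fields this row left blank before we filled them in.
        new_row["missing_count"] = sum(1 for col in columns if row[col] is None)
        cleaned.append(new_row)
    return original, cleaned


# Hypothetical mortgage applications; None marks a field left blank.
applications = [
    {"income": 50000, "children": 2},
    {"income": None, "children": 4},
    {"income": 70000, "children": None},
]
original, cleaned = clean_rows(applications, ["income", "children"])
for row in cleaned:
    print(row)
```

The algorithm then gets to see both the filled-in guess and the fact that the borrower left something blank.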

Feature Engineering

  • Feature engineering means turning the raw data into useful information for the algorithm to use

Feature Engineering

  • Classic example: dates
  • A date on its own is just a bunch of random numbers, like "1-1-2016"
  • The computer won't be able to make sense of this on its own
  • But there's a lot of information in that data that could be really useful!
    • Day of week
    • Month of the year
    • Whether it's a holiday or not
    • Length of time since the first date in our data set
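As a sketch, pulling those pieces out of a date could look like this in Python. A few things here are assumptions for the example: dates are written month-day-year like "1-1-2016", the tiny `HOLIDAYS` set is hand-written rather than a real holiday calendar, and `date_features` is a made-up helper name.

```python
from datetime import datetime

# Hypothetical, hand-written holiday list for this example only.
HOLIDAYS = {"1-1-2016", "7-4-2016", "12-25-2016"}


def parse(text):
    """Parse a month-day-year string such as "1-1-2016" into a date."""
    return datetime.strptime(text, "%m-%d-%Y").date()


def date_features(text, first_day):
    """Turn one raw date string into numbers an algorithm can use."""
    day = parse(text)
    return {
        "day_of_week": day.weekday(),  # Monday is 0, Sunday is 6
        "month": day.month,
        "is_holiday": 1 if text in HOLIDAYS else 0,
        "days_since_start": (day - first_day).days,
    }


raw_dates = ["1-1-2016", "1-4-2016", "2-15-2016"]
first_day = min(parse(text) for text in raw_dates)
features = [date_features(text, first_day) for text in raw_dates]
for row in features:
    print(row)
```

January 1, 2016 was a Friday, so its `day_of_week` comes out as 4, and February 15 lands 45 days after the first date.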

Feature Engineering

  • The great part about this is that since the computer is really good at figuring out what's important and what's not, we can give the computer everything we think might be useful, and let it figure out what actually is or is not!

Feature Selection

  • Now that we've taken the raw data, and performed feature engineering, let's figure out which pieces of information are actually useful
  • Turns out, computers are really good at this too

Feature Selection

  • Why not just give the computer everything? Why do we have to prune at all?
    • Much faster computing times
    • Less noise
    • Tells you, the user, what matters and what doesn't, which you can use when making decisions
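To make the pruning idea concrete, here is a deliberately simple Python sketch: score each column by how strongly it moves with the thing we are trying to predict, and keep only the strong ones. This correlation filter is just one easy-to-read approach chosen for illustration, not necessarily how machineJS picks features. The data, the `select_features` helper, and the 0.5 cutoff are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a column that never changes tells us nothing
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.5):
    """Keep columns whose absolute correlation with the target meets the cutoff."""
    scores = {
        name: abs(pearson(values, target)) for name, values in columns.items()
    }
    kept = [name for name, score in scores.items() if score >= threshold]
    return kept, scores


# Hypothetical data: does this person buy climbing shoes?
columns = {
    "age": [28, 26, 23, 35, 31],
    "climbs": [1, 0, 1, 0, 1],
    "constant": [1, 1, 1, 1, 1],
}
target = [1, 0, 1, 0, 1]
kept, scores = select_features(columns, target)
print(kept)
print(scores)
```

The scores double as the "what matters and what doesn't" report for the user: here only "climbs" survives the cutoff.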

machineJS Does All This!

  • This is a super repetitive process
  • machineJS does all this for you
  • NOTE: machineJS performs some of the rote feature engineering for you, with the explicit goal of freeing up more of your time to perform the more advanced feature engineering yourself, based on your expertise or intuition for this particular data set

Thanks!

You can find machineJS at:

github.com/ClimbsRocks/machineJS

Explaining machineJS To My Parents Part 2: Data Formatting

By Preston Parry

Or, "why can't we just get into the fun part yet? Why do we have to mess around with the data beforehand?"
