Learning Data Science

Lecture 8
Machine Learning

SQL and SQL Databases

SQL = Structured Query Language

What is it?

 

A programming language for managing data in a relational database.

Relational Databases

grades

Name Trig Alg Geom Calc
Susie 1.3 1.3 3.7 2.3
Jay 4.0 2.0 2.3
Lara 1.3 1.0 3.0

students

Name Last ID Uni Cats
Susie Jones 45 TUM 0
Jay Sun 48 LMU 6
Lara Blue 66 LMU 1

unis

Uni City Students Courses
LMU Munich [48, 66] [Trig, Alg]
TUM Munich [45] [Geom, Calc]
Ulm Ulm [] [Trig, Alg, Calc]
courses

Geo

Alg

Trig

ID Prof ID Students
1 44 [45, 48, 66]
2 154 [45, 66]
3 22 [45, 48]
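A minimal sketch of querying such a table with SQL, using Python's built-in sqlite3 module and an in-memory database. The row values come from the students table; the column names (first_name, last_name, ...) and the pairing of first names with rows in listed order are assumptions made for illustration.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "first_name TEXT, last_name TEXT, id INTEGER PRIMARY KEY, uni TEXT, cats INTEGER)"
)
rows = [
    ("Susie", "Jones", 45, "TUM", 0),
    ("Jay", "Sun", 48, "LMU", 6),
    ("Lara", "Blue", 66, "LMU", 1),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# Which LMU students own at least one cat?
lmu_cat_owners = conn.execute(
    "SELECT first_name, cats FROM students WHERE uni = ? AND cats > 0 ORDER BY id",
    ("LMU",),
).fetchall()
print(lmu_cat_owners)
conn.close()
```

The `?` placeholders let sqlite3 fill in values safely instead of pasting them into the query string.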

Movie rental store 🍿

APIs

Application Programming Interface

Think of APIs like a waiter at a restaurant:

  • You get a list of things you can do
  • You request them from an API
  • Then the information you want gets delivered back

Monte Carlo (MC) Methods

Use random sampling to estimate a very complicated probability distribution
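As a rough illustration of the random-sampling idea, here is a simpler textbook stand-in (not a complicated probability distribution): estimating π by throwing random points into the unit square and counting how many land inside the quarter circle. The function name and sample count are choices made for this sketch.

```python
import math
import random

def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi from the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / n_samples

pi_estimate = estimate_pi(100_000)
print(pi_estimate, abs(pi_estimate - math.pi))
```

More samples shrink the error, but only slowly: the typical error falls off like 1/sqrt(n_samples).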

SciPy

Tons of tools and highly optimized algorithms for doing math and science

  • scipy.constants
  • scipy.stats
  • scipy.integrate
  • scipy.interpolate
  • scipy.optimize
  • scipy.fft
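A small taste of three of these submodules, assuming SciPy is installed. The integrand and the function being minimized are arbitrary examples picked for this sketch.

```python
import math
from scipy import constants, integrate, optimize

# scipy.constants: physical constants, here the speed of light in m/s
print(constants.c)

# scipy.integrate: numerically integrate sin(x) from 0 to pi (the exact answer is 2)
area, error_estimate = integrate.quad(math.sin, 0.0, math.pi)
print(area)

# scipy.optimize: find the x that minimizes (x - 3)^2 + 1 (the minimum is at x = 3)
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)
print(result.x)
```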

Lecture 8

  1. Recap
  2. What is Machine Learning?
  3. ML Landscape
  4. Hands on with scikit-learn

Goals for today

  • Understand the basic principles of ML
  • Intuitive understanding of major models
  • Picking the right ML model for a task

Tutorial: code up our own ML model

What is ML?

Machine Learning:

  • statistical algorithms 
  • learn from data
  • generalize to unseen data

What is ML?

Inputs → f → Outputs

We assume that everything has an underlying function, no matter how complex

Is it a cat or a dog?

What is the price of the house?

How many classes of galaxies are there?

How can you help a student pick which degrees would be most interesting for them?

Name Trig Alg Music Hist
Susie 1.3 1.3 3.7 2.3
Jay 4.0 2.0 2.3
Lara 1.3 1.0 3.0

Name Trig Alg Music Hist
Ana 1.3 1.3 3.7 2.3
Juli 4.0 2.0 2.3
Max 1.3 1.0 3.0

In most cases, we will never be able to determine the exact function

ML: algorithms that approximate these complex, non-linear functions as well as possible, without manual intervention

There are hundreds of algorithms to choose from!

ML Algorithms

  • Supervised
  • Unsupervised

Supervised

Supervised learning in a nutshell

Inputs (known) → f → Outputs (known)

Use labeled data to slowly push the unknown function towards correctness

  1. Gather some data
  2. Feed your data to the algorithm 
  3. Let the algorithm guess the output
  4. Compare guess to correct answers
  5. Adjust algorithm towards a better guess
  6. Repeat
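The six steps above can be sketched as a tiny training loop. This is only an illustration: the data is made up (generated from y = 2x + 1), the "algorithm" is a straight line with two adjustable numbers, and the adjustment rule (gradient descent on the mean squared error) is one common choice among many.

```python
# Step 1: gather labeled data (hypothetical, generated from y = 2x + 1)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0        # the model's current guess for slope and intercept
learning_rate = 0.05   # how big each adjustment is

for _ in range(2000):                               # Step 6: repeat
    guesses = [w * x + b for x in xs]               # Steps 2-3: feed data in, let the model guess
    errors = [g - y for g, y in zip(guesses, ys)]   # Step 4: compare guesses to correct answers
    # Step 5: nudge w and b in the direction that reduces the mean squared error
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

After enough repetitions the guesses for w and b settle very close to the slope 2 and intercept 1 that generated the data.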

Supervised

  • Classification: output is one of a limited set of categories
  • Regression: output is somewhere along a number line

Unsupervised

Unsupervised learning in a nutshell

Inputs (known) → f → Outputs (unknown)

Use unlabeled data and allow the algorithm to find patterns on its own

Unsupervised

  • Clustering
  • Dimensionality reduction

Clustering

Feed it tons of galaxy images

Too many to label by hand!
Why not let an algorithm figure out how many classes there are?

Dimensionality reduction

Reducing the dimensionality of your data

Name Trig Alg Music Hist
Susie 1.3 1.3 3.7 2.3
Jay 4.0 2.0 2.3
Lara 1.3 1.0 3.0

Name Trig Alg Music Hist
Ana 1.3 1.3 3.7 2.3
Juli 4.0 2.0 2.3
Max 1.3 1.0 3.0

How can you help a student pick which degree would be most interesting for them?

Now imagine you had 100 grades to look at per student

Name Math Humanities
Susie 1.3 1.2
Jay 3.0 2.3
Lara 1.1 2.0

Let the algorithm find 'hidden axes' in your data
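One common way to find such hidden axes is principal component analysis (PCA); the slides do not name a specific technique, so PCA is a choice made for this sketch. The grades below are hypothetical, built so that Trig tracks Alg and Music tracks Hist, and the reduction is done with plain NumPy.

```python
import numpy as np

# Hypothetical grades, one row per student; columns: Trig, Alg, Music, Hist
grades = np.array([
    [1.3, 1.3, 3.7, 3.3],
    [4.0, 3.7, 2.0, 2.3],
    [1.3, 1.0, 3.0, 3.0],
    [2.0, 2.3, 1.3, 1.0],
    [3.3, 3.0, 3.7, 4.0],
    [1.0, 1.3, 1.7, 2.0],
])

# Center each column, then use the SVD to find the directions of largest variance
centered = grades - grades.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

explained = S**2 / np.sum(S**2)   # share of the total variance along each axis
reduced = centered @ Vt[:2].T     # each student described by 2 numbers instead of 4

print(reduced.shape)
print(explained[:2].sum())
```

Because the four grades really only vary along two underlying directions here, the first two axes keep nearly all of the variance.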

ML Algorithms

  • Supervised: Classification, Regression
  • Unsupervised: Clustering, Dimensionality reduction

Just some of the algorithms that are out there:

Time for today:

  1. Linear regression
  2. Logistic regression
  3. k-nearest neighbors
  4. Decision trees
  5. k-means clustering
  6. Random forests

The standard pipeline

  1. Gather your data
  2. Exploratory Data Analysis
  3. Cleaning and feature scaling
  4. Train/Test splitting
  5. Train model
  6. Evaluate model
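A compact sketch of these six steps with scikit-learn, assuming it is installed. The built-in iris dataset and the choice of logistic regression are stand-ins picked for this example. Note that the split (step 4) happens before the scaler is fitted (step 3), so that nothing about the test set leaks into training.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Gather your data (a small built-in dataset stands in for real data)
X, y = load_iris(return_X_y=True)

# 2. Exploratory data analysis, reduced here to a quick look at shape and ranges
print("shape:", X.shape, "min:", X.min(axis=0), "max:", X.max(axis=0))

# 4. Train/test split, done before scaling so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 3. Feature scaling, fitted on the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5. Train model
model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

# 6. Evaluate model on data it has never seen
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print("test accuracy:", accuracy)
```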

Linear Regression

Usually has an analytical solution!
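For ordinary least squares, the analytical solution is the normal equation, β = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch with made-up data points:

```python
import numpy as np

# Hypothetical data, roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix: a column of ones (for the intercept) next to x
X = np.column_stack([np.ones_like(x), x])

# Normal equation, solved as a linear system rather than by explicit inversion
intercept, slope = np.linalg.solve(X.T @ X, X.T @ y)
print(intercept, slope)
```

No iteration is needed: the best-fit line comes out of a single linear solve.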

Supervised + Regression

Logistic Regression

Variant of linear regression for binary classification: the linear output is passed through a sigmoid to give a probability between 0 and 1
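A minimal sketch of the idea in NumPy: a linear score w·x + b is squashed into a probability by the sigmoid function, and w and b are adjusted by gradient descent. The data (hours studied vs. passed or failed), the learning rate, and the iteration count are all invented for this example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0)
hours = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w, b = 0.0, 0.0
learning_rate = 0.3
for _ in range(20000):
    p = sigmoid(w * hours + b)                  # predicted probability of passing
    # Gradient of the mean log loss with respect to w and b
    w -= learning_rate * np.mean((p - passed) * hours)
    b -= learning_rate * np.mean(p - passed)

# Turn probabilities into class labels with a 0.5 threshold
predictions = (sigmoid(w * hours + b) >= 0.5).astype(int)
print(predictions)
```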

Supervised + Classification

k-nearest neighbors

Supervised + Regression/Classification

Hyperparameters: choosing k

k is a so-called "hyperparameter"

A hyperparameter is a parameter you choose before training, rather than one the algorithm learns from the data

Optimization of hyperparameters is an art!
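A minimal sketch of a k-nearest-neighbors classifier in plain Python, with k passed in as an argument to make its role as a hyperparameter explicit. The points, labels, and function name are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the most common label among the k training points closest to `query`."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two classes
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

for k in (1, 3, 5):  # k is chosen by us, not learned from the data
    print(k, knn_predict(points, labels, (2.5, 2.5), k))
```

There is no training step at all: the "model" is just the stored data plus the choice of k.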

⚠️ Overfitting and Underfitting

  • Overfitting: learning patterns so specific to the training data that they won't recur in new data
  • Underfitting: failing to capture important patterns in the data
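Both failure modes can be seen in a toy 1-D k-nearest-neighbors example, sketched below with invented data. One training label (at x = 3) is deliberately wrong. With k = 1 the model memorizes that noisy point (overfitting); with k = 8 it ignores location entirely and always predicts the majority class (underfitting).

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (1-D)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 1, 1, 1]   # the label at x = 3 is deliberate noise
test_x = [1.5, 2.9, 3.4, 6.5, 8.5]
test_y = [0, 0, 0, 1, 1]             # correct labels for unseen points

accuracies = {}
for k in (1, 3, 8):
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y))
    accuracies[k] = hits / len(test_x)

print(accuracies)  # {1: 0.6, 3: 1.0, 8: 0.4}
```

The middle value k = 3 smooths over the noisy point while still respecting the local structure of the data.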

Decision Trees

Supervised + Regression/Classification
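A minimal sketch with scikit-learn, assuming it is installed. The toy animal data and feature names are invented for this example. A decision tree learns a sequence of yes/no questions about the features, and `export_text` prints those questions.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [weight_kg, ear_length_cm] -> animal
X = [[4.0, 6.0], [5.0, 7.0], [3.5, 6.5], [20.0, 10.0], [30.0, 12.0], [25.0, 11.0]]
y = ["cat", "cat", "cat", "dog", "dog", "dog"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Show the learned questions as indented text
print(export_text(tree, feature_names=["weight_kg", "ear_length_cm"]))

prediction = tree.predict([[4.5, 6.2]])[0]
print(prediction)
```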

k-means clustering

Unsupervised + Clustering
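A minimal NumPy sketch of the k-means idea: repeatedly assign each point to its nearest cluster center, then move each center to the mean of its points. The data (two well-separated blobs), the function name, and the fixed iteration count are assumptions made for this example. No labels are used anywhere.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Very small k-means: returns a cluster index per point and the cluster centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Hypothetical unlabeled data: two blobs
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5],
])
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Like k in k-nearest neighbors, the number of clusters k is a hyperparameter that has to be chosen up front.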

Random Forests

Supervised + Regression/Classification

  • Among the most powerful and widely used algorithms
  • Many variations that increase their power
  • Generally a great choice for good performance on tabular data
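A short sketch of a random forest on tabular data with scikit-learn, assuming it is installed. The built-in wine dataset and the settings (100 trees, a 25% test split) are choices made for this example. A random forest trains many decision trees on random subsets of the data and features, then lets them vote.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small built-in tabular dataset: 13 numeric features, 3 classes
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# n_estimators is the number of trees in the forest (a hyperparameter)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
print("test accuracy:", forest_accuracy)
```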

To the notebook!

The End

Learning Data Science Lecture 8

By astrojarred
