Learning Data Science
Lecture 8
Machine Learning
SQL and SQL Databases
SQL = Structured Query Language
What is it?
A programming language for managing data in a relational database.

Relational Databases
grades
| Student | Trig | Alg | Geom | Calc |
|---|---|---|---|---|
| Susie | 1.3 | 1.3 | 3.7 | 2.3 |
| Jay | 4.0 | 2.0 | 2.3 | |
| Lara | 1.3 | 1.0 | 3.0 | |

students
| First | Last | ID | Uni | Cats |
|---|---|---|---|---|
| Susie | Jones | 45 | TUM | 0 |
| Jay | Sun | 48 | LMU | 6 |
| Lara | Blue | 66 | LMU | 1 |
unis
| Uni | City | Students | Courses |
|---|---|---|---|
| LMU | Munich | [48, 66] | [Trig, Alg] |
| TUM | Munich | [45] | [Geom, Calc] |
| Ulm | Ulm | [] | [Trig, Alg, Calc] |
courses
Geo
Alg
Trig
| ID | Prof ID | Students |
|---|---|---|
| 1 | 44 | [45, 48, 66] |
| 2 | 154 | [45, 66] |
| 3 | 22 | [45, 48] |
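A minimal sketch of querying the students table above with SQL from Python, using the standard-library `sqlite3` module and an in-memory database. The table and column names (`students`, `last`, `id`, `uni`, `cats`) are assumptions derived from the table headers, and the query is only an illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (last TEXT, id INTEGER PRIMARY KEY, uni TEXT, cats INTEGER)"
)
conn.executemany(
    "INSERT INTO students (last, id, uni, cats) VALUES (?, ?, ?, ?)",
    [("Jones", 45, "TUM", 0), ("Sun", 48, "LMU", 6), ("Blue", 66, "LMU", 1)],
)

# Total number of cats per university, sorted by university name.
cats_per_uni = conn.execute(
    "SELECT uni, SUM(cats) FROM students GROUP BY uni ORDER BY uni"
).fetchall()
conn.close()
print(cats_per_uni)
```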

Movie rental store 🍿
APIs
Application Programming Interface
Think of APIs like a waiter at a restaurant:
- You get a list of things you can do
- You request them from an API
- Then the information you want gets delivered back




Monte Carlo (MC) Methods
Use random sampling to estimate a very complicated probability distribution
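A minimal sketch of the idea, assuming a toy problem chosen only for illustration: estimating the probability that two fair dice sum to at least 10. The function name and sample count are hypothetical; the exact answer (6/36) is known here, so the estimate can be checked.

```python
import random

def estimate_prob_sum_at_least(threshold, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(sum of two fair dice >= threshold)."""
    rng = random.Random(seed)  # seeded for reproducibility
    hits = 0
    for _ in range(n_samples):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll >= threshold:
            hits += 1
    return hits / n_samples

estimate = estimate_prob_sum_at_least(10)
exact = 6 / 36  # (4,6), (5,5), (6,4), (5,6), (6,5), (6,6)
print(f"estimate={estimate:.4f}, exact={exact:.4f}")
```

More samples shrink the random error, roughly in proportion to one over the square root of the sample count.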
SciPy
Tons of tools and highly optimized algorithms for doing math and science
- scipy.constants
- scipy.stats
- scipy.integrate
- scipy.interpolate
- scipy.optimize
- scipy.fft
Lecture 8
- Recap
- What is Machine Learning?
- ML Landscape
- Hands on with scikit-learn
Goals for today
- Understand the basic principles of ML
- Intuitive understanding of major models
- Picking the right ML model for a task
Tutorial: code up our own ML model
What is ML?
Machine Learning:
- statistical algorithms
- learn from data
- generalize to unseen data
Inputs → f → Outputs
We assume that everything has an underlying function, no matter how complex
Is it a cat or a dog?
What is the price of the house?
How many classes of galaxies are there?






How can you help a student pick which degrees would be most interesting for them?
| Student | Trig | Alg | Music | Hist |
|---|---|---|---|---|
| Susie | 1.3 | 1.3 | 3.7 | 2.3 |
| Jay | 4.0 | 2.0 | 2.3 | |
| Lara | 1.3 | 1.0 | 3.0 | |

| Student | Trig | Alg | Music | Hist |
|---|---|---|---|---|
| Ana | 1.3 | 1.3 | 3.7 | 2.3 |
| Juli | 4.0 | 2.0 | 2.3 | |
| Max | 1.3 | 1.0 | 3.0 | |
In most cases, we will never be able to determine the exact function
ML: algorithms that approximate these complex, non-linear functions as well as possible, without manual intervention
There are hundreds of algorithms to choose from!
ML Algorithms
- Supervised
- Unsupervised
Supervised
Supervised learning in a nutshell
Inputs (known) → f → Outputs (known)
Use labeled data to gradually push our approximation of the unknown function towards the correct outputs
- Gather some data
- Feed your data to the algorithm
- Let the algorithm guess the output
- Compare guess to correct answers
- Adjust algorithm towards a better guess
- Repeat
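The steps above can be sketched with the simplest possible model, y = w * x, trained by gradient descent. The data, learning rate, and number of steps are assumptions picked for illustration; real algorithms differ in how they guess and how they adjust.

```python
# Labeled data: inputs x with known outputs y (here generated by y = 3x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0  # initial guess for the model y = w * x
learning_rate = 0.01

for step in range(500):
    # Guess the outputs and compare them with the correct answers.
    errors = [w * x - y for x, y in zip(xs, ys)]
    # Gradient of the mean squared error with respect to w.
    grad = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    # Adjust the parameter towards a better guess, then repeat.
    w -= learning_rate * grad

print(round(w, 3))
```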

Supervised
- Classification: output is one of a limited set of categories
- Regression: output is somewhere along a number line
Unsupervised
Unsupervised learning in a nutshell
Inputs (known) → f → Outputs (unknown)
Use unlabeled data and allow the algorithm to find patterns on its own
Unsupervised
- Clustering
- Dimensionality reduction
Clustering
Feed it tons of galaxy images



Too many to label by hand!
Why not let an algorithm figure out how many classes there are?
Unsupervised
Dimensionality reduction
Reducing the dimensionality of your data
Unsupervised
| Student | Trig | Alg | Music | Hist |
|---|---|---|---|---|
| Susie | 1.3 | 1.3 | 3.7 | 2.3 |
| Jay | 4.0 | 2.0 | 2.3 | |
| Lara | 1.3 | 1.0 | 3.0 | |

| Student | Trig | Alg | Music | Hist |
|---|---|---|---|---|
| Ana | 1.3 | 1.3 | 3.7 | 2.3 |
| Juli | 4.0 | 2.0 | 2.3 | |
| Max | 1.3 | 1.0 | 3.0 | |
How can you help a student pick which degree would be most interesting for them?
Now imagine you had 100 grades to look at per student
| Student | Math | Humanities |
|---|---|---|
| Susie | 1.3 | 1.2 |
| Jay | 3.0 | 2.3 |
| Lara | 1.1 | 2.0 |
Let the algorithm find 'hidden axes' in your data
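A minimal sketch of finding one such hidden axis, assuming toy data with two correlated features per student (the values are hypothetical). It uses the closed-form principal axis of a 2x2 covariance matrix, which is principal component analysis (PCA) in its smallest form; the slides do not say which method they have in mind.

```python
import math

# Toy data: two correlated features per student (hypothetical values).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 covariance matrix.
sxx = sum(x * x for x, _ in centered) / n
syy = sum(y * y for _, y in centered) / n
sxy = sum(x * y for x, y in centered) / n

# Direction of largest variance (first principal axis).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
axis = (math.cos(theta), math.sin(theta))

# One coordinate per point along the hidden axis replaces the original two.
reduced = [x * axis[0] + y * axis[1] for x, y in centered]
explained = (sum(r * r for r in reduced) / n) / (sxx + syy)
print(axis, f"variance kept: {explained:.3f}")
```

Because the two features move together, a single number per student keeps almost all of the variation in the data.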
ML Algorithms
- Supervised: Classification, Regression
- Unsupervised: Clustering, Dimensionality reduction
Just some of the algorithms that are out there:
What we have time for today:
- Linear regression
- Logistic regression
- k-nearest neighbors
- Decision trees
- k-means clustering
- Random forests
The standard pipeline
- Gather your data
- Exploratory Data Analysis
- Cleaning and feature scaling
- Train/Test splitting
- Train model
- Evaluate model
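A minimal sketch of the pipeline in scikit-learn, assuming the bundled iris dataset and a logistic regression model as stand-ins (neither choice comes from the slides). Exploratory analysis and cleaning are skipped because this dataset is already tidy.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Gather data: a small labeled dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Train/test split: hold out 25% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature scaling and model chained together; the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate: mean accuracy on data the model has never seen.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Splitting before scaling matters: fitting the scaler on all the data would leak information about the test set into training.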

Linear Regression

Usually has an analytical solution!
Supervised + Regression
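A minimal sketch of that analytical solution for a single input feature: the least-squares slope is the covariance of x and y divided by the variance of x. The function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```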
Logistic Regression
Variant of linear regression for binary classification

Supervised + Classification
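A minimal sketch of the idea: a linear model whose output is squashed into a probability by the sigmoid function, then thresholded at 0.5. The weight and bias are hand-picked assumptions; in practice they are learned from labeled data.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weight, bias):
    """Probability of class 1: a linear model passed through the sigmoid."""
    return sigmoid(weight * x + bias)

# Hypothetical, hand-picked parameters (not fitted to any data).
weight, bias = 2.0, -4.0

for x in [0.0, 2.0, 4.0]:
    p = predict_proba(x, weight, bias)
    label = 1 if p >= 0.5 else 0
    print(f"x={x}: p={p:.3f}, predicted class={label}")
```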
k-nearest neighbors
Supervised + Classification

k-nearest neighbors
Supervised + Regression
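A minimal sketch of both variants in plain Python: classification takes a majority vote among the k nearest training points, regression averages their targets. Function names and data are hypothetical, and Euclidean distance is an assumption.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the targets of the k training points closest to `query`."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    return [target for _, target in by_distance[:k]]

def knn_classify(train, query, k):
    """Classification: majority vote among the k nearest neighbors."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Regression: average target of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(neighbors) / len(neighbors)

# Hypothetical 2D points with class labels.
labeled = [((1, 1), "cat"), ((1, 2), "cat"), ((2, 1), "cat"),
           ((6, 6), "dog"), ((6, 7), "dog"), ((7, 6), "dog")]
print(knn_classify(labeled, (2, 2), k=3))

# Hypothetical 1D points with numeric targets.
numeric = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]
print(knn_regress(numeric, (2,), k=3))
```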

Hyperparameters: choosing k
k is a so-called "hyperparameter"
A hyperparameter is a parameter you choose before training
Optimization of hyperparameters is an art!
⚠️ Overfitting and Underfitting

- Overfitting: learning hyper-specific patterns that won't recur in the future
- Underfitting: failing to capture important patterns
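A minimal sketch of overfitting with k-nearest neighbors, assuming a deliberately contrived 1D dataset in which two training labels are noisy. Comparing accuracy on the training data with accuracy on held-out data shows how the choice of k changes what the model memorizes.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1D inputs)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of points in `data` whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == label for x, label in data) / len(data)

# Contrived data: the true rule is "label 1 if x > 5", but two training labels are noisy.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
         (6.5, 1), (7.5, 0), (8.5, 1), (9.5, 1)]
test = [(2.6, 0), (3.4, 0), (7.1, 1), (7.9, 1)]

for k in (1, 3):
    print(f"k={k}: train={accuracy(train, train, k):.2f}, test={accuracy(train, test, k):.2f}")
```

With k = 1 the model reproduces every training label, noise included, yet misses the test points that sit next to the noisy labels; k = 3 smooths the noise out.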
Decision Trees
Supervised + Regression/Classification
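A minimal sketch of the building block of a decision tree: a single split on one feature. This simplified version picks the threshold with the fewest misclassified points; real implementations usually score splits with measures such as Gini impurity and apply them recursively. Names and data are hypothetical.

```python
def best_split(xs, labels):
    """Find the threshold on a single feature that misclassifies the fewest points.

    Points with x <= threshold are predicted as class 0, the rest as class 1.
    """
    candidates = sorted(set(xs))
    best_threshold, best_errors = None, len(xs) + 1
    # Try a threshold halfway between each pair of neighboring values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        errors = sum((x > threshold) != bool(label) for x, label in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

# Hypothetical feature (e.g. weight in kg) and binary labels.
weights = [2.0, 3.0, 4.0, 10.0, 12.0, 15.0]
is_dog = [0, 0, 0, 1, 1, 1]
threshold, errors = best_split(weights, is_dog)
print(threshold, errors)
```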


k-means clustering

Unsupervised + Clustering
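A minimal sketch of k-means (Lloyd's algorithm) in plain Python: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The starting centroids, iteration count, and data are assumptions; library implementations choose starting points more carefully.

```python
import math

def k_means(points, centroids, n_iter=10):
    """Plain k-means on 2D points, starting from the given centroids."""
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```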
Random Forests
Supervised + Regression/Classification
- Among the most powerful and widely used algorithms
- Many variations that increase their power
- Generally a great choice for good performance on tabular data
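A minimal sketch of a random forest on tabular data with scikit-learn, assuming the bundled wine dataset and default-like settings as stand-ins (neither comes from the slides).

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small tabular dataset bundled with scikit-learn (13 numeric features, 3 classes).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# An ensemble of 100 decision trees, each trained on a random resample of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Tree-based models do not need feature scaling, which is one reason they are convenient for tabular data.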
To the notebook!
The End
Learning Data Science Lecture 8
By astrojarred