what is machine learning


machine learning best practices


issues in data ethics


epistemic transparency


where does the bias enter models


what is machine learning?



what is a model?

the best way to think about it in the ML context:

a model is a low-dimensional representation of a higher-dimensional dataset

what is machine learning?

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959


what is machine learning?


parameters: slope, intercept




what is machine learning?

ML: any model with parameters learned from the data

Machine Learning models are parametrized representations of "reality", where the parameters are learned from finite sets of realizations of that reality

(note: learning by instance, e.g. nearest neighbours, may not comply with this definition)

Machine Learning is the discipline that conceptualizes, studies, and applies those models.
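As a minimal sketch of this definition, the snippet below learns the two parameters of a line (slope and intercept) from a finite, noisy sample. The synthetic data, the "true" values 2.0 and 1.0, and the use of `numpy.polyfit` are assumptions made for illustration; they are not taken from the lecture.

```python
import numpy as np

# Synthetic "reality": a line with slope 2 and intercept 1 (assumed values).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Learn the parameters from the finite sample by least squares.
# polyfit returns coefficients from the highest degree down: (slope, intercept).
slope, intercept = np.polyfit(x, y, deg=1)

print(f"learned slope {slope:.2f}, intercept {intercept:.2f}")
```

The 100 datapoints are summarized by two learned numbers, which is the sense in which the model is a parametrized representation of the data.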

Key Concept

what is machine learning?


used to:

  • understand the structure of the feature space
  • classify based on examples
  • regression (classification with infinitely small classes)
  • understand which features are important in prediction (to get close to causality)

General ML points

unsupervised vs supervised learning


Unsupervised learning

  • understanding structure
  • anomaly detection
  • dimensionality reduction

All features are observed for all datapoints, and we are looking for structure in the feature space.

Supervised learning

Some features are not observed for some datapoints, and we want to predict them.

The datapoints for which the target feature is observed are said to be "labeled".

Semi-supervised learning

A small amount of labeled data is available. The data are clustered, and the clusters inherit the labels.

Active learning

The code can interact with the user to update labels.

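The difference between the two settings can be sketched on the same toy dataset. The two synthetic blobs, the choice of `KMeans` and `KNeighborsClassifier`, and all parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two well-separated 2D blobs (synthetic, for illustration only).
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])
labels = np.array([0] * 50 + [1] * 50)

# Unsupervised: no labels are used, we only look for structure in X.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: labels are observed for the training points
# and predicted for new, unlabeled points.
classifier = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
predicted = classifier.predict([[0.2, -0.1], [4.8, 5.3]])

print("cluster ids:", clusters[:3], clusters[-3:])
print("predicted labels:", predicted)
```

The cluster ids are arbitrary (the algorithm never saw the labels), while the classifier returns the label names it was trained on.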

what is machine learning?

supervised learning

extract features and create models that allow prediction where the correct answer is known for a subset of the data

  • k-Nearest Neighbors

  • Regression

  • Support Vector Machines

  • Neural networks

  • Classification/Regression Trees

unsupervised learning

identify features and create models that allow us to understand structure in the data

  • clustering

  • Principal Component Analysis

  • Apriori (association rule)

train, test, and validate



validating a model

How do we measure if a model is good?







We will talk more about this later...

but for now focus on

regression performance metrics



Split the sample into training and test sets

Train on the training set

Test (measure accuracy) on the test set

R^2 = 1 - MSE / Var(y), i.e. one minus the mean squared error relative to the variance of the data
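A minimal sketch of the R^2 metric, using the standard definition 1 - SS_res / SS_tot (equivalently, one minus the mean squared error divided by the variance of the observed values). The function name `r_squared` and the example numbers are placeholders introduced here.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y_true, y_true)                      # predictions match exactly
baseline = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean scores below 0.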

Cross Validation
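Cross validation repeats the train/test split several times, so that every datapoint is held out once. A minimal sketch with scikit-learn follows; the synthetic data, the linear model, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=200)

# Each of the 5 folds is held out once while the model trains on the other 4.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print("mean R^2:", round(float(scores.mean()), 3))
```

The spread of the per-fold scores gives a rough idea of how much the measured performance depends on the particular split.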


validating a model

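import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import train_test_split

# x, y, and s (the data and their uncertainties) are assumed to be
# numpy arrays defined earlier in the lecture.

def line(x, intercept, slope):
    return slope * x + intercept

def chi2(args, x, y, s):
    # args = (intercept, slope); s is used as written in the original,
    # i.e. as the variance of each point (use s**2 if s holds standard deviations)
    a, b = args
    return sum((y - line(x, a, b))**2 / s)

def Rsquare(args, x, y):
    # not defined in this excerpt; assumed here to be the standard
    # coefficient of determination, 1 - SS_res / SS_tot
    a, b = args
    return 1 - np.sum((y - line(x, a, b))**2) / np.sum((y - np.mean(y))**2)

x_train, x_test, y_train, y_test, s_train, s_test = train_test_split(
    x, y, s, test_size=0.25, random_state=42)

initialGuess = (10, 1)

chi2Solution_goodsplit = minimize(chi2, initialGuess,
    args=(x_train, y_train, s_train))

# the solution is ordered like args: (intercept, slope)
print("best fit parameters from the minimization of the chi squared: " +
      "intercept {:.2f}, slope {:.2f}".format(*chi2Solution_goodsplit.x))

print("R square on training set: ", Rsquare(chi2Solution_goodsplit.x, x_train, y_train))
print("R square on test set: ", Rsquare(chi2Solution_goodsplit.x, x_test, y_test))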

ML standard

In ML models need to be "validated":

  1. split the data into a training and a test set (typical split 70/30). 
  2. learn the model parameters by "training" the model on the training set
  3. "test" the model on the test set: measure the accuracy of the prediction (e.g. as the distance between the prediction and the test data).

The performance of the model is the performance achieved on the test set.

An upgrade on this workflow is to create a training, a test, and a validation set. Iterate between training and test to achieve optimal performance, then measure accuracy on the validation set. This is because you can use the test set performance to tune the model hyperparameters (model selection), but then you would report a performance that is tuned on the test set.

A significant performance degradation on the test set compared to the training set indicates that the model is "overtrained" and does not generalize well.
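The three-set workflow can be sketched as follows, using the naming on this slide (tune on the test set, report on the validation set; many references swap the two names). The synthetic data, the 60/20/20 split, and the use of polynomial degree as the hyperparameter are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=300)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)  # synthetic data

# 60/20/20 split into training, test, and validation sets (assumed proportions).
idx = rng.permutation(x.size)
train, test, valid = idx[:180], idx[180:240], idx[240:]

def mse(coeffs, subset):
    """Mean squared error of a polynomial with the given coefficients."""
    return np.mean((y[subset] - np.polyval(coeffs, x[subset])) ** 2)

# Model selection: the polynomial degree is a hyperparameter tuned on the test set.
fits = {deg: np.polyfit(x[train], y[train], deg) for deg in range(1, 9)}
best_degree = min(fits, key=lambda deg: mse(fits[deg], test))

# The reported performance comes from the validation set,
# which played no role in training or tuning.
print("best degree:", best_degree)
print("train MSE:", mse(fits[best_degree], train))
print("test MSE:", mse(fits[best_degree], test))
print("validation MSE:", mse(fits[best_degree], valid))
```

A training error that keeps shrinking with degree while the test error grows is the signature of an overtrained model.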

data ethics


intended and unintended consequences

two dangerous data-ethics myths

  • Data Science is a black box
  • Models are neutral, data is biased

Data Science is a black box

machine learning models are

Epistemic transparency



Right to explanation: the scope of a general "right to explanation" is a matter of ongoing debate


Accountability: who is responsible if an algorithm does harm?

algorithmic transparency

[Figure: model families (trivially intuitive models, generalized additive models, decision trees, Random Forest, Deep Learning) placed along an axis of algorithmic transparency, against accuracy in solving complex problems and the number of features that can be effectively included in the model]

algorithmic transparency


Machine learning: any method that learns parameters from the data


The transparency of an algorithm decreases with its complexity and with the complexity of the data space


The transparency of an algorithm is limited by our own ability and preparedness to interpret it
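As a small illustration of a model near the transparent end of the range, a shallow decision tree can be printed as a handful of readable rules. The iris dataset, `max_depth=2`, and the use of scikit-learn's `export_text` are choices made here for the sketch, not part of the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree stays readable; deeper trees trade that away for flexibility.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The fitted model can be written out as a short list of if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

A Random Forest made of hundreds of such trees, or a deep neural network, has no comparably short description, which is where the reader's ability to interpret the model becomes the limit.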

algorithmic transparency

models are neutral, the bias is in the data

Why does this AI model whiten Obama's face?

Simple answer: the data is biased. The algorithm is fed more images of white people.
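The "simple answer" can be reproduced in a toy setting: when one group dominates the training data, a model tuned for overall accuracy fits that group and performs worse on the other. Everything below (the two hypothetical groups A and B, their thresholds, the 95/5 split, the single-cut "model") is an invented example, not a description of the face model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, threshold):
    """Draw n points whose true label is 1 when x exceeds the group's threshold."""
    x = rng.uniform(-2, 3, size=n)
    return x, (x > threshold).astype(int)

# Hypothetical groups: A is 95% of the training data, B only 5%.
xa, ya = sample(950, threshold=0.0)
xb, yb = sample(50, threshold=1.0)
x_train = np.concatenate([xa, xb])
y_train = np.concatenate([ya, yb])

# "Model": a single cut on x, chosen to maximize overall training accuracy.
candidates = np.linspace(-2, 3, 501)
accuracies = [np.mean((x_train > c).astype(int) == y_train) for c in candidates]
cut = candidates[int(np.argmax(accuracies))]

# Evaluate on fresh samples from each group separately.
xa_new, ya_new = sample(5000, threshold=0.0)
xb_new, yb_new = sample(5000, threshold=1.0)
acc_a = np.mean((xa_new > cut).astype(int) == ya_new)
acc_b = np.mean((xb_new > cut).astype(int) == yb_new)
print(f"cut {cut:.2f}, accuracy on A {acc_a:.2f}, accuracy on B {acc_b:.2f}")
```

Note that the gap also depends on a modeling decision: a single cut optimized for overall accuracy. A different model or objective would distribute the errors differently.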

Decide which model is appropriate (depends on data and question)



where is the bias?

1 - model selection

[Figure: the model families, from trivially intuitive models, generalized additive models, and decision trees to Random Forest and Deep Learning]





where is the bias?

Decide what your target function is

Machine learning models are functions that "learn" their parameters from the data.

They "learn" by minimizing or maximizing some quantity.


What should you minimize?

2 - cost function
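The choice of cost function changes what the model learns, even for the simplest possible model. In this sketch a single constant is fit to the same data under two costs; the five numbers and the grid search are made up for illustration.

```python
import numpy as np

# Five observations, one of them an outlier (made-up numbers).
y = np.array([1.0, 1.2, 0.9, 1.1, 10.0])

# Candidate models: a single constant c, scanned over a grid with step 0.001.
grid = np.linspace(0, 10, 10001)
squared_cost = ((y[None, :] - grid[:, None]) ** 2).sum(axis=1)
absolute_cost = np.abs(y[None, :] - grid[:, None]).sum(axis=1)

best_squared = grid[squared_cost.argmin()]    # lands next to the mean of y
best_absolute = grid[absolute_cost.argmin()]  # lands next to the median of y

print("minimizing squared error:", round(float(best_squared), 3))
print("minimizing absolute error:", round(float(best_absolute), 3))
```

Minimizing the squared error lets the outlier pull the answer toward it, while minimizing the absolute error largely ignores it. Neither is neutral: the choice encodes which errors matter most.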

where is the bias?


the hypothetical trolley problem suddenly is real

self-driving cars

2 - cost function

where is the bias?


prosecutorial justice

minimize the number of people incarcerated unjustly

maximize public safety


2 - cost function

Explore the data


discover some of the bias

(trust me, there is more!)

it's not easy

there's covariance 

missing data

where is the bias?

3 - data selection and preparation

remove the bias...

(few try)

models are neutral, the bias is in the data

  • The bias is in the data
  • The bias is in the models and the decisions we make
  • The bias is in how we choose to optimize our model
  • The bias is in society, which provides the framework to validate our biased models

Should AI reflect who we are (and enforce and grow our bias), or should it reflect who we aspire to be? (And who decides what that is?)

key concepts


  • Machine Learning models are parametrized representations of "reality", where the parameters are learned from finite sets of realizations of that reality
  • Unsupervised learning: all variables are observed for all data, and we look for natural groupings of datapoints in the N-dimensional space
  • Supervised learning: a target variable is known for a subset of the data, and the goal is to predict it for new data (or for the rest of the data)


  • epistemic transparency: not all models are the same
  • there is a tradeoff between epistemic transparency and the ability to handle complex data
  • bias enters data science through (at least) the data; model selection; the target function and optimization choices; validation




