foundations of data science for everyone

VI: Logistic Regression
AUTHOR AND LECTURER: Farid Qamar 

0. optimizing the objective function

what is a model?

in the ML context: a model is a low-dimensional representation of a higher-dimensionality dataset

recall: what is machine learning?

ML: any model with parameters learned from the data

ML models are a parameterized representation of "reality", where the parameters are learned from finite sets (samples) of realizations of that reality (the population)

how do we model?

1. Choose the model: a mathematical formula to represent the behavior in the data

example: line model y = ax + b, with parameters a and b

Choose the hyperparameters: parameters chosen before the learning process, which govern the model and the training process

example: the degree N of the polynomial

y = \sum^N_{i=0}c_i x^i

2. Choose an objective function: to find the "best" parameters of the model, we need to "optimize" a function, i.e. something to be either MINIMIZED or MAXIMIZED

example: line model y = ax + b, with objective function the sum of squared residuals (least-squares fit method)

SSE = \sum_i(y_{i,observed}-y_{i,predicted})^2
SSE = \sum_i(y_{i,observed}-(ax_i+b))^2

we want to minimize the SSE as much as possible

Optimizing the Objective Function

assume a simpler line model y = ax (b = 0), so we only need to find the "best" parameter a

the SSE reaches its minimum (optimal) value at a = 4

How do we find the minimum if we do not know beforehand what the SSE curve looks like?
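To make the search concrete, here is a minimal sketch (on hypothetical toy data generated around y = 4x, since the slides' dataset is not included) that evaluates the SSE objective on a grid of candidate values of a for the simpler model y = ax:

import numpy as np

# hypothetical toy data roughly following y = 4x (stand-in for the slides' points)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 4 * x + rng.normal(0, 2, size=x.size)

# evaluate the objective SSE(a) = sum_i (y_i - a*x_i)^2 on a grid of a values
a_grid = np.linspace(0, 8, 161)
sse = np.array([np.sum((y - a * x) ** 2) for a in a_grid])

best_a = a_grid[np.argmin(sse)]
print(best_a)   # close to 4: the minimum of the SSE curve

Scanning a grid works for one parameter but scales badly; gradient descent (later in this lecture) finds the minimum without evaluating the whole curve.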

1. regression vs classification

General purposes of ML:

  • to understand the structure of feature space
  • regression: to predict unknown values based on known examples
  • classification: to identify unknown classes based on known examples
  • feature importance: to understand which features are important for the success of the model

Linear Regression

Regression example: find the optimal parameters (slope/coefficients and intercept) of a linear model that best combine the features (independent variables) to describe the target (dependent variable)

example data: World Bank, life expectancy at birth in the US

line model y = ax + b
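As a minimal sketch (with hypothetical year/life-expectancy numbers standing in for the World Bank table, which is not reproduced here), the least-squares line fit can be done with numpy:

import numpy as np

# hypothetical stand-in for the World Bank life-expectancy data
rng = np.random.default_rng(1)
year = np.arange(1960, 2020)
life_exp = 69.8 + 0.17 * (year - 1960) + rng.normal(0, 0.3, size=year.size)

# least-squares fit of the line model y = a*x + b
a, b = np.polyfit(year, life_exp, deg=1)
print(a, b)   # slope (years of life expectancy gained per year) and intercept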

what if our data is represented by a binary value?

try fitting a linear model...

what if we add more data points?

2. logistic regression

the Logistic Function (Sigmoid):

f(x) = \frac{1}{1+e^{-z}} ; \quad z = ax+b

f is interpreted as the probability that the target is True (= 1)

Objective Function: log-likelihood

\log(\mathscr{L}) = \sum_i \left( y_i \log(f)+(1-y_i)\log(1-f) \right)

what if we add more data points?
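A minimal sketch (on hypothetical toy binary data, not the slides' example) of the logistic function and the log-likelihood objective it is scored with:

import numpy as np

def logistic(x, a, b):
    # f(x) = 1 / (1 + exp(-z)) with z = a*x + b: the probability that the target is 1
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

def log_likelihood(x, y, a, b, eps=1e-12):
    # log L = sum_i [ y_i log f + (1 - y_i) log(1 - f) ], maximized by the best a, b
    f = np.clip(logistic(x, a, b), eps, 1 - eps)
    return np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

# hypothetical binary data: targets switch from 0 to 1 around x = 5
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = (x + rng.normal(0, 1, size=x.size) > 5).astype(int)

print(log_likelihood(x, y, a=1.0, b=-5.0))   # a good fit: higher log-likelihood
print(log_likelihood(x, y, a=0.1, b=0.0))    # a poor fit: typically much lower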

3. classification model evaluation

Confusion Matrix

indicates the model's "confusion" between classification outcomes: each entry counts the objects of a given Actual class that receive a given Predicted class label (class 0, class 1, class 2, ...)

smaller off-diagonal elements & larger diagonal elements = model more effective at correctly labeling classes

Confusion Matrix with two classes, negative and positive

for example, a model predicting 500 objects:

                     Predicted negative    Predicted positive
  Actual negative        232 (TN)               4 (FP)
  Actual positive          1 (FN)             263 (TP)

True/False Positives/Negatives

Classification outcomes:

    true positives  (TP) : "+" correctly labeled as "+"
    true negatives  (TN) : "-" correctly labeled as "-"
    false positives (FP) : "-" incorrectly labeled as "+"
    false negatives (FN) : "+" incorrectly labeled as "-"

Accuracy

accuracy: \frac{TP+TN}{N} = \frac{TP+TN}{TP+FP+TN+FN}

for the example above: accuracy = \frac{232+263}{500}=99\%

Precision and Recall

precision: \frac{TP}{TP+FP}

recall: \frac{TP}{TP+FN}

for the example above: precision = \frac{263}{263+4}=98.5\% and recall = \frac{263}{263+1}=99.6\%

precision: \frac{TP}{TP+FP} , the fraction of objects you think are positive that actually are positive

recall (or sensitivity): \frac{TP}{TP+FN} , the fraction of positive objects that you were able to find

(specificity is a different quantity: the true-negative rate \frac{TN}{TN+FP})

F1-score: \frac{2\times\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}
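A short sketch checking the numbers of the 500-object example above (TN = 232, FP = 4, FN = 1, TP = 263):

# confusion-matrix counts from the 500-object example
TN, FP, FN, TP = 232, 4, 1, 263
N = TN + FP + FN + TP                      # 500 objects

accuracy = (TP + TN) / N                   # 495/500 = 0.99
precision = TP / (TP + FP)                 # 263/267 ≈ 0.985
recall = TP / (TP + FN)                    # 263/264 ≈ 0.996
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)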

Current classifier accuracy: 50%

Precision = 4/6 ≈ 0.67

Recall = 4/8 = 0.5

Sensitivity = recall = 0.5

Specificity = \frac{TN}{TN+FP} : with TP = 4, FP = 2, FN = 4 and 50% accuracy this implies TN = 2, so specificity = 2/4 = 0.5 (note that specificity is not the same as precision)

4. encoding categorical variables

Categorical Variable

a variable that can take a finite number of values.

species   age   weight
dog       7     32.3
bird      1     0.3
cat       3     8.1

species is categorical, age is ordinal, weight is continuous

numerical encoding: change categorical to (integer) numerical

species   age   weight
1         7     32.3
2         1     0.3
3         3     8.1

but numerical encoding implies an order that does not exist:
dog = 1, bird = 2, cat = 3   ...dog < bird < cat... ??

one-hot encoding: change each category to a binary feature

cat   bird   dog   age   weight
0     0      1     7     32.3
0     1      0     1     0.3
1     0      0     3     8.1

one-hot encoding is definitely preferred! but it ignores covariance between features, increases the dimensionality, and is problematic if you are interested in feature importance
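A minimal sketch of the two encodings with pandas, using the three-row pet table from the slides:

import pandas as pd

df = pd.DataFrame({"species": ["dog", "bird", "cat"],
                   "age": [7, 1, 3],
                   "weight": [32.3, 0.3, 8.1]})

# numerical encoding: map each category to an integer (implies an order that does not exist)
numeric = df.copy()
numeric["species"] = numeric["species"].map({"dog": 1, "bird": 2, "cat": 3})

# one-hot encoding: one binary column per category
onehot = pd.get_dummies(df, columns=["species"])
print(onehot)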

5. normalization

Data can have covariance (and it almost always does!)

PLUTO Manhattan data (42,000 x 15): axis 0 -> observations, axis 1 -> features

correlation = covariance normalized by the spread (the standard deviations of the two variables)

Pearson's correlation (linear correlation):

r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
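A quick sketch (on hypothetical correlated toy data) of computing r_xy from the definition and checking it against numpy:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(scale=0.5, size=1000)   # correlated toy data

# Pearson's r, written out from the definition above
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2)))
print(r, np.corrcoef(x, y)[0, 1])   # the two values agree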

Generic preprocessing... WHY??

Clustering without scaling: only the variable with more spread matters

Skewed data distribution: std(x) ~ range(y)

unsupervised vs supervised learning

Unsupervised learning (clustering):

  • understanding structure
  • anomaly detection
  • dimensionality reduction

Supervised learning (classifying & regression):

  • classification
  • prediction
  • feature selection

Generic preprocessing... WHY??

World Bank Happiness Dataset

Classification/Clustering without scaling: only the variable with more spread matters

Classification/Clustering after scaling: both variables matter equally

Data that is not correlated appears as a sphere in the N-dimensional feature space; data can have covariance (and it almost always does!): compare the ORIGINAL DATA with the STANDARDIZED DATA

Generic preprocessing: most commonly, we will just correct for the spread and centroid

for each feature: subtract the mean and divide by the standard deviation
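A minimal sketch of that standardization, written by hand and with scikit-learn (hypothetical two-feature data with very different spreads):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(loc=[50.0, 0.1], scale=[20.0, 0.02], size=(1000, 2))  # two features, very different spreads

# by hand: for each feature, subtract the mean and divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# the same with scikit-learn
X_std_skl = StandardScaler().fit_transform(X)
print(np.allclose(X_std, X_std_skl))            # True
print(X_std.mean(axis=0), X_std.std(axis=0))    # ~0 and ~1 for each feature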

whitening

the term "whitening" refers to white noise, i.e. noise with the same power at all frequencies

PLUTO Manhattan data (42,000 x 15) correlation matrix: axis 0 -> observations, axis 1 -> features

a covariance matrix is diagonal if the data has no correlation

Full-on whitening: find the matrix W that diagonalizes Σ, i.e. remove covariance by transforming the data with a matrix that diagonalizes the covariance matrix

from zca import ZCA
import numpy as np

X = np.random.random((10000, 15))  # data array
trf = ZCA().fit(X)                 # learn the whitening transform from the data
X_whitened = trf.transform(X)      # whitened data has a diagonal covariance matrix
X_reconstructed = trf.inverse_transform(X_whitened)
assert np.allclose(X, X_reconstructed)   # the transform is invertible

this is at best hard, and in some cases impossible, even numerically, on large datasets

Generic preprocessing: other common schemes

for image processing (e.g. segmentation) you often need to minmax preprocess

from sklearn import preprocessing

# scale each row of the image array to the [0, 1] range
Xopscaled = preprocessing.minmax_scale(image_pixels.astype(float), axis=1)
Xopscaled.reshape(op.shape)[200, 700]

before: pixel values range from about -107 to 273; after: the image looks the same, but the colorbar now runs from 0 to 1

reading

HW

1. stochastic gradient descent (SGD)

the algorithm: Stochastic Gradient Descent (SGD)

assume a simpler line model y = ax (b = 0), so we only need to find the "best" parameter a

1. choose an initial value for a
2. calculate the SSE
3. calculate the best direction to go to decrease the SSE
4. step in that direction
5. go back to step 2 and repeat

the algorithm: Stochastic Gradient Descent

for a line model y = ax + b, we need to find the "best" parameters a and b

1. choose initial values for a & b
2. calculate the SSE
3. calculate the best direction to go to decrease the SSE
4. step in that direction
5. go back to step 2 and repeat
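A hedged sketch of this loop on hypothetical toy data (plain gradient descent on the full dataset; the stochastic variant noted below would use a random sub-sample of the points at each iteration):

import numpy as np

# hypothetical toy data roughly following y = 4x + 2
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 100)
y = 4 * x + 2 + rng.normal(0, 2, size=x.size)

a, b = 0.0, 0.0           # 1. choose initial values for a & b
learning_rate = 2e-4      # how far to step
for step in range(5000):
    resid = y - (a * x + b)            # 2. the SSE is sum(resid**2)
    grad_a = -2 * np.sum(resid * x)    # 3. the gradient of the SSE: the best
    grad_b = -2 * np.sum(resid)        #    direction to go is minus the gradient
    a -= learning_rate * grad_a        # 4. step in that direction
    b -= learning_rate * grad_b        # 5. repeat
print(a, b)   # close to 4 and 2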

the algorithm: Stochastic Gradient Descent

Things to consider:

- local vs. global minima
- initialization: choosing the starting spot?
- learning rate: how far to step?
- stopping criterion: when to stop?

Stochastic Gradient Descent (SGD): use a different (random) sub-sample of the data at each iteration

2. multiple linear regression

ML terminology

World Bank: Life expectancy at birth in the US

objects: the individual observations (the rows of the dataset)
features: the input variables (the columns)
target: the variable we want to predict
model: the function fit to the data

Simple Linear Regression

1 feature, 1 target, 2 parameters

y = ax + b
y = \beta_0 + \beta_1x_1

Multiple Linear Regression

n features, 1 target, n+1 parameters

y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + ... + \beta_nx_n

y = \sum_{i=0}^n \beta_ix_i ; \quad x_0 = \vec{1}
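A minimal sketch (hypothetical data with n = 3 features) of fitting the n + 1 parameters β with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))                             # n = 3 features
beta_true = np.array([1.5, -2.0, 0.5])                    # beta_1 ... beta_n
y = 3.0 + X @ beta_true + rng.normal(0, 0.1, size=200)    # beta_0 = 3.0, plus noise

model = LinearRegression().fit(X, y)
print(model.intercept_)   # ≈ 3.0  (beta_0)
print(model.coef_)        # ≈ [1.5, -2.0, 0.5]  (beta_1 ... beta_n)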

Foundations of Data Science for Everyone - VI: Logistic Regression

By federica bianco