PHC6937 Public Health Research Methods

Machine Learning

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

huihu@ufl.edu

April 9, 2020

Introduction

Bias-Variance Trade-off

Regularization

Decision Trees and Ensembles

Neural Networks and Deep Learning

Introduction

Inferential Model

vs

Predictive Model

The Limits of Traditional Computer Programs

Image from MNIST handwritten digit dataset

A zero that is difficult to distinguish from a six algorithmically

How to distinguish between threes and fives?
Or between fours and nines?

We don't know what program to write because we don't know how it's done by our brains

Machine Learning

Many things we learn in school have a lot in common with traditional computer programs:
how to multiply numbers, solve equations, take derivatives
The things we learn at an extremely early age, the things we find most natural, are learned by example, not by formula:
recognize a dog
In other words, when we were born, our brains provided us with a model that described how we would be able to see the world
- as we grew up, that model would take in our sensory inputs and make a guess about what we're experiencing
- if that guess is confirmed by our parents, our model would be reinforced
- over our lifetime, our model becomes more and more accurate as we assimilate billions of examples

Machine Learning

Machine learning is predicated on this idea of learning from example
Instead of teaching a computer a massive list of rules to solve the problem, we give it a model with which it can evaluate examples and a small set of instructions to modify the model when it makes a mistake

Let's define our model to be a function

h(x,\theta)

Another Example

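To predict exam performance (above or below average) based on the number of hours of sleep we get and the number of hours we study on the previous day
We collect a lot of data
Our goal might be to learn a model with parameter vector \theta = [\theta_0, \theta_1, \theta_2]^T such that:

h(x,\theta) = \begin{cases} -1 & \text{if } \theta_0 + \theta_1 x_1 + \theta_2 x_2 < 0 \\ 1 & \text{if } \theta_0 + \theta_1 x_1 + \theta_2 x_2 \ge 0 \end{cases}

where x_1 is the number of hours of sleep and x_2 is the number of hours of study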

A linear perceptron
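A linear perceptron for the exam example can be sketched in a few lines of Python. This is a minimal sketch that assumes the usual sign-threshold form (output -1 when the weighted sum is negative, 1 otherwise); the parameter values are hypothetical and chosen only for illustration, not learned from data.

```python
def h(x, theta):
    """Linear perceptron: return -1 (below average) or 1 (above average).

    x     = (hours_of_sleep, hours_of_study)
    theta = (theta0, theta1, theta2), where theta0 is the bias term
    """
    theta0, theta1, theta2 = theta
    score = theta0 + theta1 * x[0] + theta2 * x[1]
    return -1 if score < 0 else 1


# Hypothetical parameter values, chosen only for illustration.
theta = (-24.0, 3.0, 4.0)
print(h((8, 3), theta))  # score = -24 + 24 + 12 = 12, so the output is 1
print(h((2, 1), theta))  # score = -24 + 6 + 4 = -14, so the output is -1
```

Learning then amounts to finding values of theta for which h agrees with as many collected examples as possible.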

Deep Learning

As we move on to much more complex problems:
- our data become extremely high dimensional
- the relationships we want to capture become highly nonlinear
To accommodate this complexity, recent research in machine learning has attempted to build models that resemble the structures utilized by our brains
This approach, commonly referred to as deep learning, has had spectacular success in tackling problems in computer vision and natural language processing. On these tasks, deep learning models:
- often far surpass other kinds of machine learning algorithms
- in some cases rival the accuracies achieved by humans

Bias-Variance Trade-off

Decomposition of Error

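The underlying relationship between X and Y:

Y = f(X) + \epsilon

We usually estimate f() for two reasons:
- prediction
- inference
The prediction model:

\hat{Y} = \hat{f}(X)

We can decompose the expected squared difference between the predicted value and the actual value of Y:

E[(Y - \hat{Y})^2] = [f(X) - \hat{f}(X)]^2 + Var(\epsilon)

- reducible error: [f(X) - \hat{f}(X)]^2
- irreducible error: Var(\epsilon)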
We aim to generate models that can minimize the reducible error.

Bias-Variance Trade-off

Cross-Validation

The test error can be calculated directly if a designated test dataset is available.
Unfortunately, this is usually not the case.
Most of the time, we do not have a large designated test dataset that can be used to directly estimate the test error.
Cross-validation:
A method that estimates the test error by holding out a subset of the training observations from the fitting process, and then applying the trained model to those held-out observations
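The hold-out idea behind k-fold cross-validation can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mse`) are hypothetical, and the "model" is deliberately trivial (it predicts the training mean).

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, error):
    """Average the held-out error over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training part only, then score on the held-out part.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(error(model,
                            [xs[i] for i in held_out],
                            [ys[i] for i in held_out]))
    return sum(scores) / k


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training outcomes."""
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    """Mean squared error of the constant prediction `model`."""
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(10))
ys = [2.0, 2.5, 1.5, 3.0, 2.0, 2.5, 1.0, 3.5, 2.0, 2.0]
print(cross_validate(xs, ys, k=5, fit=fit_mean, error=mse))
```

Every observation is held out exactly once, so the averaged score uses all of the data without ever scoring a model on observations it was fitted to.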

[Figure: the data are split into Training (80%), used with k-fold CV to tune models, and Testing (20%), used to evaluate performance]

Machine Learning Pipeline

[Figure: pipeline stages: Raw Data, Data Engineering, Explore, Model Selection, Feature Engineering, Train Model, Evaluate Performance, Data Product]

Regularization

Linear Regression with One Variable

The hypothesis function:

h_\theta(x) = \theta_0 + \theta_1 x

Cost function:

J(\theta_0, \theta_1) = {1 \over 2m} \sum_{i=1}^m (h_\theta(x_i) - y_i)^2

We measure the accuracy of our hypothesis function by using a cost function.
It takes a (halved) average of the squared differences between the hypothesis evaluated at each input x_i and the actual output y_i, over all m training examples

This function is otherwise called the "squared error function" or "mean squared error".
The mean is halved as a convenience for the computation of gradient descent, as the derivative of the square term will cancel out the 1/2 factor.
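As a small worked example, the halved mean squared error for a one-variable hypothesis can be computed directly. This sketch assumes the hypothesis has the form theta0 + theta1 * x; the data values are made up for illustration.

```python
def hypothesis(theta0, theta1, x):
    """One-variable linear hypothesis."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Halved mean of the squared differences between predictions and outputs."""
    m = len(xs)
    squared_errors = [(hypothesis(theta0, theta1, x) - y) ** 2
                      for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)


# Made-up data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(cost(0.0, 2.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 1.0, xs, ys))  # errors -1, -2, -3 give (1 + 4 + 9) / 6
```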

Gradient Descent

Now we need to estimate the parameters in the hypothesis function.

The way we do this is by taking the derivative of our cost function.
The slope of the tangent is the derivative at that point, and it gives us a direction to move in.
We take steps down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.

Gradient Descent

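The gradient descent algorithm is:

repeat until convergence:

\theta_j := \theta_j - \alpha {\partial \over \partial \theta_j} J(\theta_0, \theta_1)

where j=0,1 represents the feature index number.

Gradient Descent for Linear Regression:

repeat until convergence (updating both parameters simultaneously):

\theta_0 := \theta_0 - \alpha {1 \over m} \sum_{i=1}^m (h_\theta(x_i) - y_i)

\theta_1 := \theta_1 - \alpha {1 \over m} \sum_{i=1}^m (h_\theta(x_i) - y_i) x_i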
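Gradient descent for linear regression with one variable can be sketched as follows. This is a minimal illustration, assuming the halved mean squared error as the cost function; the learning rate, the iteration count, and the data are arbitrary choices for the example rather than recommended settings.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old parameter values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow, which is why α usually has to be tuned.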

Why Gradient Descent?

Normal Equation:

\theta = (X^T X)^{-1} X^T y

Gradient Descent              Normal Equation
Need to choose alpha          No need to choose alpha
Needs many iterations         No need to iterate
Works well when n is large    Slow if n is very large

Here n is the number of features.
For large datasets, we usually use stochastic gradient descent.

Regularization

High bias or underfitting:
- when the form of our hypothesis function maps poorly to the trend of the data.
- It is usually caused by a function that is too simple or uses too few features.
High variance or overfitting:
- caused by a hypothesis function that fits the available data well but does not generalize well to predict new data.
- It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
Two main options to address overfitting:
- Reduce the number of features (manually select which features to keep/use a model selection algorithm)
- Regularization (Keep all the features, but reduce the parameters θ)

Regularization

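If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

Say we want to make the following function more quadratic:

\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4

We'll want to eliminate the influence of the cubic and quartic terms.

Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function, for example by adding a large penalty on \theta_3 and \theta_4:

\min_\theta {1 \over 2m} \sum_{i=1}^m (h_\theta(x_i) - y_i)^2 + 1000 \theta_3^2 + 1000 \theta_4^2

In general:

\min_\theta {1 \over 2m} [\sum_{i=1}^m (h_\theta(x_i) - y_i)^2 + \lambda \sum_{j=1}^n \theta_j^2]

L2 regularization (Ridge)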

Regularization

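L1 regularization (Lasso):

J(\theta) = {1 \over 2m} [\sum_{i=1}^m (h_\theta(x_i) - y_i)^2 + \lambda \sum_{j=1}^n |\theta_j|]

L2 regularization (Ridge):

J(\theta) = {1 \over 2m} [\sum_{i=1}^m (h_\theta(x_i) - y_i)^2 + \lambda \sum_{j=1}^n \theta_j^2]

L1+L2 regularizations (Elastic net):

J(\theta) = {1 \over 2m} [\sum_{i=1}^m (h_\theta(x_i) - y_i)^2 + \lambda_1 \sum_{j=1}^n |\theta_j| + \lambda_2 \sum_{j=1}^n \theta_j^2]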
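The three penalty terms can be compared on a small parameter vector. This is a minimal sketch that assumes the common convention of leaving the intercept (the first parameter) unpenalized; the parameter values and penalty strengths are made up for illustration.

```python
def l1_penalty(theta, lam):
    """Lasso penalty: lam times the sum of absolute values (intercept excluded)."""
    return lam * sum(abs(t) for t in theta[1:])


def l2_penalty(theta, lam):
    """Ridge penalty: lam times the sum of squares (intercept excluded)."""
    return lam * sum(t * t for t in theta[1:])


def elastic_net_penalty(theta, lam1, lam2):
    """Elastic net penalty: an L1 term plus an L2 term."""
    return l1_penalty(theta, lam1) + l2_penalty(theta, lam2)


theta = [5.0, 3.0, -4.0]  # theta[0] is the intercept and is not penalized
print(l1_penalty(theta, 0.5))                # 0.5 * (3 + 4) = 3.5
print(l2_penalty(theta, 0.5))                # 0.5 * (9 + 16) = 12.5
print(elastic_net_penalty(theta, 0.5, 0.5))  # 3.5 + 12.5 = 16.0
```

Because the L2 term squares each coefficient, it punishes large coefficients much more heavily than the L1 term does, while the L1 term keeps pushing small coefficients all the way to zero.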

Comparisons

L1 regularization helps perform feature selection in sparse feature spaces
L1 rarely performs better than L2
- when two predictors are highly correlated, the L1 regularizer will simply pick one of the two predictors
- in contrast, the L2 regularizer will keep both of them and jointly shrink the corresponding coefficients a little bit
Elastic net has been shown, in theory and in practice, to often outperform L1/Lasso

Decision Trees and Ensembles

Modern name: Classification and Regression Trees (CART)
The CART algorithm provides a foundation for important algorithms such as bagged decision trees, random forests, and boosted decision trees

CART Model Representation

Binary tree
Node: a single input variable (x) and a split point on that variable
Leaf node: an output variable (y)

Making Predictions

Evaluate the specific input starting at the root node of the tree
Each split partitions the input space
e.g. height=160cm, weight=65kg
- Height>180cm: No
- Weight>80kg: No
- Therefore: Female
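Walking a fitted tree is just a chain of comparisons. The sketch below hard-codes the two splits from the example; it assumes that the "Yes" branch of each split leads to a "Male" leaf, which the example itself does not state.

```python
def predict_sex(height_cm, weight_kg):
    """Evaluate one input, starting at the root node of the example tree."""
    if height_cm > 180:   # root node: split on height
        return "Male"     # assumed label for the "Yes" branch
    if weight_kg > 80:    # second node: split on weight
        return "Male"     # assumed label for the "Yes" branch
    return "Female"       # both answers were "No"


print(predict_sex(160, 65))  # Height>180: No, Weight>80: No -> Female
```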

Learn a CART Model from Data

Creating a binary decision tree is actually a process of dividing up the input space
A greedy approach is used: recursive binary splitting
- all the values are lined up and different split points are tried and tested using a cost function
- the split with the best cost is selected
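Cost functions:
- regression: the sum of squared error between the observed values y_i and the predicted values \hat{y}_i

\sum_{i=1}^n (y_i - \hat{y}_i)^2

- classification: the Gini cost, where p_k is the proportion of training instances of class k in the group

G = \sum_k p_k (1 - p_k)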
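The greedy search for a single split can be sketched as follows. This is a minimal illustration for one numeric feature; it assumes that candidate split points are the midpoints between adjacent sorted values and that the cost of a split is the size-weighted average of the Gini cost of its two groups. The data are made up.

```python
def gini(labels):
    """Gini cost of one group: sum over classes of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    score = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        score += p * (1 - p)
    return score


def best_split(xs, labels):
    """Try every candidate split point and keep the one with the lowest cost."""
    best_threshold, best_cost = None, float("inf")
    values = sorted(set(xs))
    n = len(xs)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, labels) if x <= t]
        right = [y for x, y in zip(xs, labels) if x > t]
        split_cost = (len(left) * gini(left) + len(right) * gini(right)) / n
        if split_cost < best_cost:
            best_threshold, best_cost = t, split_cost
    return best_threshold, best_cost


heights = [150, 160, 165, 175, 185, 190]
sexes = ["F", "F", "F", "M", "M", "M"]
print(best_split(heights, sexes))  # (170.0, 0.0): the split separates the classes
```

Recursive binary splitting repeats this search inside each of the two resulting groups until a stopping criterion is met.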

Stopping Criterion

The recursive binary splitting procedure needs to know when to stop splitting as it works its way down the tree with the training data
Most common stopping procedure:
- set a minimum count on the number of training instances assigned to each leaf node
- this count defines how specific to the training data the tree will be
- too specific (e.g. a count of 1) will lead to overfitting
- the count needs to be tuned

Pruning the Tree

Pruning can be used after the tree is learned to further lift performance
The complexity of a decision tree is defined as the number of splits in the tree
Simple trees are preferred
- easy to understand
- less likely to overfit your data
Work through each leaf node in the tree and evaluate the effect of removing it
- a leaf node is removed only if removing it results in a drop in the overall cost function

Decision Tree Algorithm Visualization

Ensembles

Ensembles (combined models) can give you a boost in prediction accuracy
Three most popular ensemble methods:
- Bagging: build multiple models (usually the same type) from different subsamples of the training dataset
- Boosting: build multiple models (usually the same type) each of which learns to fix the prediction errors of a prior model in the sequence of models
- Voting: build multiple models (usually different types) and simple statistics (e.g. mean) are used to combine predictions

Bagging

Take multiple samples from your training dataset (with replacement) and train a model for each sample
The final output prediction is averaged across the predictions of all of the sub-models
Performs best with algorithms that have high variance (e.g. decision trees)
Can be run in parallel because each bootstrap sample does not depend on the others
Common algorithms:
- bagged decision trees
- random forest
  - reduced correlation between individual classifiers
  - a random subset of features is considered for each split
- extra trees
  - further reduced correlation between individual classifiers
  - the cut-point is selected fully at random, independently of the outcome
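Bagging can be sketched with a very small base learner. The example below is a minimal illustration, not a random forest: it assumes a one-split regression tree (a "stump") on a single feature as the sub-model, and the helper names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree, minimizing the sum of squared error.

    Returns (threshold, left_mean, right_mean).
    """
    values = sorted(set(xs))
    if len(values) < 2:
        # No split is possible: predict the overall mean on both sides.
        mean = sum(ys) / len(ys)
        return (values[0], mean, mean)
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Average the predictions of all sub-models."""
    return sum(predict_stump(m, x) for m in models) / len(models)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
models = bagging_fit(xs, ys)
print(bagging_predict(models, 1), bagging_predict(models, 8))
```

Each stump sees a slightly different sample, so individual stumps disagree; averaging them smooths out that variance.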

Bootstrap Aggregation

Boosting

Creates a sequence of models that attempt to correct the mistakes of the models before them in the sequence
Build a model from the training data, then create a second model that attempts to correct the errors from the first model
Once created, the models make predictions which may be weighted by their demonstrated accuracy and the results are combined to create a final output prediction
Models are added until the training set is predicted perfectly or a maximum number of models is reached
Works in a sequential manner

Common Algorithms

AdaBoost (Adaptive Boosting)

Weights instances in the dataset by how easy or difficult they are to predict
Allows the algorithm to pay more or less attention to them in the construction of subsequent models

Gradient Boosting (Stochastic Gradient Boosting)

Treats boosting as an iterative functional gradient descent algorithm
In the stochastic variant, at each iteration of the algorithm, a base learner is fit on a subsample of the training set drawn at random without replacement

AdaBoost (Adaptive Boosting)

Initialize observation weights:

w_i=1/N, i=1,2,...,N

For m=1 to M
- fit a classifier G_m(x) to the training data using the weights w_i
- compute the weighted error:

err_m= {{\sum^N_{i=1}w_iI(y_i \ne G_m(x_i))}\over {\sum^N_{i=1}w_i}}

- compute the classifier weight:

\alpha_m=\log({1-err_m\over err_m})

- set:

w_i \leftarrow w_i\times \exp[\alpha_m \times I(y_i \ne G_m(x_i))], i=1,2,...,N
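The AdaBoost loop can be sketched on a one-dimensional toy problem. This is a minimal illustration: the base classifier is assumed to be a threshold rule (a "stump") with labels coded as -1 and +1, and the final prediction is assumed to be the sign of the alpha-weighted vote, which is the standard combination rule but is not written out in the steps above. The helper names and data are hypothetical.

```python
import math


def stump_predict(stump, x):
    """Threshold rule: predict s when x > t, otherwise -s."""
    t, s = stump
    return s if x > t else -s


def fit_weighted_stump(xs, ys, w):
    """Pick the threshold t and sign s with the smallest weighted error."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for s in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict((t, s), x) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best[1], best[2]


def adaboost(xs, ys, M=3):
    n = len(xs)
    w = [1.0 / n] * n                        # initialize w_i = 1/N
    ensemble = []
    for _ in range(M):
        stump = fit_weighted_stump(xs, ys, w)
        miss = [stump_predict(stump, x) != y for x, y in zip(xs, ys)]
        err = sum(wi for wi, m in zip(w, miss) if m) / sum(w)
        if err == 0:                         # perfect classifier: keep it and stop
            ensemble.append((1.0, stump))
            break
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = math.log((1 - err) / err)
        # Increase the weights of the misclassified observations only.
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        ensemble.append((alpha, stump))
    return ensemble


def adaboost_predict(ensemble, x):
    """Assumed combination rule: sign of the alpha-weighted vote."""
    total = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if total >= 0 else -1


# Toy data that no single threshold rule can classify perfectly.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(xs, ys, M=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

No single stump separates this pattern, but three reweighted stumps combined by their alpha values reproduce all ten training labels.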

AdaBoost (Adaptive Boosting)

[Figure: AdaBoost on an example dataset; panels: Iteration 1, Iteration 2, Iteration 3, Final Model]

Intuitive sense:
- weights are increased for incorrectly classified observations, so the next iteration gives them more focus
- weights are reduced, in relative terms, for correctly classified observations

Gradient Boosting

Instead of reweighting observations as in adaptive boosting, gradient boosting makes corrections to the prediction errors directly
Learn a model -> compute the error residual -> learn to predict the residual
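The learn, compute residuals, and fit residuals cycle can be sketched for squared-error regression. This is a minimal illustration: it assumes the initial model is the mean of the outcomes, that each round fits a one-split regression tree (a "stump") to the current residuals, and that each correction is scaled by a learning rate. The helper names, settings, and data are hypothetical.

```python
def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                  # initial model: the mean
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # compute residuals
        stump = fit_stump(xs, residuals)                 # model the residuals
        stumps.append(stump)
        # Combine: add a scaled correction to the running prediction.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
sse_mean = sum((y - mean_y) ** 2 for y in ys)
sse_boost = sum((y - boost_predict(model, x)) ** 2 for x, y in zip(xs, ys))
print(sse_mean, sse_boost)
```

Each round removes part of the remaining error, so the training error of the combined model shrinks as models are added, at the cost of a more complex model.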

[Figure: Initial model -> Compute residuals -> Model residuals -> Combinations -> ...]

Gradient Boosting

Learn a sequence of models
The combination of models becomes increasingly accurate and increasingly complex

[Figure: model predictions and residuals at successive iterations]

Neural Networks and Deep Learning

Michael A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015

Neural Networks

A field of study that investigates how simple models of biological brains can be used to solve difficult computational tasks (e.g. predictive modeling in machine learning)
- the goal is not to create realistic models of the brain
- the goal is to develop robust algorithms and data structures that can be used to model difficult problems
NNs with enough hidden units can approximate a very wide class of mapping functions and have been proven to be universal approximators
- the predictive capability of NNs comes from the hierarchical/multilayered structure of the networks
- the data structure can pick out features at different scales or resolutions and combine them into higher-order features

Perceptron

Perceptron:
- a single neuron model that was a precursor to larger neural networks
- neuron
- neuron weights
- activation

Neuron: the building block for NNs
- simple computational units that have weighted input signals and produce an output signal using an activation function

Perceptron

Neuron weights:
- similar to the coefficients used in a regression equation
- like linear regression, each neuron also has a bias, which can be thought of as an input that always has the value 1.0 and that must also be weighted (e.g. a neuron with 2 inputs requires 3 weights)
- weights are often initialized to small random values (e.g. between 0 and 0.3)
Activation:
- the weighted inputs are summed and passed through an activation function (also called a transfer function)
- it governs the threshold at which the neuron is activated and the strength of the output signal
- historically, simple step activation functions were used (e.g. if the summed input was above a threshold, say 0, then the neuron would output a value of 1; otherwise, it would output -1)
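A single neuron with a step activation can be written out directly. This is a minimal sketch; the weight values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def step(z, threshold=0.0):
    """Step activation: 1 if the summed input is above the threshold, else -1."""
    return 1 if z > threshold else -1


def neuron_output(inputs, weights, bias_weight):
    """Weighted sum of the inputs plus the weighted bias input (fixed at 1.0)."""
    z = bias_weight * 1.0 + sum(w * x for w, x in zip(weights, inputs))
    return step(z)


# A neuron with 2 inputs needs 3 weights: two input weights and one bias weight.
weights = [0.2, 0.3]   # hypothetical values
bias_weight = -0.1     # hypothetical value
print(neuron_output([1.0, 0.0], weights, bias_weight))  # z = 0.1, output 1
print(neuron_output([0.0, 0.0], weights, bias_weight))  # z = -0.1, output -1
```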

Expressing Linear Perceptrons

Feed-forward Neural Networks

Single neurons are not expressive enough to solve complicated learning problems
The neurons in the human brain are organized in layers
- information flows from one layer to another
- sensory input is converted into conceptual understanding

Feed-forward NNs:

connections only traverse from a lower layer to a higher layer
no connections between neurons in the same layer
no connections that transmit data from a higher layer to a lower layer

Linear Activation

Linear activation is easy to compute with, but has serious limitations
Any feed-forward NN consisting of only linear activations can be expressed as an NN with no hidden layers
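The collapse of stacked linear layers into a single layer can be checked numerically. This sketch uses plain nested lists as matrices and omits bias terms for brevity; the weight values are arbitrary.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


W1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # input (2) -> hidden (3)
W2 = [[1.0, -1.0, 2.0]]                      # hidden (3) -> output (1)
x = [2.0, 5.0]

two_layer = matvec(W2, matvec(W1, x))        # network with a linear hidden layer
W = matmul(W2, W1)                           # one equivalent weight matrix
one_layer = matvec(W, x)                     # network with no hidden layer
print(two_layer, one_layer)                  # both are [9.0]
```

Whatever the hidden layer size, the product W2 W1 is a single matrix, so the extra layer adds no expressive power unless a nonlinearity sits between the two.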

Nonlinear Activation

In order to learn complex relationships, we need to use activation functions that employ some sort of nonlinearity

Logistic function / Sigmoid function (output range: 0~1):

f(z)=1/(1+e^{-z})

Hyperbolic tangent (Tanh) function (output range: -1~1):

f(z)=\tanh(z)

ReLU (rectified linear unit) function:

f(z)=\max(0,z)
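The three activation functions are short enough to write out and evaluate at a few points. This is a minimal sketch using only the Python standard library.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Hyperbolic tangent: squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```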

Networks of Neurons

Neurons are arranged into networks of neurons:
- a row of neurons is called a layer
- the architecture of the neurons in the network is often called the network topology
Input or Visible Layer:
- the bottom layer, which takes input from the dataset
- usually with one neuron per feature in the dataset
Hidden Layers:
- not directly exposed to the input
- the simplest network structure is to have a single neuron in the hidden layer that directly outputs the value
- deep learning can refer to having many hidden layers in an NN
Output Layer:
- the final layer of the network, which produces the prediction
- the choice of activation function in the output layer is constrained by the type of problem that you are modeling
- e.g. a single output neuron with no activation function for a regression problem, a single output neuron with a sigmoid function for a binary classification problem, a softmax output layer for a multiclass classification problem

Training Networks

Data Preparation:
- data must be numerical (one-hot encoding for categorical features)
- NNs require the input to be scaled in a consistent way (e.g. normalization)
Stochastic Gradient Descent:
- classical training algorithm for NNs
- one row of data is exposed to the network at a time as input
- the network processes the input upward activating neurons as it goes to finally produce an output value (forward propagation)
- the output of the network is compared with the expected output and an error is calculated
- this error is then propagated back through the network, one layer at a time, and the weights are updated according to the amount that they contributed to the error (back propagation algorithm)
- the process is repeated for all of the examples in training data
- one round of updating the network for the entire training dataset is called an epoch
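The row-at-a-time training loop can be sketched for the smallest possible network: a single sigmoid neuron with two inputs. With no hidden layer, back propagation reduces to one gradient step per row. The sketch assumes a log-loss error, for which the error signal at the output is simply (output - expected); the task (learning logical OR), the learning rate, and the number of epochs are arbitrary choices for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(data, epochs=500, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # One bias weight plus one weight per input, small random initial values.
    w = [rng.uniform(0.0, 0.3) for _ in range(3)]
    rows = list(data)
    for _ in range(epochs):                  # one epoch = one pass over the data
        rng.shuffle(rows)
        for (x1, x2), y in rows:             # one row at a time
            out = sigmoid(w[0] + w[1] * x1 + w[2] * x2)   # forward propagation
            delta = out - y                  # error signal at the output
            # Update each weight in proportion to its contribution to the error.
            w[0] -= learning_rate * delta    # the bias input is always 1.0
            w[1] -= learning_rate * delta * x1
            w[2] -= learning_rate * delta * x2
    return w


def predict(w, x1, x2):
    """Forward pass followed by a 0.5 cut-off."""
    return 1 if sigmoid(w[0] + w[1] * x1 + w[2] * x2) >= 0.5 else 0


or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train_sgd(or_data)
print([predict(weights, x1, x2) for (x1, x2), _ in or_data])
```

In a network with hidden layers the same loop applies, except that the error signal is passed backward through each layer to obtain the gradient for every weight.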

Prediction

Once a NN has been trained, it can be used to make predictions
The network topology and the final set of weights are all that you need to save from the model
Predictions are made by providing the input features to the network and performing a forward pass, which generates an output that you can use as a prediction

Convolutional Neural Network

Flattening the image matrix of pixels into a long vector of pixel values loses all of the spatial structure in the image

[Figure: several images of the letter C, each recognized as 'C', and one variant marked '?']

Does color matter?

No, only the structure matters

Preserve the spatial relationship between pixels by learning internal feature representations using small squares of input data
Features are learned and used across the whole image
- allowing the objects in the images to be shifted or translated in the scene and still be detectable by the network
Advantages of CNNs:
- fewer parameters to learn than a fully connected network
- designed to be invariant to object position and distortion in the scene
- automatically learn and generalize features from the input domain
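The idea of applying one small square of weights across the whole image can be sketched with plain lists. This is a minimal illustration with no padding and a stride of 1; as in most CNN libraries, the operation shown is technically a cross-correlation. The kernel values are a hand-picked vertical-edge detector rather than learned weights.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


kernel = [[-1, 1],
          [-1, 1]]            # responds to a dark-to-bright vertical edge

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

shifted = [[0, 1, 1, 1],
           [0, 1, 1, 1],
           [0, 1, 1, 1],
           [0, 1, 1, 1]]      # the same edge, moved one pixel to the left

print(convolve2d(image, kernel))    # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(convolve2d(shifted, kernel))  # [[2, 0, 0], [2, 0, 0], [2, 0, 0]]
```

The same four weights detect the edge wherever it appears, and the response simply moves with it. This is why a convolutional layer needs far fewer parameters than a fully connected one.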

CNN

Building Blocks of CNNs

Recurrent Neural Network

Sequences

Time-series data:
- e.g. price of a stock over time
Classical feedforward NN:
- define a window size (e.g. 5)
- train the network to learn to make short term predictions from the fixed sized window of inputs
- limitations: how to determine the window size
Different types of sequence problems:
- one-to-many: sequence output, for image captioning
- many-to-one: sequence input, for sentiment classification
- many-to-many: sequence in and out, for machine translation
- synchronized many-to-many: synced sequences in and out, for video classification
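The fixed-size window approach used with a classical feedforward NN can be sketched as a data preparation step. This minimal example turns a series into (window, next value) pairs; the function name and the price values are made up for illustration.

```python
def make_windows(series, window_size):
    """Turn a series into input windows and the value that follows each one."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])   # the fixed-size input window
        y.append(series[i + window_size])     # the value to predict
    return X, y


prices = [10, 11, 12, 13, 14, 15, 16]
X, y = make_windows(prices, window_size=5)
print(X)  # [[10, 11, 12, 13, 14], [11, 12, 13, 14, 15]]
print(y)  # [15, 16]
```

Anything that happened before the start of a window is invisible to the network, which is why the choice of window size is a limitation of this approach.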

RNNs

RNNs are a special type of NN designed for sequence problems
An RNN can be thought of as the addition of loops to the architecture of a standard feedforward NN
- the output of the network may be fed back as an input to the network together with the next input vector, and so on
The recurrent connections add state or memory to the network and allow it to learn broader abstractions from the input sequences
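The effect of a recurrent connection can be sketched with a single recurrent unit. This is a minimal illustration with scalar weights and a tanh activation; the weight values are hypothetical and nothing is trained here.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run one recurrent unit over a sequence and return its state at each step."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


sequence = [1.0, 0.0, 0.0]
with_memory = rnn_forward(sequence, w_x=1.0, w_h=1.0, b=0.0)
without_memory = rnn_forward(sequence, w_x=1.0, w_h=0.0, b=0.0)
print(with_memory)     # the first input still influences the later states
print(without_memory)  # with no recurrent weight, later states fall back to 0.0
```

Setting the recurrent weight to zero turns the unit back into an ordinary feedforward neuron, which makes the contribution of the loop easy to see.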
