Andrew Beam, PhD

Department of Epidemiology

HSPH

twitter: @AndrewLBeam

Let's say we'd like to have a single neuron learn a function (here, the logical OR of two binary inputs):

[Figure: a single neuron with inputs X_1 and X_2, weights w_1 and w_2, bias b, and output y]

Observations:

X1 | X2 | y |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 1 |

How do we make a prediction for each observation?


Assume we have the following values:

w1 | w2 | b |
---|---|---|
1 | -1 | -0.5 |

For the first observation: X_1 = 0, X_2 = 0, y = 0


First compute the weighted sum:

h = w_1*X_1 + w_2*X_2 + b

h = 1*0 + (-1)*0 + (-0.5) = -0.5


Transform to probability:

p = \frac{1}{1+\exp(-h)}

p = \frac{1}{1+\exp(0.5)}

p \approx 0.38


Round to get prediction:

\hat{y} = round(p)

\hat{y} = 0
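In code, the three steps for this observation look like the following minimal sketch (plain Python; the variable names are my choices, not from the slides):

```python
import math

# Forward pass for the first observation (X_1 = 0, X_2 = 0, y = 0).
w1, w2, b = 1.0, -1.0, -0.5
x1, x2 = 0.0, 0.0

h = w1 * x1 + w2 * x2 + b         # weighted sum: -0.5
p = 1.0 / (1.0 + math.exp(-h))    # sigmoid transform: about 0.38
y_hat = round(p)                  # rounded prediction: 0

print(h, p, y_hat)
```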

Putting it all together:

h = w_1*X_1 + w_2*X_2 + b

p = \frac{1}{1+\exp(-h)}

\hat{y} = round(p)

Assume we have the following values:

w1 | w2 | b |
---|---|---|
1 | -1 | -0.5 |

Fill out this table:

X1 | X2 | y | h | p | \hat{y} |
---|---|---|---|---|---|
0 | 0 | 0 | -0.5 | 0.38 | 0 |
0 | 1 | 1 | | | |
1 | 0 | 1 | | | |
1 | 1 | 1 | | | |

The completed table:

X1 | X2 | y | h | p | \hat{y} |
---|---|---|---|---|---|
0 | 0 | 0 | -0.5 | 0.38 | 0 |
0 | 1 | 1 | -1.5 | 0.18 | 0 |
1 | 0 | 1 | 0.5 | 0.62 | 1 |
1 | 1 | 1 | -0.5 | 0.38 | 0 |
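The same computation over all four observations, as a sketch (the list-of-tuples data layout is my assumption):

```python
import math

# Reproduce the completed table: forward pass for every observation.
w1, w2, b = 1.0, -1.0, -0.5
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]  # (X1, X2, y)

for x1, x2, y in data:
    h = w1 * x1 + w2 * x2 + b
    p = 1.0 / (1.0 + math.exp(-h))
    y_hat = round(p)
    print(f"X1={x1} X2={x2} y={y} | h={h:+.1f} p={p:.2f} yhat={y_hat}")
```

The network gets only two of the four observations right (the first and third rows), which sets up the next question.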

Our neural net isn't so great... how do we make it better?

What do I even mean by better?

Let's define how we want to *measure* the network's performance.

There are many ways, but let's use *squared-error*:

(y - p)^2

Now we need to find values for w_1, w_2, b that make this error as small as possible. In other words, our task is *learning* values for w_1, w_2, b such that the difference between the predicted and actual values is as small as possible.
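To see how far off the current weights are, here is a quick check of the total squared error over the table above (a sketch; summing the per-observation errors is my choice, matching the sums used in the updates later):

```python
import math

# Total squared error of the current weights on the four observations.
w1, w2, b = 1.0, -1.0, -0.5
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]

loss = sum(
    (y - 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))) ** 2
    for x1, x2, y in data
)
print(loss)  # roughly 1.34 -- lots of room to improve
```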

So, how do we find the "best" values for w_1, w_2, b?

Recall (without PTSD) that the *derivative* of a function tells you how it is changing at any given location.

If the derivative is positive, it means it's going up.

If the derivative is negative, it means it's going down.

Simple strategy:

- Start with initial values for w_1, w_2, b
- Take partial derivatives of the loss function with respect to w_1, w_2, b
- Subtract the derivative (also called the *gradient*) from each


To the whiteboard!
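For reference, a sketch of the chain-rule derivation from the whiteboard (my reconstruction). With L = (y - p)^2, p = \frac{1}{1+\exp(-h)}, and h = w_1*X_1 + w_2*X_2 + b, and using \frac{\partial p}{\partial h} = p(1-p):

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial h} \cdot \frac{\partial h}{\partial w_1} = 2(p - y) \cdot p(1-p) \cdot X_1

The gradients below drop the constant factor of 2 (equivalent to minimizing \frac{1}{2}(y-p)^2, which doesn't move the minimum); the same pattern gives the gradient for w_2 (replace X_1 with X_2) and for b (where \frac{\partial h}{\partial b} = 1).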

Gradient for w_1: gw_1 = (p - y)*p*(1-p)*X_1

Gradient for w_2: gw_2 = (p - y)*p*(1-p)*X_2

Gradient for b: g_b = (p - y)*p*(1-p)

Update for w_1: w^{new}_1 = w^{old}_1 - \sum gw_1

Update for w_2: w^{new}_2 = w^{old}_2 - \sum gw_2

Update for b: b^{new} = b^{old} - \sum g_b

(each sum runs over all of the observations)
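Putting the gradients and updates into a minimal training loop (a sketch: the learning rate `lr` and the iteration count are my additions; the slides subtract the raw summed gradient):

```python
import math

# Full-batch gradient descent for the single neuron on the OR data.
w1, w2, b = 1.0, -1.0, -0.5
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
lr = 1.0  # learning rate (an assumption; not specified in the slides)

for step in range(1000):
    gw1 = gw2 = gb = 0.0
    for x1, x2, y in data:  # sum the gradients over the observations
        p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
        common = (p - y) * p * (1 - p)
        gw1 += common * x1
        gw2 += common * x2
        gb += common
    w1 -= lr * gw1  # the update rules above
    w2 -= lr * gw2
    b -= lr * gb

for x1, x2, y in data:
    p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
    print(x1, x2, y, round(p))  # rounded predictions should now match y
```

Since OR is linearly separable, the loss can be driven close to zero, and after enough iterations all four rounded predictions should match the targets.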