Intro to Machine Learning

Lecture 6: Neural Networks

Shen Shen

March 8, 2024

(many slides adapted from Phillip Isola and Tamara Broderick)

Outline

  • Recap and neural networks motivation
  • Neural Networks
    • A single neuron
    • A single layer
    • Many layers
    • Design choices (activation functions, loss functions)
  • Forward pass
  • Backward pass (back-propagation)

e.g. linear regression represented as a computation graph

  • Each data point incurs a loss of \((w^Tx^{(i)} + w_0 - y^{(i)})^2  \)
  • Repeat for each data point, sum up the individual losses
  • Gradient of the total loss gives us the "signal" on how to optimize for \(w, w_0\)
[Figure: computation graph for linear regression — inputs \(x_1, x_2, \dots, x_m\) with learnable parameters (weights) \(w_1, \dots, w_m, w_0\) feed a summation \(\Sigma\) producing \(z = w^T x + w_0\); the activation \(f(z)=z\) gives the output \(g\), which is compared with the label \(y\) in the loss \(\mathcal{L}(\cdot)\); the gradient \(\nabla_{(w, w_0)} \mathcal{L}\) flows back to the weights.]
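As a sanity check on the graph above, here is a minimal numpy sketch (with made-up toy values, not the lecture's own code) of the forward pass and the per-point gradient for linear regression:

```python
import numpy as np

# toy single data point (hypothetical values, just for illustration)
x = np.array([1.0, 2.0, 3.0])    # input, m = 3
y = 2.5                          # regression target
w = np.array([0.1, -0.2, 0.3])   # learnable weights
w0 = 0.05                        # learnable offset

# forward pass: z = w^T x + w0, identity activation, squared loss
z = w @ x + w0
g = z                            # f(z) = z for linear regression
loss = (g - y) ** 2

# gradient of this point's loss w.r.t. (w, w0), by the chain rule
dloss_dg = 2 * (g - y)
grad_w = dloss_dg * x            # since dz/dw = x
grad_w0 = dloss_dg * 1.0         # since dz/dw0 = 1
```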
  • Each data point incurs a loss of \(- \left(y^{(i)} \log g^{(i)}+\left(1-y^{(i)}\right) \log \left(1-g^{(i)}\right)\right)\)
  • Repeat for each data point, sum up the individual losses
  • Gradient of the total loss gives us the "signal" on how to optimize for \(w, w_0\)
[Figure: computation graph for linear logistic regression — the same structure, but with activation \(f(z)=\sigma(z)\), so the output is \(g = \sigma(w^T x + w_0)\); the gradient \(\nabla_{(w, w_0)} \mathcal{L}\) again flows back to the learnable parameters (weights).]

e.g. linear logistic regression (linear classification)  represented as a computation graph
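A matching sketch for the logistic-regression graph, again with hypothetical toy values; it uses the standard fact that, for sigmoid activation plus this NLL loss, the per-point gradient with respect to \(z\) simplifies to \(g - y\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy single data point (hypothetical values)
x = np.array([1.0, 2.0, 3.0])
y = 1.0                          # binary label in {0, 1}
w = np.array([0.1, -0.2, 0.3])
w0 = 0.05

# forward pass: z = w^T x + w0, sigmoid activation, NLL loss
z = w @ x + w0
g = sigmoid(z)
loss = -(y * np.log(g) + (1 - y) * np.log(1 - g))

# backward pass: dloss/dz simplifies to (g - y) for sigmoid + NLL
grad_w = (g - y) * x
grad_w0 = (g - y)
```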

We saw that one way of getting complex input-output behavior is to leverage nonlinear transformations

\phi\left(\left[x_1, x_2\right]^{\top}\right)=\left[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2\right]^{\top}
[Figure: data plotted in the original \((x_1, x_2)\) space, and after the transform, where a linear separator works.]

e.g. use for decision boundary: \(\text{sign}(0+0 x_1+0 x_2+0 x_1^2+4 x_1 x_2+0 x_2^2+0)\)

👆 importantly, linear in \(\phi\), non-linear in \(x\)
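A small sketch of this transform-then-linear-classify idea, with the coefficients taken from the example boundary above (everything else, such as the test points, is made up):

```python
import numpy as np

def phi(x):
    """Polynomial feature transform from the slide: [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

# coefficients matching the slide's example boundary: only the x1*x2 term is nonzero
theta = np.array([0.0, 0.0, 0.0, 0.0, 4.0, 0.0])
theta_0 = 0.0

def predict(x):
    # linear in phi(x), non-linear in x
    return np.sign(theta @ phi(x) + theta_0)

print(predict([1.0, 1.0]), predict([-1.0, 1.0]))   # +1.0, -1.0
```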

Today (2nd cool idea): "stacking" helps too!

\begin{aligned} & z=w^T x \\ & y=\text{sign}(z) \end{aligned}
[Figure: two such units, with weights \(W_1\), produce \(a_1, a_2\); a third unit with weights \(W_2\) combines them into \(z_3\) and the output \(y\).]
\begin{aligned} \mathbf{z} & = \mathbf{x}^T \mathbf{W}_1\\ \mathbf{a} & =\text{sign}(\mathbf{z}) \\ z_3 & = \mathbf{a}^T \mathbf{W}_2 \\ y & =\text{sign}\left(z_3\right) \end{aligned}
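A minimal sketch of the stacked model's forward pass, with hypothetical weights \(W_1, W_2\) chosen only to make the wiring concrete:

```python
import numpy as np

# hypothetical values, just to make the wiring concrete
x = np.array([1.0, -2.0])        # input (one data point)
W1 = np.array([[ 1.0, 0.5],
               [-1.0, 2.0]])     # maps 2 inputs to 2 hidden units
W2 = np.array([1.0, -1.0])       # maps 2 hidden units to z3

# first "layer": linear combination followed by sign
z = x @ W1                       # z = x^T W1
a = np.sign(z)                   # a = sign(z), applied element-wise

# second "layer": another linear combination followed by sign
z3 = a @ W2
y = np.sign(z3)
```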

So, two epiphanies:  

  • nonlinearity empowers linear tools
  • stacking helps

 

(👋 heads-up:  all neural network graphs focus on a single data point for simple illustration.)

Outline

  • Recap and neural networks motivation
  • Neural Networks
    • A single neuron
    • A single layer
    • Many layers
    • Design choices (activation functions, loss functions)
  • Forward pass
  • Backward pass (back-propagation)

A single neuron is

  • the basic operating "unit" in a neural network. 
  • the basic "node" when a neural network is viewed as a computational graph.

  • a neuron is a function that maps a vector input \(x \in \mathbb{R}^m\) to a scalar output
  • inside the neuron, circles do function evaluation/computation
  • \(x\): \(m\)-dimensional input (a single data point)
  • \(w\): weights (i.e. learnable parameters)
  • \(z\): pre-activation scalar output
  • \(f\): activation function (we engineers choose)
  • \(a\): post-activation scalar output
[Figure: a single neuron — inputs \(x_1, x_2, \dots, x_m\) with weights \(w_1, \dots, w_m, w_0\) feed a summation \(\Sigma\) producing \(z\), which passes through \(f(\cdot)\) to give \(a\).]
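A tiny sketch of a single neuron as a function, assuming numpy and hypothetical values for \(x, w, w_0\):

```python
import numpy as np

def neuron(x, w, w0, f):
    """One neuron: weighted sum (pre-activation z), then activation f (output a)."""
    z = w @ x + w0
    a = f(z)
    return a

# e.g. a ReLU neuron on a 3-dimensional input (hypothetical values)
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 2.0, -0.5])
a = neuron(x, w, w0=0.1, f=lambda z: max(0.0, z))
```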

A single layer is

  • made of many individual neurons.
  • (# of neurons) = (layer output dimension).
  • typically, all neurons in one layer use the same activation \(f\) (otherwise, the algebra gets messier)
  • typically, no "cross-wire" between neurons. e.g. \(z_1\) doesn't influence \(a_2\). in other words, a layer has the same activation applied element-wise. (softmax is an exception to this, details later.)
  • typically, fully connected. i.e. there's an edge connecting \(x_i\) to \(z_j,\) for all \(i \in \{1,2,3, \dots , m\}; j \in \{1,2,\dots, n\}\). in other words, all \(x_i\) influence all \(a_j.\)
[Figure: a single layer — inputs \(x_1, x_2, \dots, x_m\), fully connected through the learnable weights \(W\), produce pre-activations \(z_1, z_2, \dots, z_n\); each passes through the same activation \(f_1(\cdot)\) to give the layer's outputs.]
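A sketch of one fully connected layer under these conventions (same activation for every neuron, applied element-wise); the sizes and values are hypothetical:

```python
import numpy as np

def layer(x, W, b, f):
    """One fully connected layer: n neurons sharing the activation f.

    x: (m,) input, W: (m, n) weights, b: (n,) offsets.
    Returns the (n,) post-activation output a.
    """
    z = W.T @ x + b       # all n pre-activations at once
    a = f(z)              # same activation applied element-wise
    return a

# e.g. a layer mapping 3 inputs to 4 outputs with ReLU (hypothetical values)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(3, 4))
b = np.zeros(4)
a = layer(x, W, b, f=lambda z: np.maximum(0.0, z))
```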
A (feed-forward) neural network is

  • many such layers stacked together: alternating blocks of linear combinations (with learnable weights \(W_1, W_2, \dots\)) and element-wise activations (\(f_1, f_2, \dots\)), applied to the input layer by layer.

[Figure: a feed-forward network — input \(x_1, \dots, x_m\) → linear combo with learnable weights \(W_1\) → activations \(f_1\) → linear combo with learnable weights \(W_2\) → activations \(f_2\) → \(\dots\)]
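A sketch of a full feed-forward pass, stacking such layers by alternating linear combinations and element-wise activations; layer sizes and values are again hypothetical:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, offsets, activations):
    """Feed-forward pass: alternate linear combinations and element-wise activations."""
    a = x
    for W, b, f in zip(weights, offsets, activations):
        z = W.T @ a + b    # linear combo with this layer's learnable weights
        a = f(z)           # fixed non-linear transform
    return a

# e.g. a 3 -> 4 -> 2 network (hypothetical sizes and values)
rng = np.random.default_rng(0)
x = rng.normal(size=3)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
offsets = [np.zeros(4), np.zeros(2)]
output = forward(x, weights, offsets, activations=[relu, relu])
```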

Activation function \(f\) choices

\(\sigma\) used to be popular

  • interpretation: the firing rate of a neuron
  • \(\sigma^{\prime}(z)=\sigma(z) \cdot(1-\sigma(z))\)

ReLU is the de-facto activation choice nowadays

\operatorname{ReLU}(z)=\left\{\begin{array}{ll} 0 & \text { if } z<0 \\ z & \text { otherwise } \end{array}\right. =\max (0, z)

  • Default choice in hidden layers.
  • Pro: very efficient to implement; we choose to let the gradient be:

\frac{\partial \operatorname{ReLU}(z)}{\partial z}:=\left\{\begin{array}{ll} 0, & \text { if } z<0 \\ 1, & \text { otherwise } \end{array}\right.

  • Drawback: if strongly in the negative region, a unit can be "dead" (no gradient).
  • Inspired variants like ELU and leaky ReLU.
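For concreteness, here is a sketch of both activations and the derivatives quoted above (including the chosen ReLU subgradient), in numpy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # sigma'(z) = sigma(z)(1 - sigma(z))

def relu(z):
    return np.maximum(0.0, z)          # max(0, z)

def relu_grad(z):
    return np.where(z < 0, 0.0, 1.0)   # the chosen subgradient: 0 if z < 0, else 1
```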

The last layer, the output layer, is special

 

  • activation and loss depend on the problem at hand
  • we've seen e.g. regression (one unit in the last layer, squared loss).

More complicated example: predict one class out of \(K\) possibilities (e.g., say \(K=5\) classes)

then the last layer has \(K\) neurons and softmax activation, and the loss is the negative log-likelihood:

\mathcal{L}_{\text{nllm}}(g, y)=-\sum_{k=1}^{K} y_k \cdot \log \left(g_k\right)
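A sketch of the last layer's softmax and the \(\mathcal{L}_{\text{nllm}}\) loss for a one-hot label; the softmax formula \(g_k = e^{z_k} / \sum_j e^{z_j}\) is the standard definition (not spelled out on the slide), and the example values are made up:

```python
import numpy as np

def softmax(z):
    """Softmax over the K output units: g_k = exp(z_k) / sum_j exp(z_j)."""
    e = np.exp(z - z.max())       # shift for numerical stability
    return e / e.sum()

def nll_loss(g, y):
    """Negative log-likelihood with one-hot label y: -sum_k y_k log(g_k)."""
    return -np.sum(y * np.log(g))

# e.g. K = 5 classes (hypothetical pre-activations), true class is index 2
z = np.array([1.0, -0.5, 2.0, 0.0, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
loss = nll_loss(softmax(z), y)
```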

Outline

  • Recap and neural networks motivation
  • Neural Networks
    • A single neuron
    • A single layer
    • Many layers
    • Design choices (activation functions, loss functions)
  • Forward pass
  • Backward pass (back-propagation)
How do we optimize

\(J(\mathbf{W})=\sum_{i=1}^{n} \mathcal{L}\left(f_L\left(\ldots f_2\left(f_1\left(\mathbf{x}^{(i)}, \mathbf{W}_1\right), \mathbf{W}_2\right), \ldots \mathbf{W}_L\right), \mathbf{y}^{(i)}\right)\) though?

Backprop = gradient descent & the chain rule

Recall that the chain rule says:

For the composed function: \(h(\mathbf{x})=f(g(\mathbf{x})), \) its derivative is: \(h^{\prime}(\mathbf{x})=f^{\prime}(g(\mathbf{x})) g^{\prime}(\mathbf{x})\)

Here, our loss depends on the final output, and the final output \(A^L\) comes from a chain of composed functions.
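As a minimal illustration (not the lecture's own derivation), the sketch below runs a forward pass through a hypothetical two-layer network and then applies the chain rule layer by layer to get the gradients of one data point's loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical two-layer network on one data point: ReLU hidden layer, sigmoid output, NLL loss
rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input A^0
y = 1.0                           # binary label
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

# forward pass (keep the intermediate values; backprop reuses them)
z1 = W1.T @ x + b1
a1 = np.maximum(0.0, z1)          # A^1
z2 = W2.T @ a1 + b2
g = sigmoid(z2)                   # A^2, the final output
loss = -(y * np.log(g) + (1 - y) * np.log(1 - g))

# backward pass: apply the chain rule layer by layer, from the loss back to the weights
dz2 = g - y                       # dLoss/dz2 for sigmoid + NLL
dW2 = np.outer(a1, dz2)           # dLoss/dW2
db2 = dz2
da1 = W2 @ dz2                    # push the signal back through W2
dz1 = da1 * (z1 > 0)              # through the ReLU (zero gradient where z1 < 0)
dW1 = np.outer(x, dz1)            # dLoss/dW1
db1 = dz1
```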


Summary

  • We saw last week that introducing non-linear transformations of the inputs can substantially increase the power of linear regression and classification hypotheses.
  • We also saw that it’s kind of difficult to select a good transformation by hand.
  • Multi-layer neural networks are a way to make (S)GD find good transformations for us!
  • Fundamental idea is easy:  specify a hypothesis class and loss function so that d Loss / d theta is well behaved, then do gradient descent.
  • Standard feed-forward NNs (sometimes called multi-layer perceptrons, which is actually kind of a misnomer) are organized into layers that alternate between parametrized linear transformations and fixed non-linear transforms (but many other designs are possible!).
  • Typical non-linearities include sigmoid, tanh, and ReLU, but mostly people use ReLU.
  • Typical output transformations for classification are as we have seen: sigmoid and/or softmax
  • There’s a systematic way to compute d Loss / d theta via backpropagation

Thanks!

We'd love for you to share some lecture feedback.
