Your first Deep Neural Network
What did we see in the previous chapter?
(c) One Fourth Labs
Repeat slide 5.1 from the previous lecture
What's going to change now ?
(c) One Fourth Labs
Loss
Model
Data
Task
Evaluation
Learning
Real inputs
Non-linear
Task specific loss functions
Real outputs
Back-propagation
What kind of data and tasks have DNNs been used for ?
(c) One Fourth Labs
28x28 Images
Each 28x28 image is a grid of pixel intensities between 0 and 255 (excerpt shown):

255 | 183 | 95 | 8 | 93 | 196 | 253 |
254 | 154 | 37 | 7 | 28 | 172 | 254 |
252 | 221 | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | 198 | 253 |
252 | 250 | 187 | 178 | 195 | 253 | 253 |
How can we represent MNIST images as a vector ?
Raw pixel intensities (excerpt):

255 | 183 | 95 | 8 | 93 | 196 | 253 |
254 | 154 | 37 | 7 | 28 | 172 | 254 |
252 | 221 | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | 198 | 253 |
252 | 250 | 187 | 178 | 195 | 253 | 253 |

Divide every entry by 255 so that each pixel lies in [0, 1]:

1.00 | 0.72 | 0.37 | 0.03 | 0.36 | 0.77 | 0.99 |
1.00 | 0.60 | 0.14 | 0.03 | 0.11 | 0.67 | 1.00 |
0.99 | 0.87 | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | 0.78 | 0.99 |
0.99 | 0.98 | 0.73 | 0.69 | 0.76 | 0.99 | 0.99 |
What kind of data and tasks have DNNs been used for ?
(c) One Fourth Labs
28x28 Images
How can we represent MNIST images as a vector ?
\( \left[\begin{array}{lcr} 1.00, 0.72, 0.37 \dots, 0.76, 0.99, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 1.00, 0.85, 0.73 \dots, 0.68, 1.00, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 1.00, 0.76, 0.64 \dots, 0.86, 0.99, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.99, 0.82, 0.26 \dots, 0.53, 0.87, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.73, 0.81, 0.87 \dots, 0.76, 0.79, 0.67 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.84, 0.72, 0.31 \dots, 0.26, 0.51, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 1.00, 1.00, 0.96 \dots, 0.88, 0.79, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.33, 0.52, 0.47 \dots, 0.76, 0.95, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.85, 0.72, 0.97 \dots, 0.86, 0.94, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.84, 0.92, 0.28 \dots, 0.76, 1.0, 0.99 \end{array} \right]\)
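A minimal NumPy sketch of this flattening and scaling; `image` below is only a stand-in for one 28x28 MNIST digit:

import numpy as np

image = np.random.randint(0, 256, size=(28, 28))   # stand-in for a 28x28 grayscale digit
x = image.reshape(784) / 255.0                     # flatten to a 784-dimensional vector in [0, 1]
print(x.shape)                                     # (784,)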
What kind of data and tasks have DNNs been used for ?
(c) One Fourth Labs
28x28 Images
\( \left[\begin{array}{lcr} 1.00, 0.72, 0.37 \dots, 0.76, 0.99, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 1.00, 0.85, 0.73 \dots, 0.68, 1.00, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 1.00, 0.76, 0.64 \dots, 0.86, 0.99, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.99, 0.82, 0.26 \dots, 0.53, 0.87, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.73, 0.81, 0.87 \dots, 0.76, 0.79, 0.67 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.84, 0.72, 0.31 \dots, 0.26, 0.51, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 1.00, 1.00, 0.96 \dots, 0.88, 0.79, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.33, 0.52, 0.47 \dots, 0.76, 0.95, 1.00 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.85, 0.72, 0.97 \dots, 0.86, 0.94, 0.99 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0.84, 0.92, 0.28 \dots, 0.76, 1.00, 0.99 \end{array} \right]\)
Class Label
0
1
2
3
4
5
6
7
8
9
Class labels can be represented as one-hot vectors
Class Labels - One-hot Representation
\( \left[\begin{array}{lcr} 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 0, 1, 0, 0, 0, 0, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 0, 0, 1, 0, 0, 0, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 0, 0, 0, 1, 0, 0, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 0, 0, 0, 0, 0, 1, 0, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 0, 0, 0, 0, 0, 0, 1, 0 \end{array} \right]\)
\( \left[\begin{array}{lcr} 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 \end{array} \right]\)
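A small NumPy sketch of this one-hot encoding; the label values here are illustrative:

import numpy as np

num_classes = 10
labels = np.array([0, 1, 9])              # a few example class labels
one_hot = np.eye(num_classes)[labels]     # row i is the one-hot vector for labels[i]
print(one_hot[0])                         # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]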
What kind of data and tasks have DNNs been used for ?
(c) One Fourth Labs
- Now have two more slides on other Kaggle tasks for which DNNs have been tried (preferably, some non-image tasks and at least one regression task. You could also repeat the churn prediction task from before)
- Finally have 1 slide on our task which is multi character classification
- Same layout and animations repeated from the previous slide only data changes
- Show MNIST dataset sample on LHS
- Show by animation how you will flatten each image and convert it to a vector (of course you cannot show the full 784-dimensional vector)
What kind of data and tasks have DNNs been used for ?
(c) One Fourth Labs
Indian Liver Patient Records \(^{*}\)
- Predict whether the person needs to be diagnosed for liver disease or not
Age | Albumin | T_Bilirubin | D |
---|---|---|---|
65 | 3.3 | 0.7 | 0 |
62 | 3.2 | 10.9 | 0 |
20 | 4 | 1.1 | 1 |
84 | 3.2 | 0.7 | 1 |
\( \hat{y} = \hat{f}(x_1, x_2, \dots, x_N) \)
\( \hat{D} = \hat{f}(Age, Albumin, T\_Bilirubin, \dots) \)
What kind of data and tasks have DNNs been used for ?
(c) One Fourth Labs
Boston Housing\(^{*}\)
- Predict Housing Values in Suburbs of Boston
Crime | Avg no of rooms | Age | House Value |
---|---|---|---|
0.00632 | 6.575 | 65.2 | 24 |
0.02731 | 6.421 | 78.9 | 21.6 |
0.3237 | 6.998 | 45.8 | 33.4 |
0.6905 | 7.147 | 54.2 | 36.2 |
\( \hat{y} = \hat{f}(x_1, x_2, \dots, x_N) \)
\( \hat{y} = \hat{f}(Crime, Avg\ no\ of\ rooms, Age, \dots) \)
How to build complex functions using Deep Neural Networks?
(c) One Fourth Labs
[Figure: phones plotted by screen size (\(x_1\), e.g. 3.5 and 4.5 inches) and cost (\(x_2\), e.g. 8k and 12k); a single sigmoid neuron takes \(x_1\), \(x_2\) with weights \(w_1\), \(w_2\) and bias \(b\) and produces \(\hat{y}\)]

\( \hat{y} = f(x_1, x_2) \)

\( \hat{y} = \frac{1}{1+e^{-(w_1 x_1 + w_2 x_2 + b)}} \)
How to build complex functions using Deep Neural Networks?
(c) One Fourth Labs
[Figure: the same plot, now with one hidden sigmoid neuron \(h\) between the inputs \(x_1, x_2\) (weights \(w_{11}, w_{12}\), bias \(b_1\)) and the output neuron (weight \(w_{21}\), bias \(b_2\))]

\( h = f(x_1, x_2) = \frac{1}{1+e^{-(w_{11} x_1 + w_{12} x_2 + b_1)}} \)

\( \hat{y} = g(h) = g(f(x_1, x_2)) \)

\( \hat{y} = \frac{1}{1+e^{-(w_{21} h + b_2)}} \)
How to build complex functions using Deep Neural Networks?
(c) One Fourth Labs
[Figure: two hidden sigmoid neurons \(h_1, h_2\) (input weights \(w_{11}, w_{12}, w_{13}, w_{14}\), bias \(b_1\)) feeding one output sigmoid neuron (weights \(w_{21}, w_{22}\), bias \(b_2\))]

\( h_1 = f_1(x_1, x_2) = \frac{1}{1+e^{-(w_{11} x_1 + w_{12} x_2 + b_1)}} \)

\( h_2 = f_2(x_1, x_2) = \frac{1}{1+e^{-(w_{13} x_1 + w_{14} x_2 + b_1)}} \)

\( \hat{y} = g(h_1, h_2) \)

\( \hat{y} = \frac{1}{1+e^{-(w_{21} h_1 + w_{22} h_2 + b_2)}} \)
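A minimal NumPy sketch of this 2-2-1 network; the weight and bias values below are made up purely for illustration:

import numpy as np

def sigmoid(a):
    return 1.0/(1.0 + np.exp(-a))

x1, x2 = 4.5, 8.0                          # screen size and cost of one phone (illustrative)
w11, w12, w13, w14 = 0.10, -0.20, 0.30, 0.05
w21, w22 = 0.50, -0.40
b1, b2 = 0.10, -0.20

h1 = sigmoid(w11*x1 + w12*x2 + b1)         # first hidden neuron
h2 = sigmoid(w13*x1 + w14*x2 + b1)         # second hidden neuron (shares b1, as on the slide)
y_hat = sigmoid(w21*h1 + w22*h2 + b2)      # output neuron
print(y_hat)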
Can we clarify the terminology a bit ?
(c) One Fourth Labs
\(h_{3} = \hat{y} = f(x) \)
\( a_i(x) = W_ih_{i-1}(x) + b_i \)
\( h_i(x) = g(a_i(x)) \)
\( f(x) = h_L = O(a_L) \)
where 'g' is called the activation function
where 'O' is called the output activation function
(c) One Fourth Labs
\(h_{L} = \hat{y} = f(x) \)
\(\hat{y} = f(x) = O(W_3g(W_2g(W_1x + b_1) + b_2) + b_3)\)
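A direct NumPy transcription of this nested expression for L = 3; the shapes and values below are illustrative, with g the sigmoid and O the identity (as used later for regression):

import numpy as np

def g(a):                                  # activation function (sigmoid)
    return 1.0/(1.0 + np.exp(-a))

def O(a):                                  # output activation function (identity here)
    return a

x = np.array([0.5, 1.5])
W1, b1 = np.ones((3, 2)), np.zeros(3)
W2, b2 = np.ones((3, 3)), np.zeros(3)
W3, b3 = np.ones((1, 3)), np.zeros(1)

y_hat = O(W3 @ g(W2 @ g(W1 @ x + b1) + b2) + b3)
print(y_hat)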
How do we decide the output layer ?
(c) One Fourth Labs
- On RHS show the imdb example from my lectures
- ON LHS show the apple example from my lecture
- Below LHS example, pictorially show other examples of regression from Kaggle
- Below RHS example, pictorially show other examples of classification from Kaggle
- Finally show that in our contest also we need to do regression (bounding box predict x,y,w,h) and classification (character recognition)
How do we decide the output layer ?
(c) One Fourth Labs
isActor
Damon
. . .
isDirector
Nolan
. . . .
\(x_i\)
imdb
Rating
critics
Rating
RT
Rating
\(y_i\) = { 8.8 7.3 8.1 846,320 }
\(y_i\) = { 1 0 0 0 }
Apple
Banana
Orange
Grape
Box Office
Collection
What is the output layer for regression problems ?
(c) One Fourth Labs
import numpy as np

x = [x1, x2, x3, x4, x5]           # a 5-dimensional input

def sigmoid(a):
    return 1.0/(1.0 + np.exp(-a))

def output_layer(a):               # identity: for regression the output is the pre-activation itself
    return a

def forward_propagation(x):
    L = 3                          # total number of layers
    W = {...}                      # assume weights are learnt
    b = {...}                      # assume biases are learnt
    a, h = {}, {}
    a[1] = W[1]*x + b[1]
    for i in range(1, L):
        h[i] = sigmoid(a[i])
        a[i+1] = W[i+1]*h[i] + b[i+1]
    return output_layer(a[L])
What is the output layer for classification problems ?
(c) One Fourth Labs
Apple
Banana
Orange
Grape
True Output : \(y\) = { 1, 0, 0, 0 }
Predicted Output : \(\hat{y}\) = { 0.64, 0.03, 0.26, 0.07 }
What kind of output activation function should we use?
What is the output layer for classification problems ?
Apple
Banana
Orange
Grape
\(a_1 = W_1*x\)
What is the output layer for classification problems ?
\(a_1 = W_1*x\)
\(h_{11} = g(a_{11})\)
\(h_{12} = g(a_{12})\)
\(h_{1\ 10} = g(a_{1\ 10})\)
. . . .
Apple
Banana
Orange
Grape
\(h_1 = g(a_1)\)
What is the output layer for classification problems ?
(c) One Fourth Labs
Apple
Banana
Orange
Grape
\(a_2 = W_2*h_1\)
What is the output layer for classification problems ?
(c) One Fourth Labs
\(a_2 = W_2*h_1\)
\(h_{21} = g(a_{21})\)
\(h_{22} = g(a_{22})\)
\(h_{2\ 10} = g(a_{2\ 10})\)
. . . .
Apple
Banana
Orange
Grape
\(h_2 = g(a_2)\)
What is the output layer for classification problems ?
Apple
Banana
Orange
Grape
\(a_3 = W_3*h_2\)
What is the output layer for classification problems ?
\(a_3 = W_3*h_2\)
\(\hat{y}_{1} = O(a_{31})\)
\(\hat{y}_{2} = O(a_{32})\)
\(\hat{y}_{4} = O(a_{34})\)
\(\hat{y}_{3} = O(a_{33})\)
Apple
Banana
Orange
Grape
What is the output layer for classification problems ?
(c) One Fourth Labs
One option: take each entry and divide by the sum of all the entries.
Instead, we will use the softmax function: exponentiate each entry and then divide by the sum of the exponentials.
Apple
Banana
Orange
Grape
What is the output layer for classification problems ?
(c) One Fourth Labs
What is the output layer for classification problems ?
(c) One Fourth Labs
\(h = [ h_{1}\ h_{2}\ h_{3}\ h_{4} ]\)
\(softmax(h)_{j} = \frac{e^{h_{j}}}{\sum_{k=1}^{4} e^{h_{k}}}, \quad j = 1, \dots, 4\)
Apple
Banana
Orange
Grape
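A minimal NumPy sketch of the softmax function; subtracting the maximum is a standard numerical-stability trick, not something the slides require:

import numpy as np

def softmax(h):
    e = np.exp(h - np.max(h))     # exponentiate each entry (shifted for numerical stability)
    return e / np.sum(e)          # divide by the sum so the entries add up to 1

print(softmax(np.array([1.66, 0.45])))    # ~[0.77, 0.23], the values used in a later worked example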
What is the output layer for classification problems ?
(c) One Fourth Labs
Apple
Banana
Orange
Grape
\(a_2 = W_2*h_1\)
\(h_2 = g(a_2)\)
\(a_1 = W_1*x\)
\(a_3 = W_3*h_2\)
\(h_1 = g(a_1)\)
\(\hat{y} = softmax(a_3)\)
What is the output layer for regression problems ?
isActor
Damon
. . .
isDirector
Nolan
. . . .
\(x_i\)
Box Office
Collection
True Output : \(y\) = $ 15,032,493.29
Predicted Output : \(\hat{y}\) = $ 10,517,330.07
What kind of output function should we use?
\(a_1 = W_1*x\)
What is the output layer for regression problems ?
Box Office
Collection
isActor
Damon
. . .
isDirector
Nolan
. . .
\(x_i\)
\(a_1 = W_1*x\)
\(h_{11} = g(a_{11})\)
\(h_{12} = g(a_{12})\)
\(h_{1\ 5} = g(a_{1\ 5})\)
. . . .
\(h_1 = g(a_1)\)
Box Office
Collection
isActor
Damon
. . .
isDirector
Nolan
. . .
\(x_i\)
What is the output layer for regression problems ?
\(a_2 = W_2*h_1\)
Box Office
Collection
isActor
Damon
. . .
isDirector
Nolan
. . .
\(x_i\)
What is the output layer for regression problems ?
\(a_2 = W_2*h_1\)
\(h_{21} = g(a_{21})\)
\(h_{22} = g(a_{22})\)
\(h_{2\ 5} = g(a_{2\ 5})\)
. . . .
\(h_2 = g(a_2)\)
Box Office
Collection
isActor
Damon
. . .
isDirector
Nolan
. . .
\(x_i\)
What is the output layer for regression problems ?
\(a_3 = W_3*h_2\)
Box Office
Collection
isActor
Damon
. . .
isDirector
Nolan
. . .
\(x_i\)
\(\hat{y} = O(a_{3})\)
What is the output layer for regression problems ?
Can we use the sigmoid function ?
No. The sigmoid squashes the output into (0, 1), but the target here can be any real number.
What is the output layer for regression problems ?
Can we use the softmax function ?
No. The softmax produces a probability distribution over classes, not an arbitrary real value.
Can we use the real-valued pre-activation as it is ?
Yes, it is a real number after all
What happens if we get a negative output ?
Should we not normalize it ?
Box Office
Collection
isActor
Damon
. . .
isDirector
Nolan
. . .
\(x_i\)
(c) One Fourth Labs
\(a_2 = W_2*h_1\)
\(h_2 = g(a_2)\)
\(a_1 = W_1*x\)
\(a_3 = W_3*h_2\)
\(h_1 = g(a_1)\)
\(\hat{y} = a_3\)
What is the output layer for regression problems ?
Box Office
Collection
isActor
Damon
. . .
isDirector
Nolan
. . .
\(x_i\)
Can we see the model in action?
(c) One Fourth Labs
1) We will show the demo which Ganga is preparing
In practice how would you deal with extreme non-linearity ?
(c) One Fourth Labs
In practice how would you deal with extreme non-linearity ?
(c) One Fourth Labs
\(Model\)
\(Loss\)
Why is Deep Learning also called Deep Representation Learning ?
(c) One Fourth Labs
Apple
Banana
Orange
Grape
What is the loss function that you use for a regression problem ?
(c) One Fourth Labs
Size in feet^2 | No of bedrooms | House Rent (Rupees) in 1000's |
---|---|---|
850 | 2 | 12 |
1100 | 2 | 14 |
1000 | 3 | 15 |
.... | .... | .... |
\(h_{2} = \hat{y} = f(x) \)
\(a_1 = W_1*x + b_1 = [ 0.67 -0.415 ]\)
\(h_1 = sigmoid(a_1) = [ 0.66 0.40 ]\)
\(a_2 = W_2*h_1 + b_2 = 11.5 \)
\(h_2 = a_2 = 11.5\)
Output :
Squared Error Loss :
\(L(\Theta) = (11.5 - 12)^2\)
\(= (0.25)\)
What is the loss function that you use for a regression problem ?
(c) One Fourth Labs
Size in feet^2 | No of bedrooms | House Rent (Rupees) in 1000's |
---|---|---|
850 | 2 | 12 |
1100 | 2 | 14 |
1000 | 3 | 15 |
.... | .... | .... |
\(h_{2} = \hat{y} = f(x) \)
\(a_1 = W_1*x + b_1 = [ 0.72 -0.39 ]\)
\(h_1 = sigmoid(a_1) = [ 0.67 0.40 ]\)
\(a_2 = W_2*h_1 + b_2 = 11.6 \)
\(h_2 = a_2 = 11.6\)
Output :
Squared Error Loss :
\(L(\Theta) = (11.6 - 14)^2\)
\(= (5.76)\)
What is the loss function that you use for a regression problem ?
(c) One Fourth Labs
Size in feet^2 | No of bedrooms | House Rent (Rupees) in 1000's |
---|---|---|
850 | 2 | 12 |
1100 | 2 | 14 |
1000 | 3 | 15 |
.... | .... | .... |
\(h_{2} = \hat{y} = f(x) \)
\(a_1 = W_1*x + b_1 = [ 0.95 -0.65 ]\)
\(h_1 = sigmoid(a_1) = [ 0.72 0.34 ]\)
\(a_2 = W_2*h_1 + b_2 = 11.5 \)
\(h_2 = a_2 = 11.5\)
Output :
Squared Error Loss :
\(L(\Theta) = (11.5 - 15)^2\)
\(= (12.25)\)
What is the loss function that you use for a regression problem ?
(c) One Fourth Labs
Size in feet^2 | No of bedrooms | House Rent (Rupees) in 1000's |
---|---|---|
850 | 2 | 12 |
1100 | 2 | 14 |
1000 | 3 | 15 |
.... | .... | .... |
\(h_{2} = \hat{y} = f(x) \)
import numpy as np

X = [X1, X2, X3, X4, ..., XN]  # N 'd'-dimensional data points
Y = [y1, y2, y3, y4, ..., yN]

def sigmoid(a):
    return 1.0/(1.0 + np.exp(-a))

def output_layer(a):           # identity output for regression
    return a

def forward_propagation(x):
    L = 3                      # total number of layers
    W = {...}                  # assume weights are learnt
    b = {...}                  # assume biases are learnt
    a, h = {}, {}
    a[1] = W[1]*x + b[1]
    for i in range(1, L):
        h[i] = sigmoid(a[i])
        a[i+1] = W[i+1]*h[i] + b[i+1]
    return output_layer(a[L])

def compute_loss(X, Y):
    N = len(X)                 # number of data points
    loss = 0
    for x, y in zip(X, Y):
        fx = forward_propagation(x)
        loss += (1/N)*(fx - y)**2          # squared error, averaged over the dataset
    return loss
\(x_i\)
\(a_1 = W_1*x + b_1 = [ 0.8 0.52 0.68 0.7 ]\)
\(h_1 = sigmoid(a_1) = [ 0.69 0.63 0.66 0.67 ]\)
\(a_2 = W_2*h_1 + b_2 = 0.948\)
\(\hat{y} = sigmoid(a_2) = 0.7207\)
Output :
Cross Entropy Loss:
\(L(\Theta) = -1*\log({0.7207})\)
\(= 0.327\)
What is the loss function that you use for a binary classification problem ?
\(x_i\)
\(a_1 = W_1*x + b_1 = [ 0.01 0.71 0.42 0.63 ]\)
\(h_1 = sigmoid(a_1) = [ 0.50 0.67 0.60 0.65 ]\)
\(a_2 = W_2*h_1 + b_2 = 0.921\)
\(\hat{y} = sigmoid(a_2) = 0.7152\)
Output :
Cross Entropy Loss:
\(L(\Theta) = -1*\log({1- 0.7152})\)
\(= 1.2560\)
What is the loss function that you use for a binary classification problem ?
\(x_i\)
What is the loss function that you use for a binary classification problem ?
import numpy as np

X = [X1, X2, X3, X4, ..., XN]  # N 'd'-dimensional data points
Y = [y1, y2, y3, y4, ..., yN]  # binary labels (0 or 1)

def sigmoid(a):
    return 1.0/(1.0 + np.exp(-a))

def output_layer(a):           # sigmoid output for binary classification
    return sigmoid(a)

def forward_propagation(x):
    L = 3                      # total number of layers
    W = {...}                  # assume weights are learnt
    b = {...}                  # assume biases are learnt
    a, h = {}, {}
    a[1] = W[1]*x + b[1]
    for i in range(1, L):
        h[i] = sigmoid(a[i])
        a[i+1] = W[i+1]*h[i] + b[i+1]
    return output_layer(a[L])

def compute_loss(X, Y):
    N = len(X)                 # number of data points
    loss = 0
    for x, y in zip(X, Y):
        fx = forward_propagation(x)
        if y == 0:
            loss += -(1/N)*np.log(1 - fx)  # cross entropy when the true class is 0
        else:
            loss += -(1/N)*np.log(fx)      # cross entropy when the true class is 1
    return loss
What is the loss function that you use for a multi-class classification problem ?
\(x_i\)
\(a_1 = W_1*x + b_1 = [ -0.19 -0.16 -0.09 0.77 ]\)
\(h_1 = sigmoid(a_1) = [ 0.45 0.46 0.49 0.68 ]\)
\(a_2 = W_2*h_1 + b_2 = [ 0.13 0.33 0.89 ]\)
\(\hat{y} = softmax(a_2) = [ 0.23 0.28 0.49 ]\)
Output :
Cross Entropy Loss:
\(L(\Theta) = -1*\log({0.28})\)
\(= 1.2729\)
What is the loss function that you use for a multi-class classification problem ?
\(x_i\)
\(a_1 = W_1*x + b_1 = [ 0.62 0.09 0.2 -0.15 ]\)
\(h_1 = sigmoid(a_1) = [ 0.65 0.52 0.55 0.46 ]\)
\(a_2 = W_2*h_1 + b_2 = [ 0.32 0.29 0.85 ]\)
Output :
Cross Entropy Loss:
\(L(\Theta) = -1*\log({0.4648})\)
\(= 0.7661\)
\(\hat{y} = softmax(a_2) = [ 0.2718 0.2634 0.4648 ]\)
What is the loss function that you use for a multi-class classification problem ?
\(x_i\)
\(a_1 = W_1*x + b_1 = [ 0.31 0.39 0.25 -0.54 ]\)
\(h_1 = sigmoid(a_1) = [ 0.58 0.60 0.56 0.37 ]\)
\(a_2 = W_2*h_1 + b_2 = [ 0.39 0.18 0.79 ]\)
\(\hat{y} = softmax(a_2) = [ 0.3024 0.2462 0.4514 ]\)
Output :
Cross Entropy Loss:
\(L(\Theta) = -1*\log({0.4514})\)
\(= 0.7954\)
What is the loss function that you use for a multi-class classification problem ?
\(x_i\)
import numpy as np

X = [X1, X2, X3, X4, ..., XN]  # N 'd'-dimensional data points
Y = [y1, y2, y3, y4, ..., yN]  # one-hot class labels

def sigmoid(a):
    return 1.0/(1.0 + np.exp(-a))

def output_layer(a):           # softmax output for multi-class classification
    e = np.exp(a)
    return e/np.sum(e)

def forward_propagation(x):
    L = 3                      # total number of layers
    W = {...}                  # assume weights are learnt
    b = {...}                  # assume biases are learnt
    a, h = {}, {}
    a[1] = W[1]*x + b[1]
    for i in range(1, L):
        h[i] = sigmoid(a[i])
        a[i+1] = W[i+1]*h[i] + b[i+1]
    return output_layer(a[L])

def compute_loss(X, Y):
    N = len(X)                 # number of data points
    loss = 0
    for x, y in zip(X, Y):
        fx = forward_propagation(x)
        for i in range(len(y)):
            loss += -(1/N)*y[i]*np.log(fx[i])   # cross entropy against the one-hot label
    return loss
What have we learned so far?
(c) One Fourth Labs
\(x_i\)
But, who will give us the weights ?
Learning Algorithm
Can we do a quick recap of some basic calculus ?
(c) One Fourth Labs
\(?\)
\(?\)
\(?\)
Can we do a quick recap of some basic calculus ?
(c) One Fourth Labs
\(?\)
\(Say \ \ f(x) =(1/x)\)
\(, \ \ g(x) = e^{-x^{2}}\)
\(Say \ \ p(x) =e^{x}\)
\(, \ \ q(x) = -x^{2}\)
\(Say \ \ m(x) =-x\)
\(, \ \ n(x) = x^{2}\)
Can we do a quick recap of some basic calculus ?
(c) One Fourth Labs
\(?\)
\(Say \ \ f(x) =sin(x)\)
\(, \ \ g(x) = 1/e^{-x^{2}}\)
\(?\)
\(Say \ \ f(x) =cos(x)\)
\(, \ \ g(x) = sin(1/e^{-x^{2}})\)
\(?\)
\(Say \ \ f(x) =log(x)\)
\(, \ \ g(x) = cos(sin(1/e^{-x^{2}}))\)
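As one worked instance of the chain rule using the functions above (note that \(1/e^{-x^{2}} = e^{x^{2}}\)):

\( \frac{d}{dx} \sin\left(e^{x^{2}}\right) = \cos\left(e^{x^{2}}\right) \cdot \frac{d}{dx} e^{x^{2}} = \cos\left(e^{x^{2}}\right) \cdot e^{x^{2}} \cdot 2x \)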
Can we do a quick recap of some basic calculus ?
[Computation graph for \(\log(\cos(\sin(1/e^{-x^{2}})))\): the input \(x\) passes through the nodes \(x^{2}\), \(e^{-x}\), \(\sin(1/x)\), \(\cos(x)\), \(\log(x)\), with weights \(w_{1}, w_{2}, w_{3}, w_{4}, w_{5}\) along the chain]
How do we compute partial derivative ?
Assume that all other variables are constant
Can we do a quick recap of some basic calculus ?
(c) One Fourth Labs
[Network diagram: input \(x\), hidden nodes \(h_{1}, h_{2}, h_{3}\), output \(y\), with weights \(w_{1}, \dots, w_{7}\) on the edges]
(c) One Fourth Labs
Wouldn't it be tedious to compute such a partial derivative w.r.t all variables ?
Well, not really. We can reuse some of the work.
What are the key takeaways ?
(c) One Fourth Labs
No matter how complex the function, we can always compute the derivative w.r.t. any variable using the chain rule.
We can reuse a lot of work by starting backwards and computing the simpler elements in the chain first.
What is a gradient ?
A gradient is simply a collection of partial derivatives.
Can we use the same Gradient Descent algorithm as before ?
\(x\)
\( Earlier: w, b\)
\(Now: w_{11}, w_{12}, ... \)
\( Earlier: L(w, b)\)
\(Now: L(w_{11}, w_{12}, ...) \)
\(x_i\)
import numpy as np

X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(x, w, b):                    # sigmoid with parameters w, b
    return 1.0/(1.0 + np.exp(-(w*x + b)))

def error(w, b):
    err = 0.0
    for x, y in zip(X, Y):
        fx = f(x, w, b)
        err += 0.5*(fx - y)**2
    return err

def grad_w(x, y, w, b):
    fx = f(x, w, b)
    return (fx - y)*fx*(1 - fx)*x

def grad_b(x, y, w, b):
    fx = f(x, w, b)
    return (fx - y)*fx*(1 - fx)

def do_gradient_descent():
    w, b, eta, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        dw, db = 0, 0                        # accumulate gradients over the whole dataset
        for x, y in zip(X, Y):
            dw += grad_w(x, y, w, b)
            db += grad_b(x, y, w, b)
        w = w - eta*dw
        b = b - eta*db
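For reference, the expressions in grad_w and grad_b follow from the chain rule applied to one term of the error, with \(f(x) = \frac{1}{1+e^{-(wx+b)}}\):

\( \frac{\partial}{\partial w} \frac{1}{2}(f(x) - y)^{2} = (f(x) - y) \cdot f(x)(1 - f(x)) \cdot x \)

\( \frac{\partial}{\partial b} \frac{1}{2}(f(x) - y)^{2} = (f(x) - y) \cdot f(x)(1 - f(x)) \)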
How many derivatives do we need to compute and how do we compute them?
(c) One Fourth Labs
\(x_i\)
How do we compute the partial derivatives ?
\(x_2\)
\(x_1\)
\(x_3\)
\(x_4\)
\(a_1 = W_1*x + b_1 = [ 2.9 1.4 2.1 2.3 ]\)
\(h_1 = sigmoid(a_1) = [ 0.95 0.80 0.89 0.91 ]\)
\(a_2 = W_2*h_1 + b_2 = [ 1.66 0.45 ]\)
\(\hat{y} = softmax(a_2) = [ 0.77 0.23 ]\)
Output :
Squared Error Loss :
\(L(\Theta) = (1 - 0.77)^2 + (0.23)^2\)
\(= 0.1058\)
Can we see one more example ?
\(x_2\)
\(x_1\)
\(x_3\)
\(x_4\)
\(a_1 = W_1*x + b_1 = [ 2.9 1.4 2.1 2.3 ]\)
\(h_1 = sigmoid(a_1) = [ 0.95 0.80 0.89 0.91 ]\)
\(a_2 = W_2*h_1 + b_2 = [ 1.66 0.45 ]\)
\(\hat{y} = softmax(a_2) = [ 0.77 0.23 ]\)
Output :
Cross Entropy Loss :
\(L(\Theta) = -1*\log(0.77) \)
\(= 0.2614\)
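A standard fact (not shown on the slide, but worth noting) that makes backpropagation through this output layer simple: with a softmax output and cross-entropy loss, the gradient w.r.t. the pre-activation \(a_2\) is just the prediction minus the one-hot true label:

\( \frac{\partial L(\Theta)}{\partial a_{2}} = \hat{y} - y = [\ 0.77\ \ 0.23\ ] - [\ 1\ \ 0\ ] = [\ -0.23\ \ 0.23\ ] \)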
What happens if we change the loss function ?
Isn't this too tedious ?
(c) One Fourth Labs
Show a small DNN on LHS
ON RHS now show a pytorch logo
Now show the compute graph for one of the weights
loss.backward() is essentially all you need to write in PyTorch; autograd builds the compute graph and fills in all the gradients for you
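A minimal PyTorch sketch of this idea; the layer sizes and data below are made up purely for illustration:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 3), nn.Sigmoid(), nn.Linear(3, 2))
loss_fn = nn.CrossEntropyLoss()      # applies softmax + cross entropy internally

x = torch.randn(1, 4)                # one 4-dimensional input
y = torch.tensor([0])                # its true class label
loss = loss_fn(model(x), y)
loss.backward()                      # autograd computes the gradient for every parameter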
How do you check the performance of a deep neural network?
(c) One Fourth Labs
Test Data
Indian Liver Patient Records \(^{*}\)
- Predict whether the person needs to be diagnosed for liver disease or not
Age | Albumin | T_Bilirubin | y | Predicted |
---|---|---|---|---|
65 | 3.3 | 0.7 | 0 | 0 |
62 | 3.2 | 10.9 | 0 | 1 |
20 | 4 | 1.1 | 1 | 1 |
84 | 3.2 | 0.7 | 1 | 0 |
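On this test set the model gets 2 of the 4 predictions right, i.e. an accuracy of 50%. A minimal sketch of that computation (the lists mirror the table above):

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 0]
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p)/len(y_true)
print(accuracy)   # 0.5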
What are the new things that we learned in this module ?
(c) One Fourth Labs
\( x_i \in \mathbb{R} \)
Loss
Model
Data
Task
Evaluation
Learning
Real inputs
Tasks with Real Inputs and Real Outputs
Back-propagation
Squared Error Loss :
Cross Entropy Loss: