Recurrent Neural Networks
Sequence Processing Tasks
- Sequence tagging
  - POS tagging
  - Named Entity Recognition
  - Sentence Compression
- Sequence transduction
  - Machine Translation
  - Speech Recognition
  - Dialogue
Recurrent Neural Networks
[Figure: RNN unrolled over \( x_1, \ldots, x_5 \): shared weights \( U \) (input), \( W \) (recurrent), and \( V \) (output) produce states \( s_1, \ldots, s_5 \) and predictions \( \hat{y}_1, \ldots, \hat{y}_5 \)]
- \( s_i = \sigma(Ux_i + Ws_{i-1} + b) \)
- \( \hat{y}_i = o(Vs_i + c) \), where \( o \) is the output nonlinearity (softmax here; sketched in code below)
- \( s_i = \mathrm{RNN}(s_{i-1}, x_i) \), with initial state \( s_0 \)
Example: "Find a cheap Chinese restaurant" \( \to \) VB DT JJ JJ NN
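A minimal NumPy sketch of this forward pass (the sigmoid state update and softmax output follow the equations above; all names and dimensions are illustrative assumptions):

```python
import numpy as np

def rnn_forward(X, U, W, V, b, c, s0):
    """s_i = sigmoid(U x_i + W s_{i-1} + b); y_hat_i = softmax(V s_i + c).
    X holds one input vector x_i per row."""
    s, Y = s0, []
    for x in X:
        s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s + b)))  # state update
        a = V @ s + c                                   # output scores
        e = np.exp(a - a.max())                         # stable softmax
        Y.append(e / e.sum())
    return np.stack(Y)

rng = np.random.default_rng(0)
d, n_in, n_out, T = 4, 3, 2, 5
Y = rnn_forward(rng.normal(size=(T, n_in)), rng.normal(size=(d, n_in)),
                rng.normal(size=(d, d)), rng.normal(size=(n_out, d)),
                np.zeros(d), np.zeros(n_out), np.zeros(d))
print(Y.shape)  # (5, 2): one distribution per position
```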
Recurrent Neural Networks
[Figure: a single RNN cell: input \( x_i \) and previous state \( s_{i-1} \) feed the new state \( s_i \) through \( U \) and \( W \); the output layer \( V \in \mathbb{R}^{36 \times d} \) maps \( s_i \) to \( \hat{y}_i \)]
- \( \hat{y}_i = \mathrm{softmax}(Vs_i + c) \)
- \( \mathrm{softmax}(a)_i = \frac{e^{a_i}}{\sum_j e^{a_j}} \)
- \( \mathscr{L}_i = -\sum_c y_{ci} \log \hat{y}_{ci} \)
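A short sketch of this per-position loss, assuming a one-hot gold tag \( y_i \) over the 36 tags (random parameters, for illustration only):

```python
import numpy as np

def position_loss(V, c, s_i, gold_tag):
    """Cross-entropy -sum_c y_ci log y_hat_ci; with one-hot y this is just
    the negative log-probability of the gold tag."""
    a = V @ s_i + c
    e = np.exp(a - a.max())          # numerically stable softmax
    return -np.log(e[gold_tag] / e.sum())

d, rng = 300, np.random.default_rng(0)
loss = position_loss(rng.normal(0, 0.01, (36, d)), np.zeros(36),
                     rng.normal(size=d), gold_tag=7)
print(loss)  # roughly log(36) ~ 3.58 for near-uniform initial predictions
```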
HMM vs RNN
HMMs are simpler than RNNs
- 36 Hidden States (POS Tags)
- 5000 words in vocabulary
- HMM Parameters:
- Transition Parameters: 36*36 = 1296
- Emission Parameters: 36*5000 = 180000
- Total = 181296
- RNN Parameters:
- \(U\) = 5000*300 = 1500000
- \(W\) = 300*300 = 90000
- \(V\) = 300*36 = 10800
- Total = 1600800
HMMs have fewer parameters, hence require less data (checked below)
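The totals above can be checked directly (hidden size \( d = 300 \) as assumed on the slide; bias terms are ignored, matching the counts):

```python
n_tags, vocab, d = 36, 5000, 300

hmm_params = n_tags * n_tags + n_tags * vocab  # transitions + emissions
rnn_params = vocab * d + d * d + d * n_tags    # U + W + V
print(hmm_params, rnn_params)                  # 181296 1600800
```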
HMM vs RNN
- HMMs make a Markov Assumption
\( P(y_t \mid y_{t-1}, \ldots, y_2, y_1) = P(y_t \mid y_{t-1}) \)
- RNNs model the entire conditional dependency, e.g. the next-word probability \( P(w_6 \mid w_5, w_4, \ldots, w_1) \) (unrolled explicitly after the figure)
[Figure: RNN unrolled over the prefix "Find me a cheap Chinese" (\( x_1, \ldots, x_5 \)); only the final state \( s_5 \) emits a prediction \( \hat{y}_5 \) for the next word]
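To make the contrast explicit (a restatement, not on the original slide), unrolling the recurrence shows that the final state is a function of the whole prefix:

\( s_5 = f(x_5, f(x_4, f(x_3, f(x_2, f(x_1, s_0))))) \)

so \( \hat{y}_5 = o(Vs_5 + c) \) conditions on all of \( w_1, \ldots, w_5 \), with no Markov truncation.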
HMM vs RNN
- HMMs are generative models
[Figure: HMM generating the words "Find a cheap" from hidden tags VB DT JJ]
\(P(x) = \sum_y P(x,y) \)
\( = \sum_y P(x|y)P(y)\)
\( = \sum_y \prod_t P(x_t|y_t) \prod_t P(y_t|y_{t-1}) \)
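A toy check of this marginalization, brute-forcing the sum over all tag sequences \( y \) (the 2-tag, 2-word probability tables below are invented for illustration; a start distribution stands in for \( P(y_1 \mid y_0) \)):

```python
from itertools import product

tags = ["DT", "NN"]
start = {"DT": 0.7, "NN": 0.3}                 # P(y_1)
trans = {"DT": {"DT": 0.1, "NN": 0.9},         # P(y_t | y_{t-1})
         "NN": {"DT": 0.6, "NN": 0.4}}
emit = {"DT": {"a": 0.8, "cat": 0.2},          # P(x_t | y_t)
        "NN": {"a": 0.1, "cat": 0.9}}

x = ["a", "cat"]
p_x = 0.0
for y in product(tags, repeat=len(x)):         # sum over tag sequences
    p = start[y[0]] * emit[y[0]][x[0]]
    for t in range(1, len(x)):
        p *= trans[y[t - 1]][y[t]] * emit[y[t]][x[t]]
    p_x += p
print(p_x)  # P(x) for "a cat" under the toy HMM
```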
HMM vs RNN
- RNNs are primarily discriminative models
[Figure: the same unrolled RNN over \( x_1, \ldots, x_5 \), producing only \( \hat{y}_5 \) from the final state \( s_5 \)]
\( P(\hat{y}_5 \mid x_5, x_4, \ldots, x_1) \)
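A sketch of the discriminative view: run the chain over the whole input and read a distribution over tags off the final state. The shapes match the counting slide (36 tags, 5000-word vocabulary, \( d = 300 \)); the random parameters and inputs are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_tags = 300, 5000, 36
U = rng.normal(0, 0.01, (d, vocab))
W = rng.normal(0, 0.01, (d, d))
V = rng.normal(0, 0.01, (n_tags, d))
b, c, s = np.zeros(d), np.zeros(n_tags), np.zeros(d)

words = rng.integers(vocab, size=5)   # indices of x_1..x_5
for w in words:                       # s_5 folds in every earlier input
    s = 1.0 / (1.0 + np.exp(-(U[:, w] + W @ s + b)))  # U @ one-hot = column
a = V @ s + c
p = np.exp(a - a.max())
p /= p.sum()
print(p.sum(), p.argmax())  # a proper distribution P(y_5 | x_5, ..., x_1)
```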
By Suman Banerjee