CVPR 2018
Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, Yanbo Gao
Mathematical Notation
Independently Recurrent Neural Network (IndRNN)
Conclusion
Experiments
Question
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
$$\tag{1} h_t=\sigma(\textbf Wx_t+u\odot h_{t-1}+b)$$
\(h_t\) is the hidden state at time step \(t\) and \(x_t\) is the input at time step \(t\); \(\mathbf W\) is the input weight matrix, \(u\) is the recurrent weight vector, \(b\) is the bias, and \(\odot\) denotes the element-wise (Hadamard) product, so each neuron only receives its own previous hidden state.
Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. "Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN." CVPR 2018.
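A minimal NumPy sketch of the recurrence in equation \((1)\); the shapes and the ReLU choice here are illustrative, not taken from the paper:

```python
import numpy as np

def ind_rnn_step(x_t, h_prev, W, u, b, activation=np.tanh):
    """One IndRNN step, equation (1): h_t = sigma(W x_t + u * h_{t-1} + b).
    The recurrent term is an element-wise product with the vector u,
    so each neuron only sees its own previous state."""
    return activation(W @ x_t + u * h_prev + b)

# Toy usage with illustrative sizes: input dim 4, hidden dim 8, ReLU activation.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 4))
u = rng.uniform(0.0, 1.0, size=8)
b = np.zeros(8)
h = np.zeros(8)
for x_t in rng.standard_normal((10, 4)):   # a sequence of length 10
    h = ind_rnn_step(x_t, h, W, u, b, activation=lambda z: np.maximum(z, 0.0))
```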
Consider a single neuron in equation \((1)\):
$$\tag{2}\mathrm h_{n,t}=\sigma(w_nx_t+\mathrm u_n\mathrm h_{n,t-1}+\mathrm b_n)$$
\(\mathrm h_{n,t}\) and \(\mathrm u_n\) are the \(n\)-th elements of \(h_t\) and \(u\), respectively.
$$\gdef\rm#1{\mathrm{#1}}\def\head{\frac{\partial \rm{J}_T}{\partial \rm{h}_{n,T}}}\begin{aligned}\frac{\partial \rm{J}_T}{\partial \rm{h}_{n, t}}&=\head\frac{\partial \rm{h}_{n,T}}{\partial \rm{h}_{n, t}}=\head\prod_{k=t}^{T-1}\frac{\partial \rm{h}_{n, k+1}}{\partial \rm{h}_{n, k}}\\&=\head\prod_{k=t}^{T-1}\sigma'_{n, k+1}\rm{u}_n\\&=\tag{3}\head \rm{u}_n^{T-t}\prod_{k=t}^{T-1}\underbrace{\sigma'_{n, k+1}}_{\text{activation}}\end{aligned}$$
\(\mathrm J_T\) is the objective at time step \(\mathrm T\).
From equation \((3)\), focus on the term that depends on the recurrent weight \(\mathrm u_n\):
$$\gdef\rm#1{\mathrm{#1}}\def\head{\frac{\partial \rm{J}_T}{\partial \rm{h}_{n,T}}}\frac{\partial \rm{J}_T}{\partial \rm{h}_{n, t}}=\head \underbrace{\rm{u}_n^{T-t}\prod_{k=t}^{T-1}\sigma'_{n, k+1}}_{\text{only consider this term}}$$
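Since \(|\sigma'|\le 1\) for common activations, the behaviour of this term is dominated by \(\mathrm u_n^{T-t}\). A quick numeric check (values are illustrative) shows how that factor vanishes or explodes over a long time lag:

```python
# How u_n ** (T - t) behaves over a lag of 100 steps (illustrative values).
for u_n in (0.9, 1.0, 1.1):
    print(u_n, u_n ** 100)
# 0.9 -> ~2.7e-05 (vanishing), 1.0 -> 1.0 (stable), 1.1 -> ~1.4e+04 (exploding)
```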
To avoid vanishing and exploding gradients, keep this term within a range \([\epsilon, \gamma]\):
$$\def\act{\prod_{k=t}^{T-1}\sigma'_{n, k+1}}\gdef\rm#1{\mathrm{#1}}\\\tag{4}\epsilon\le\rm{u}_n^{T-t}\act\le\gamma\\\ \sqrt[T-t]{\frac{\epsilon}{\act}}\le\rm{u}_n\le\sqrt[T-t]{\frac{\gamma}{\act}}$$
From equation \((4)\), choosing ReLU as the activation (its derivative is \(1\) wherever the neuron is active), the range simplifies to:
$$\def\act{\prod_{k=t}^{T-1}\sigma'_{n, k+1}}\gdef\rm#1{\mathrm{#1}}\\\tag{5}\sqrt[T-t]{\epsilon}\le\rm{u}_n\le\sqrt[T-t]{\gamma}$$
If necessary, the lower bound can be relaxed to \(0\) so that a neuron can forget the previous state and rely only on the current input \(x_t\):
$$\def\act{\prod_{k=t}^{T-1}\sigma'_{n, k+1}}\gdef\rm#1{\mathrm{#1}}\\\tag{6}0\le\rm{u}_n\le\sqrt[T-t]{\gamma}$$
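A small helper (hypothetical, not from the paper's code) that computes the upper bound of equation \((6)\) for a ReLU IndRNN:

```python
def recurrent_weight_bound(gamma, time_lag):
    """Upper bound on u_n from equation (6): gamma ** (1 / (T - t))."""
    return gamma ** (1.0 / time_lag)

# e.g. keeping the gradient over 784 steps (sequential MNIST length) below 2:
print(recurrent_weight_bound(gamma=2.0, time_lag=784))  # ~1.00088
```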
Single Layer IndRNN
Residual IndRNN
Rewrite equation \((1)\) (dropping the bias and activation for simplicity) and add a second, fully connected layer \(h_{s, t}\):
$$\begin{aligned}\tag{7}h_{f, t}&=\mathbf W_fx_t+diag(u)h_{f, t-1}\\h_{s, t}&=\mathbf W_sh_{f, t}\end{aligned}$$
Eliminating \(h_{f, t}\) from equations \((7)\) (assuming \(\mathbf W_s\) is invertible), we have
$$h_{s, t}=\mathbf Wx_t+\mathbf Dh_{s, t-1}$$
where \(\mathbf W = \mathbf W_s\mathbf W_f\) and \(\mathbf D = \mathbf W_s\,diag(u)\,\mathbf W_s^{-1}\) is a diagonalizable matrix. Compare with the vanilla RNN:
$$h_{s, t}=\mathbf Wx_t+\mathbf Uh_{s, t-1}$$
\(\mathbf U \in \mathbb R^{N\times N}\) can be an arbitrary matrix, so a two-layer IndRNN corresponds to a vanilla RNN whose recurrent weight matrix is constrained to be diagonalizable.
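A numeric sanity check of this equivalence (random shapes are illustrative): the effective recurrent matrix \(\mathbf D=\mathbf W_s\,diag(u)\,\mathbf W_s^{-1}\) is similar to \(diag(u)\), so its eigenvalues are exactly the entries of \(u\) and it is diagonalizable by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 5                                  # input size, hidden size (illustrative)
W_f = rng.standard_normal((N, M))            # first-layer input weights
W_s = rng.standard_normal((N, N))            # second (fully connected) layer, assumed invertible
u = rng.uniform(-1, 1, size=N)               # recurrent weight vector of the IndRNN layer

W = W_s @ W_f                                # effective input weights
D = W_s @ np.diag(u) @ np.linalg.inv(W_s)    # effective recurrent matrix

# D is similar to diag(u), so its eigenvalues are exactly u (up to rounding).
print(np.allclose(np.sort(np.linalg.eigvals(D).real), np.sort(u)))  # True
```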
| Architecture | Space Complexity | Time Complexity |
|---|---|---|
| Vanilla RNN | \(M\times N+N\times N\) | \(N\times N\times T\) |
| 1-layer IndRNN | \(M\times N+N\) | \(N\times T\) |
| 2-layer IndRNN | \(M\times N+N\times N + 2\times N\) | \(2\times N\times T\) |

Here \(M\) is the input dimension, \(N\) the number of hidden neurons, and \(T\) the sequence length; the time complexity counts the recurrent (hidden-to-hidden) operations.
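As a quick illustration of the table (the helper and the example sizes are hypothetical):

```python
def recurrent_layer_sizes(M, N):
    """Parameter counts from the table above (biases ignored)."""
    return {
        "vanilla_rnn":   M * N + N * N,          # W (NxM) + U (NxN)
        "indrnn_1layer": M * N + N,              # W (NxM) + u (N)
        "indrnn_2layer": M * N + N * N + 2 * N,  # W_f, W_s and two u vectors
    }

print(recurrent_layer_sizes(M=32, N=128))
# {'vanilla_rnn': 20480, 'indrnn_1layer': 4224, 'indrnn_2layer': 20736}
```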
IndRNN
Vanilla RNN
Input sequences (the adding problem): a sequence of values paired with a 0/1 mask that marks two positions \(n_1\) and \(n_2\):
$$\begin{aligned}x_1&,& \cdots &,& x_{n_1}&,&\cdots &,& x_{n_2}&,& \cdots &,& x_T\\0&,& \cdots&,& 1&,& \cdots&,& 1&,& \cdots&,& 0\end{aligned}$$
Regression output: the sum of the two marked values,
$$x_{n_1}+x_{n_2}$$
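A data generator for this task might look as follows (a sketch; the exact sampling ranges and marker positions used in the paper may differ):

```python
import numpy as np

def adding_problem_batch(batch_size, T, rng=None):
    """Each sample is a T x 2 sequence: column 0 holds random values,
    column 1 is a 0/1 mask with exactly two ones at positions n1, n2.
    The regression target is x_{n1} + x_{n2}."""
    rng = rng or np.random.default_rng()
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    mask = np.zeros((batch_size, T))
    for i in range(batch_size):
        n1, n2 = rng.choice(T, size=2, replace=False)
        mask[i, [n1, n2]] = 1.0
    inputs = np.stack([values, mask], axis=-1)   # shape (batch_size, T, 2)
    targets = (values * mask).sum(axis=1)        # shape (batch_size,)
    return inputs, targets

x, y = adding_problem_batch(batch_size=4, T=100)
```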
IndRNN has already been implemented in TensorFlow.
Note that the TF implementation does not include recurrent weight clipping.
Popular implementation: github
import tensorflow as tf  # TF 1.x with tf.contrib available

x = ...  # inputs: a list of [batch_size, input_size] tensors, one per time step
cell = tf.contrib.rnn.IndRNNCell(num_units=128, activation=tf.nn.relu)
outputs, state = tf.contrib.rnn.static_rnn(cell, x, dtype=tf.float32)
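If recurrent weight clipping per equation \((6)\) is wanted on top of this, one possible sketch is shown below; the variable-name filter is an assumption, not part of the TF API, and must be adapted to the actual variable names in the graph.

```python
import tensorflow as tf

T = 784                              # sequence length (illustrative)
gamma = 2.0
bound = gamma ** (1.0 / T)           # upper bound from equation (6)

# Assumption: the cell's recurrent weight variables contain "recurrent"
# in their name; adjust the filter to your graph.
recurrent_vars = [v for v in tf.trainable_variables()
                  if "recurrent" in v.name.lower()]
clip_recurrent = tf.group(*[v.assign(tf.clip_by_value(v, 0.0, bound))
                            for v in recurrent_vars])
# Run `clip_recurrent` after each optimizer step in the training loop.
```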
tf.contrib.rnn.IndyGRUCell, source
$$r_j = \sigma\left([\mathbf W_r\mathbf x]_j +[\mathbf u_r\circ \mathbf h_{(t-1)}]_j\right)\\z_j = \sigma\left([\mathbf W_z\mathbf x]_j +[\mathbf u_z\circ \mathbf h_{(t-1)}]_j\right)\\\tilde{h}^{(t)}_j = \phi\left([\mathbf W \mathbf x]_j +[\mathbf u \circ \mathbf r \circ \mathbf h_{(t-1)}]_j\right)\\h^{(t)}_j = z_j [\mathbf h_{(t-1)}]_j + (1-z_j)\tilde{h}^{(t)}_j$$
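A plain NumPy sketch of one IndyGRU step following these equations (parameter names are illustrative; biases are omitted as in the formulas above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def indy_gru_step(x_t, h_prev, p):
    """p maps names to parameters: W_r, W_z, W are matrices; u_r, u_z, u are vectors."""
    r = sigmoid(p["W_r"] @ x_t + p["u_r"] * h_prev)          # reset gate
    z = sigmoid(p["W_z"] @ x_t + p["u_z"] * h_prev)          # update gate
    h_tilde = np.tanh(p["W"] @ x_t + p["u"] * (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                  # new hidden state
```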
tf.contrib.rnn.IndyLSTMCell, source
$$\begin{aligned}f_t &= \sigma_g\left(W_f x_t + u_f \circ h_{t-1} + b_f\right)\\i_t &= \sigma_g\left(W_i x_t + u_i \circ h_{t-1} + b_i\right)\\o_t &= \sigma_g\left(W_o x_t + u_o \circ h_{t-1} + b_o\right)\\c_t &= f_t \circ c_{t-1} +i_t \circ \sigma_c\left(W_c x_t + u_c \circ h_{t-1} + b_c\right)\\h_t &= o_t \circ \sigma_h\left(c_t\right)\end{aligned}$$
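Similarly, a NumPy sketch of one IndyLSTM step (parameter names are illustrative; \(h_t=o_t\circ\sigma_h(c_t)\) is the standard LSTM output equation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def indy_lstm_step(x_t, h_prev, c_prev, p):
    """p maps names to parameters: W_* are matrices, u_* and b_* are vectors."""
    f = sigmoid(p["W_f"] @ x_t + p["u_f"] * h_prev + p["b_f"])  # forget gate
    i = sigmoid(p["W_i"] @ x_t + p["u_i"] * h_prev + p["b_i"])  # input gate
    o = sigmoid(p["W_o"] @ x_t + p["u_o"] * h_prev + p["b_o"])  # output gate
    c = f * c_prev + i * np.tanh(p["W_c"] @ x_t + p["u_c"] * h_prev + p["b_c"])
    h = o * np.tanh(c)                                          # new hidden state
    return h, c
```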
1. Should frame-wise or sequence-wise batch normalization be used?
2. Compared with gradient clipping, what is the advantage of weight clipping?