Neural Networks

Instructor: 溫室蔡

Machine Learning [2]

Neural Network Architecture

Multilayer Perceptron (MLP)

[Diagram: an input layer, a hidden layer, and an output layer, with weights connecting consecutive layers]

Perceptron Math

A perceptron takes inputs x_1, x_2, x_3, multiplies them by weights w_1, w_2, w_3, and adds a bias b_1:

z_1=w_1x_1+w_2x_2+w_3x_3+b_1

Vector Addition and Dot Product

\mathbf{a}= \begin{bmatrix} a_1\\a_2\\a_3\\ \end{bmatrix}, \mathbf{b}= \begin{bmatrix} b_1\\b_2\\b_3\\ \end{bmatrix}
\mathbf{a}\cdot\mathbf{b}= a_1b_1+a_2b_2+a_3b_3
\mathbf{a}+\mathbf{b}= \begin{bmatrix} a_1+b_1\\a_2+b_2\\a_3+b_3\\ \end{bmatrix}
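
In NumPy (the library used later in these slides), these two operations look like this; a minimal sketch with arbitrary numbers:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a + b)         # element-wise addition: [5. 7. 9.]
print(np.dot(a, b))  # dot product: 1*4 + 2*5 + 3*6 = 32.0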

Perceptron Math

Collecting the weights and inputs into vectors, the same perceptron becomes a dot product:

z_1=\mathbf{w}\cdot\mathbf{x}+b_1
\mathbf{w}= \begin{bmatrix} w_1\\w_2\\w_3\\ \end{bmatrix}, \mathbf{x}= \begin{bmatrix} x_1\\x_2\\x_3\\ \end{bmatrix}

One More Perceptron

A second perceptron z_2, with its own weights w_{21}, w_{22}, w_{23} and bias b_2, reads the same inputs:

z_1=w_{11}x_1+w_{12}x_2+w_{13}x_3+b_1
z_2=w_{21}x_1+w_{22}x_2+w_{23}x_3+b_2

Matrix Multiplication

\mathbf{A}= \begin{bmatrix} a_{11}&a_{12}&a_{13}\\ a_{21}&a_{22}&a_{23}\\ \end{bmatrix}, \mathbf{x}= \begin{bmatrix} x_1\\x_2\\x_3\\ \end{bmatrix}

Multiplying the matrix by the vector takes the dot product of each row of \mathbf{A} with \mathbf{x}:

y_1=a_{11}x_1+a_{12}x_2+a_{13}x_3
y_2=a_{21}x_1+a_{22}x_2+a_{23}x_3
\mathbf{A}\mathbf{x}=\mathbf{y}= \begin{bmatrix} y_1\\y_2 \end{bmatrix}
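
In NumPy this is the @ operator; a minimal sketch with arbitrary numbers:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2x3 matrix
x = np.array([1.0, 0.0, -1.0])    # length-3 vector

y = A @ x                         # same as np.dot(A, x)
print(y)                          # [-2. -2.]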

In Matrix Form

The two perceptron equations

z_1=w_{11}x_1+w_{12}x_2+w_{13}x_3+b_1
z_2=w_{21}x_1+w_{22}x_2+w_{23}x_3+b_2

can be collected into a single matrix equation:

\mathbf{z}=\mathbf{W}\mathbf{x}+\mathbf{b}
\mathbf{x}= \begin{bmatrix} x_1\\x_2\\x_3 \end{bmatrix}, \mathbf{W}= \begin{bmatrix} w_{11}&w_{12}&w_{13}\\ w_{21}&w_{22}&w_{23} \end{bmatrix}, \mathbf{b}= \begin{bmatrix} b_1\\b_2 \end{bmatrix}, \mathbf{z}= \begin{bmatrix} z_1\\z_2 \end{bmatrix}
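
A layer of perceptrons is then just a matrix-vector product plus a bias vector. A minimal sketch with random weights (the shapes are chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))        # 2 perceptrons, 3 inputs each
b = rng.standard_normal((2, 1))        # one bias per perceptron
x = np.array([[0.5], [1.0], [-1.0]])   # input as a column vector

z = W @ x + b                          # shape (2, 1): z_1 and z_2 at once
print(z)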

Forward Propagation

Stacking layers, the input \mathbf{x}_0 is transformed layer by layer:

\mathbf{x}_1=\mathbf{W}_1\mathbf{x}_0+\mathbf{b}_1
\mathbf{x}_2=\mathbf{W}_2\mathbf{x}_1+\mathbf{b}_2

Degeneration

Substituting \mathbf{x}_1=\mathbf{W}_1\mathbf{x}_0+\mathbf{b}_1 into \mathbf{x}_2=\mathbf{W}_2\mathbf{x}_1+\mathbf{b}_2:

\mathbf{x}_2=\mathbf{W}_2(\mathbf{W}_1\mathbf{x}_0+\mathbf{b}_1)+\mathbf{b}_2
\mathbf{x}_2=(\mathbf{W}_2\mathbf{W}_1)\mathbf{x}_0+(\mathbf{W}_2\mathbf{b}_1+\mathbf{b}_2)

Even though the input passes through multiple layers of transformations, the final layer is still a linear function of the first: the network is still just a linear classifier.
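
We can check this numerically: composing two affine layers with random parameters gives exactly the same output as the single collapsed layer (a minimal sketch):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x0 = rng.standard_normal((3, 1))

x2_layered   = W2 @ (W1 @ x0 + b1) + b2          # two layers, no activation
x2_collapsed = (W2 @ W1) @ x0 + (W2 @ b1 + b2)   # single equivalent layer

print(np.allclose(x2_layered, x2_collapsed))     # True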

Activation Functions

The fix is to insert nonlinear functions into the forward pass.

Forward Propagation

With an activation function \sigma applied after each layer:

\mathbf{x}_1=\sigma(\mathbf{W}_1\mathbf{x}_0+\mathbf{b}_1)
\mathbf{x}_2=\sigma(\mathbf{W}_2\mathbf{x}_1+\mathbf{b}_2)

Common Activation Functions

\sigma(x)=\dfrac{1}{1+e^{-x}}
\tanh(x)=\dfrac{e^x-e^{-x}}{e^x+e^{-x}}
\mathrm{ReLU}(x)=\max(0, x)
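
These are straightforward to write with NumPy. A minimal sketch defining all three and using sigmoid in a two-layer forward pass (the weights and shapes are arbitrary illustrations):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)            # NumPy provides tanh directly

def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x0 = rng.standard_normal((3, 1))

x1 = sigmoid(W1 @ x0 + b1)       # nonlinear forward pass
x2 = sigmoid(W2 @ x1 + b2)
print(x2)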

Loss Function

The network maps an input \mathbf{x}_0 through \mathbf{W}_1 and \mathbf{W}_2 to an output \mathbf{x}_2. Comparing that output with the correct answer \mathbf{y} gives the loss, a measure of how far off the prediction is:

\mathcal{L}=(\mathbf{x}_2-\mathbf{y})^2
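
A minimal sketch of this loss for a made-up output and ground-truth pair, summing the squared differences over the vector components:

import numpy as np

x2 = np.array([[0.8], [0.2]])    # network output (hypothetical values)
y  = np.array([[1.0], [0.0]])    # ground truth

loss = np.sum((x2 - y) ** 2)     # squared-error loss
print(loss)                      # 0.04 + 0.04 = 0.08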

Gradient Descent

We cannot solve for the minimum of the loss function directly,

but we can think in terms of its graph:

compute the gradient of the loss surface at the current parameters,

then update the parameters in the direction opposite to the gradient,

so that each step walks downhill on the loss surface.

Gradient Descent

[Figure: loss plotted against a parameter; each gradient step moves the parameter downhill along the curve]

Gradient Descent

Gradient descent can be described by the following equation:

(\mathbf{W},\mathbf{b})_{t+1}=(\mathbf{W},\mathbf{b})_t-\gamma\nabla\mathcal{L}

where (\mathbf{W},\mathbf{b})_{t+1} are the updated parameters, (\mathbf{W},\mathbf{b})_t the current parameters, \gamma the learning rate (the step size), and \nabla\mathcal{L} the gradient of the loss function.
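
A toy illustration (not from the slides): gradient descent on a single scalar weight w with loss L(w) = (wx - y)^2, whose gradient is 2(wx - y)x:

# Toy example: fit y = w * x for a single data point (x, y) = (2, 6).
x, y = 2.0, 6.0
w = 0.0              # initial parameter
lr = 0.05            # learning rate (gamma)

for t in range(100):
    grad = 2 * (w * x - y) * x   # dL/dw for L = (w*x - y)^2
    w -= lr * grad               # step against the gradient

print(w)             # approaches 3.0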

Differentiation

The slope of a function's graph at a point is approximately the ratio of a small change in y to a small change in x, and exactly the limit of that ratio:

\text{slope}\approx\dfrac{\Delta y}{\Delta x}, \qquad \displaystyle\lim_{\Delta x \rightarrow 0}\dfrac{\Delta y}{\Delta x}=\dfrac{\mathrm dy}{\mathrm dx}

Computing Derivatives: By Definition

One way to compute a derivative is to work directly from its definition:

f'(x)=\displaystyle\lim_{\Delta x\rightarrow0}\dfrac{f(x+\Delta x)-f(x)}{\Delta x}

Note: f'(x) is shorthand for \dfrac{\mathrm df}{\mathrm dx}.

Computing Derivatives: By Definition

For example, with f(x)=x^3:

\begin{aligned} f'(x) &=\displaystyle\lim_{\Delta x\rightarrow0}\dfrac{(x+\Delta x)^3-x^3}{\Delta x}\\ &=\displaystyle\lim_{\Delta x\rightarrow0}\dfrac{x^3+3x^2\Delta x+3x(\Delta x)^2+(\Delta x)^3-x^3}{\Delta x}\\ &=\displaystyle\lim_{\Delta x\rightarrow0}3x^2+3x\Delta x+(\Delta x)^2\\ &=3x^2 \end{aligned}
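
A quick numerical sanity check of this result, approximating the derivative with a small finite step (h is an arbitrary small number):

def numerical_derivative(f, x, h=1e-6):
    # finite-difference approximation of f'(x)
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 3
x = 2.0
print(numerical_derivative(f, x))  # close to 3 * x**2 = 12
print(3 * x ** 2)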

Computing Derivatives: By Rules

Working directly from the definition is tedious,

so in practice we memorize the differentiation rules

and compute derivatives by applying them.

Differentiation Rules

Power rule:

\dfrac{\mathrm d(x^n)}{\mathrm dx}=nx^{n-1}

Sum rule:

\dfrac{\mathrm d(f+g)}{\mathrm dx}=\dfrac{\mathrm df}{\mathrm dx}+\dfrac{\mathrm dg}{\mathrm dx}

Product rule:

\dfrac{\mathrm d(fg)}{\mathrm dx}=\dfrac{\mathrm df}{\mathrm dx}g+f\dfrac{\mathrm dg}{\mathrm dx}

Chain rule:

\dfrac{\mathrm d(f(g(x)))}{\mathrm dx}=\dfrac{\mathrm df}{\mathrm dg}\dfrac{\mathrm dg}{\mathrm dx}
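
These rules are what we will later apply to the sigmoid: the chain rule (together with the power and exponential rules) gives \sigma'(x)=\sigma(x)(1-\sigma(x)), the dsigmoid used in the code below. A minimal numerical check:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(x):
    # result of applying the differentiation rules to sigmoid
    return sigmoid(x) * (1 - sigmoid(x))

x, h = 0.7, 1e-6
finite_diff = (sigmoid(x + h) - sigmoid(x)) / h
print(finite_diff, dsigmoid(x))    # the two values agree closely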

Differentiating Exponentials

\dfrac{\mathrm d(e^x)}{\mathrm dx}=e^x
\begin{aligned} \dfrac{\mathrm d(a^x)}{\mathrm dx} &=\dfrac{\mathrm d((e^{\ln a})^x)}{\mathrm dx}\\ &=\dfrac{\mathrm d(e^{x \ln a})}{\mathrm dx}\\ &=\dfrac{\mathrm d(e^{x \ln a})}{\mathrm d(x \ln a)}\dfrac{\mathrm d(x \ln a)}{\mathrm dx}\\ &=e^{x \ln a}\ln a\\ &=a^x\ln a\\ \end{aligned}

Differentiating Logarithms

\begin{aligned} & y=\ln x\\ & \Rightarrow e^y=x\\ & \Rightarrow \mathrm d(e^y)=\mathrm dx\\ & \Rightarrow \dfrac{\mathrm d(e^y)}{\mathrm dy}\mathrm dy=\mathrm dx\\ & \Rightarrow e^y\mathrm dy=\mathrm dx\\ & \Rightarrow \dfrac{\mathrm dy}{\mathrm dx}=\dfrac{1}{e^y}=\dfrac{1}{x} \end{aligned}

Differentiating Trigonometric Functions

Differentiating repeatedly cycles through four functions:

\sin x \xrightarrow{\ \mathrm{d}/\mathrm{d}x\ } \cos x \xrightarrow{\ \mathrm{d}/\mathrm{d}x\ } -\sin x \xrightarrow{\ \mathrm{d}/\mathrm{d}x\ } -\cos x \xrightarrow{\ \mathrm{d}/\mathrm{d}x\ } \sin x

Differentiating the Loss Function

Consider a general layer \ell with weights \mathbf W_\ell, input \mathbf{x}_{\ell-1}, and output \mathbf{x}_\ell, compared against the correct answer \mathbf{y}:

\mathcal{L}=(\mathbf{x}_\ell-\mathbf{y})^2

Differentiating the Loss Function

Splitting the layer into its linear part and its activation,

\mathbf{z}_\ell=\mathbf W_\ell \mathbf{x}_{\ell-1}+\mathbf b_\ell
\mathbf{x}_\ell=\sigma(\mathbf z_\ell)
\mathcal{L}=(\mathbf{x}_\ell-\mathbf{y})^2

the chain rule gives the gradient with respect to the weights:

\begin{aligned} \dfrac{\partial \mathcal{L}}{\partial \mathbf W_\ell} &= \dfrac{\partial \mathcal{L}}{\partial \mathbf x_\ell} \dfrac{\partial \mathbf x_\ell}{\partial \mathbf z_\ell} \dfrac{\partial \mathbf z_\ell}{\partial \mathbf W_\ell} \\ &=2(\mathbf x_\ell-\mathbf{y})\sigma'(\mathbf z_\ell)\mathbf{x}_{\ell-1} \end{aligned}

Differentiating the Loss Function

The same chain rule applied to the bias gives:

\begin{aligned} \dfrac{\partial \mathcal{L}}{\partial \mathbf b_\ell} &= \dfrac{\partial \mathcal{L}}{\partial \mathbf x_\ell} \dfrac{\partial \mathbf x_\ell}{\partial \mathbf z_\ell} \dfrac{\partial \mathbf z_\ell}{\partial \mathbf b_\ell} \\ &=2(\mathbf x_\ell-\mathbf{y})\sigma'(\mathbf z_\ell) \end{aligned}
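
A minimal sketch of these two gradients for one layer, with random stand-ins for \mathbf{x}_{\ell-1}, \mathbf W_\ell, \mathbf b_\ell and a made-up ground truth \mathbf{y} (all shapes are illustrative assumptions):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

rng = np.random.default_rng(0)
x_prev = rng.standard_normal((3, 1))   # x_{l-1}
W = rng.standard_normal((2, 3))        # W_l
b = rng.standard_normal((2, 1))        # b_l
y = np.array([[1.0], [0.0]])           # ground truth

z = W @ x_prev + b                     # z_l
x = sigmoid(z)                         # x_l

dL_db = 2 * (x - y) * dsigmoid(z)      # dL/db_l, shape (2, 1)
dL_dW = dL_db @ x_prev.T               # dL/dW_l, shape (2, 3)
print(dL_dW.shape, dL_db.shape)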

Updating the Parameters

Therefore we can update each layer's parameters by:

(\mathbf W_\ell)_{t+1} = (\mathbf W_\ell)_t - \gamma\dfrac{\partial \mathcal{L}}{\partial (\mathbf W_\ell)_t}
(\mathbf b_\ell)_{t+1} = (\mathbf b_\ell)_t - \gamma\dfrac{\partial \mathcal{L}}{\partial (\mathbf b_\ell)_t}

Backpropagation

To update the weights of the previous layer, we would compute

\begin{aligned} \dfrac{\partial \mathcal{L}}{\partial \mathbf W_{\ell-1}} &= \dfrac{\partial \mathcal{L}}{\partial \mathbf x_{\ell-1}} \dfrac{\partial \mathbf x_{\ell-1}}{\partial \mathbf z_{\ell-1}} \dfrac{\partial \mathbf z_{\ell-1}}{\partial \mathbf W_{\ell-1}} \\ &=\dfrac{\partial \mathcal{L}}{\partial \mathbf x_{\ell-1}}\sigma'(\mathbf z_{\ell-1})\mathbf x_{\ell-2} \end{aligned}

But how do we obtain \dfrac{\partial \mathcal{L}}{\partial \mathbf x_{\ell-1}}?

Backpropagation

While computing layer \ell's gradients,

\dfrac{\partial \mathcal{L}}{\partial \mathbf W_\ell}= \dfrac{\partial \mathcal{L}}{\partial \mathbf x_\ell} \dfrac{\partial \mathbf x_\ell}{\partial \mathbf z_\ell} \dfrac{\partial \mathbf z_\ell}{\partial \mathbf W_\ell}
\dfrac{\partial \mathcal{L}}{\partial \mathbf b_\ell}= \dfrac{\partial \mathcal{L}}{\partial \mathbf x_\ell} \dfrac{\partial \mathbf x_\ell}{\partial \mathbf z_\ell} \dfrac{\partial \mathbf z_\ell}{\partial \mathbf b_\ell}

we can also compute

\dfrac{\partial \mathcal{L}}{\partial \mathbf x_{\ell-1}}= \dfrac{\partial \mathcal{L}}{\partial \mathbf x_\ell} \dfrac{\partial \mathbf x_\ell}{\partial \mathbf z_\ell} \dfrac{\partial \mathbf z_\ell}{\partial \mathbf x_{\ell-1}}

Backpropagation

Since \mathbf{z}_\ell=\mathbf W_\ell \mathbf{x}_{\ell-1}+\mathbf b_\ell, we get

\begin{aligned} \dfrac{\partial \mathcal{L}}{\partial \mathbf x_{\ell-1}} &= \dfrac{\partial \mathcal{L}}{\partial \mathbf x_\ell} \dfrac{\partial \mathbf x_\ell}{\partial \mathbf z_\ell} \dfrac{\partial \mathbf z_\ell}{\partial \mathbf x_{\ell-1}} \\ &=2(\mathbf x_\ell-\mathbf{y})\sigma'(\mathbf z_\ell)\mathbf W_\ell \end{aligned}

Backpropagation

Continuing in this way is the idea of "backpropagation":

layer \ell computes \dfrac{\partial \mathcal{L}}{\partial \mathbf x_{\ell-1}} and passes it to layer \ell-1,

and layer \ell-1 in turn computes \dfrac{\partial \mathcal{L}}{\partial \mathbf x_{\ell-2}} and passes it to layer \ell-2, and so on.

A Hand-Written Neural Network

import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes, act, dact):
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes) - 1
        self.act = act
        self.dact = dact
        self.weights = []
        self.biases = []
        self.zetas = []    # pre-activation values z_l of each layer
        self.values = []   # activations x_l of each layer (values[0] is the input)
        for i in range(self.num_layers):
            # W_l is (outputs x inputs), b_l is a column vector, both randomly initialized
            self.weights.append(np.random.randn(layer_sizes[i+1], layer_sizes[i]))
            self.biases.append(np.random.randn(layer_sizes[i+1], 1))
            self.zetas.append(np.zeros((layer_sizes[i+1], 1)))
            self.values.append(np.zeros((layer_sizes[i], 1)))
        self.values.append(np.zeros((layer_sizes[-1], 1)))

    def forward(self, data):
        # forward propagation: x_{i+1} = act(W_i x_i + b_i), layer by layer
        self.values[0] = data
        for i in range(self.num_layers):
            self.zetas[i] = np.dot(self.weights[i], self.values[i]) + self.biases[i]
            self.values[i+1] = self.act(self.zetas[i])

    def backward(self, label, lr):
        # backpropagation: walk from the output layer back to the input layer,
        # carrying dL/dx in dx and updating each layer's weights and biases
        dx = self.values[-1] - label   # dL/dx at the output (the factor 2 is folded into lr)
        for i in range(self.num_layers):
            j = self.num_layers - i - 1
            db = dx * self.dact(self.zetas[j])   # dL/db for layer j
            dW = np.dot(db, self.values[j].T)    # dL/dW for layer j
            dx = np.dot(self.weights[j].T, db)   # dL/dx passed on to the previous layer
            self.weights[j] -= lr * dW
            self.biases[j] -= lr * db

    def predict(self, data):
        self.forward(data)
        return self.values[-1]

    def fit(self, data, labels, lr, epochs):
        for i in range(epochs):
            print(i)
            for j in range(len(data)):
                self.forward(data[j])
                self.backward(labels[j], lr)

XOR Test

As a test, we implement an XOR gate.

It is the simplest nonlinear problem: its data points,

(0,0) \rightarrow 0, \quad (0,1) \rightarrow 1, \quad (1,0) \rightarrow 1, \quad (1,1) \rightarrow 0,

cannot be separated by a single straight line.

XOR Test

from nn import NeuralNetwork  # the NeuralNetwork class above, saved as nn.py
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

if __name__ == '__main__':
    data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).reshape((4, 2, 1))
    labels = np.array([0, 1, 1, 0]).reshape((4, 1, 1))

    model = NeuralNetwork([2, 4, 4, 1], sigmoid, dsigmoid)
    model.fit(data, labels, 1, 1000)
    for x in data:
        print(model.predict(x))