# MLDS HW1 Presentation

## Outline

• Optimization -- Visualize Optimization via PCA

• Optimization -- Visualize Error Surface

• Generalization -- Flatness

• Generalization -- Sensitivity

# Visualize Optimization via PCA

## Another Trial

[Figure: a fully connected 1 -> 10 -> 1 network with input X, hidden units h, weights w1,1 ... w1,10 and w2,1 ... w2,10, biases b1,1 ... b1,10 and b2, output Y. Hidden units are exchanged pairwise (w1,1 <-> w1,2, b1,1 <-> b1,2, w2,1 <-> w2,2); models are trained on [0, 1] and on [0, 10].]
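The exchange above exploits the permutation symmetry of hidden units: swapping two hidden units (their incoming weights, biases, and outgoing weights together) gives a different parameter vector but the exact same function. A minimal numpy sketch of this idea, using an assumed toy 1-10-1 tanh network:

```python
import numpy as np

# Toy 1 -> 10 -> 1 network (hypothetical weights; tanh activation assumed)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 1)), rng.normal(size=10)
W2, b2 = rng.normal(size=(1, 10)), rng.normal(size=1)

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)   # hidden layer
    return W2 @ h + b2          # output layer

# Exchange hidden units 0 and 1: permute rows of W1/b1 and columns of W2
perm = np.arange(10)
perm[[0, 1]] = [1, 0]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = np.array([0.5])
y1 = forward(x, W1, b1, W2, b2)
y2 = forward(x, W1p, b1p, W2p, b2)
# y1 and y2 agree up to floating point, even though the parameter
# vectors differ -- i.e., two distinct points in weight space that
# realize the same local minimum.
```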

## PCA + t-SNE (1)

Four training runs + four parameter-exchanged copies

Only the parameters are exchanged

## PCA + t-SNE (2)

Results of 8-9 training runs (no parameter exchange)

$f(x) = x^2$

$f(x) = \sin(x) + \cos(x^2)$

## Conclusion

1. With PCA visualization, simply exchanging weights already produces similar-looking results.
2. Reducing too many dimensions at once with t-SNE breaks the embedding.
3. PCA + t-SNE seems able to distinguish which solutions are the same local minimum and which are not.

# Error Surface

## PCA (->10) & t-SNE (->2)

Dense network with bias (1 -> 4 -> 4 -> 1), |params| = 33
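The pipeline above can be sketched in numpy: count the 33 parameters of the 1 -> 4 -> 4 -> 1 dense network, then PCA the flattened weight snapshots down to 10 dimensions (the t-SNE step to 2 dimensions would follow, e.g. via sklearn, and is omitted here). The snapshot matrix is random stand-in data, not real checkpoints:

```python
import numpy as np

# Parameter count of a dense net with bias: 1 -> 4 -> 4 -> 1
layers = [1, 4, 4, 1]
n_params = sum(i * o + o for i, o in zip(layers, layers[1:]))
# (1*4 + 4) + (4*4 + 4) + (4*1 + 1) = 8 + 20 + 5 = 33

# Stand-in for flattened weight snapshots collected during training
rng = np.random.default_rng(0)
snapshots = rng.normal(size=(50, n_params))  # 50 hypothetical checkpoints

# PCA to 10 dims via SVD of the centered snapshot matrix
centered = snapshots - snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pca10 = centered @ Vt[:10].T  # shape (50, 10); feed this to t-SNE (->2)
```

Doing PCA first keeps t-SNE cheap and stable; feeding all 33 raw dimensions straight into t-SNE is what the conclusion slide warns against.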

# Indicator of Generalization -- Flatness

## Why Is There Flatness?

https://stats385.github.io/assets/lectures/Understanding_and_improving_deep_learing_with_random_matrix_theory.pdf

In highway..


# How to solve it?

If the dataset is too small...

# Simulated Annealing

What is the true story?

Before that...

## It Doesn't Just Sound Impressive.

It can even be applied to discrete problems!

## Simulated Annealing

Step 1: Set an initial x.

Step 2: Propose x' = x + random()

Step 3: Move to x' with acceptance probability P(x, x', T).

Step 4: Repeat Steps 2-3 while lowering the temperature T.
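The four steps above can be sketched as follows. This is a minimal 1-D toy version with assumed defaults (uniform proposal, Metropolis acceptance exp(-Δ/T), geometric cooling); the slide does not fix these choices:

```python
import math
import random

def simulated_annealing(f, x0, T0=1.0, cooling=0.95, steps=200):
    """Minimize f by the four steps above (1-D toy version)."""
    x, T = x0, T0                             # Step 1: initial state
    for _ in range(steps):
        x_new = x + random.uniform(-1, 1)     # Step 2: random neighbor
        delta = f(x_new) - f(x)
        # Step 3: accept if better, or with prob exp(-delta/T) if worse
        if delta < 0 or random.random() < math.exp(-delta / T):
            x = x_new
        T *= cooling                          # Step 4: cool down and repeat
    return x

random.seed(0)
x_min = simulated_annealing(lambda x: (x - 3) ** 2, x0=0.0)
```

Early on (high T) worse moves are accepted often, letting the search escape sharp local minima; as T shrinks, it degenerates to greedy descent.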

## Step by Step Explanation

Step 1: Set an initial state x.

Step 2: x' = x - lr * ∇f(x)

Step 3: Always accept the update x'.

Step 4: Repeat Steps 2-3.
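For comparison, the gradient-descent analogue of those steps is deterministic and always accepts the update. A minimal sketch on the same toy objective (the function and learning rate are illustrative assumptions):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """GD analogue of the SA steps: x' = x - lr * grad(x), always accepted."""
    x = x0                         # Step 1: initial state
    for _ in range(steps):
        x = x - lr * grad(x)       # Steps 2-3: move against the gradient, accept
    return x                       # Step 4: repeated until convergence

# f(x) = (x - 3)^2, so grad f(x) = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Unlike simulated annealing, there is no temperature and no chance of accepting a worse point, so GD can get stuck in whatever basin it starts in.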

# Doomsday of Overfitting

1. Data augmentation vs. noisy gradients?

2. SGD alone cannot beat the noisy gradient:
https://openreview.net/pdf?id=rkjZ2Pcxe

## WOW! Mr. Keras!

There is no convenient built-in function for this in TensorFlow or PyTorch.
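Since no framework helper exists, gradient noise can be added by hand. A minimal numpy sketch of one noisy SGD step, assuming the annealed-noise schedule σ_t² = η / (1 + t)^γ from arXiv:1511.06807 (the function name and defaults here are illustrative):

```python
import numpy as np

def noisy_sgd_step(w, grad, t, lr=0.1, eta=0.01, gamma=0.55, rng=None):
    """SGD update with annealed Gaussian gradient noise (arXiv:1511.06807)."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(eta / (1 + t) ** gamma)          # noise decays with step t
    noisy_grad = grad + rng.normal(0.0, sigma, size=np.shape(grad))
    return w - lr * noisy_grad

# Toy run: minimize (w - 3)^2 with noisy gradients
rng = np.random.default_rng(0)
w = np.array([0.0])
for t in range(500):
    g = 2 * (w - 3)                                  # exact gradient
    w = noisy_sgd_step(w, g, t, rng=rng)
```

In a real framework the same idea means perturbing each gradient tensor before the optimizer applies it, e.g. in a custom training loop.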

# With Sharpness?

## Performance - Noise/Dropout

https://arxiv.org/pdf/1511.06807.pdf