# MLDS HW1 Presentation

## Outline

• Optimization -- Visualize Optimization via PCA

• Optimization -- Visualize Error Surface

• Generalization -- Flatness

• Generalization -- Sensitivity

# Visualize Optimization via PCA

## Another Trial

[Figure: a fully connected 1 -> 10 -> 1 network with input X, hidden units h, weights w1,1 ... w1,10 and w2,1 ... w2,10, biases b1,1 ... b1,10 and b2, output Y. Hidden units are exchanged pairwise (w1,1 <-> w1,2, b1,1 <-> b1,2, w2,1 <-> w2,2); models are trained on [0, 1] and on [0, 10].]
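The exchange above exploits the permutation symmetry of hidden units: swapping two hidden units (their incoming weights, biases, and outgoing weights together) gives a different parameter vector but the exact same function. A minimal numpy sketch of this idea, using an assumed toy 1-10-1 tanh network:

```python
import numpy as np

# Toy 1 -> 10 -> 1 network (hypothetical weights; tanh activation assumed)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 1)), rng.normal(size=10)
W2, b2 = rng.normal(size=(1, 10)), rng.normal(size=1)

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)   # hidden layer
    return W2 @ h + b2          # output layer

# Exchange hidden units 0 and 1: permute rows of W1/b1 and columns of W2
perm = np.arange(10)
perm[[0, 1]] = [1, 0]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = np.array([0.5])
y1 = forward(x, W1, b1, W2, b2)
y2 = forward(x, W1p, b1p, W2p, b2)
# y1 and y2 agree up to floating point, even though the parameter
# vectors differ -- i.e., two distinct points in weight space that
# realize the same local minimum.
```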

## PCA + t-SNE (1)

Four training runs + four parameter-exchanged copies

Only the parameters are exchanged

## PCA + t-SNE (2)

Results of 8-9 training runs (no parameter exchange)

$f(x) = x^2$

$f(x) = \sin(x) + \cos(x^2)$

## Conclusion

1. With PCA visualization, simply exchanging weights already produces similar-looking results.
2. Reducing too many dimensions at once with t-SNE breaks the embedding.
3. PCA + t-SNE seems able to distinguish which solutions are the same local minimum and which are not.

# Error Surface

## PCA (->10) & t-SNE (->2)

Dense network with bias (1 -> 4 -> 4 -> 1), |params| = 33
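The pipeline above can be sketched in numpy: count the 33 parameters of the 1 -> 4 -> 4 -> 1 dense network, then PCA the flattened weight snapshots down to 10 dimensions (the t-SNE step to 2 dimensions would follow, e.g. via sklearn, and is omitted here). The snapshot matrix is random stand-in data, not real checkpoints:

```python
import numpy as np

# Parameter count of a dense net with bias: 1 -> 4 -> 4 -> 1
layers = [1, 4, 4, 1]
n_params = sum(i * o + o for i, o in zip(layers, layers[1:]))
# (1*4 + 4) + (4*4 + 4) + (4*1 + 1) = 8 + 20 + 5 = 33

# Stand-in for flattened weight snapshots collected during training
rng = np.random.default_rng(0)
snapshots = rng.normal(size=(50, n_params))  # 50 hypothetical checkpoints

# PCA to 10 dims via SVD of the centered snapshot matrix
centered = snapshots - snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pca10 = centered @ Vt[:10].T  # shape (50, 10); feed this to t-SNE (->2)
```

Doing PCA first keeps t-SNE cheap and stable; feeding all 33 raw dimensions straight into t-SNE is what the conclusion slide warns against.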

# Indicator of Generalization -- Flatness

## Why Is There Flatness?

https://stats385.github.io/assets/lectures/Understanding_and_improving_deep_learing_with_random_matrix_theory.pdf

In highway..


# How to solve it?

If the dataset is too small...

# Simulated Annealing

What is the true story?

Before that...

## It Doesn't Just Sound Impressive.

It can even be applied to discrete problems!

## Simulated Annealing

Step 1: Set an initial x.

Step 2: Propose x' = x + random()

Step 3: Move to x' with acceptance probability P(x, x', T).

Step 4: Repeat Steps 2-3 while lowering the temperature T.
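The four steps above can be sketched as follows. This is a minimal 1-D toy version with assumed defaults (uniform proposal, Metropolis acceptance exp(-Δ/T), geometric cooling); the slide does not fix these choices:

```python
import math
import random

def simulated_annealing(f, x0, T0=1.0, cooling=0.95, steps=200):
    """Minimize f by the four steps above (1-D toy version)."""
    x, T = x0, T0                             # Step 1: initial state
    for _ in range(steps):
        x_new = x + random.uniform(-1, 1)     # Step 2: random neighbor
        delta = f(x_new) - f(x)
        # Step 3: accept if better, or with prob exp(-delta/T) if worse
        if delta < 0 or random.random() < math.exp(-delta / T):
            x = x_new
        T *= cooling                          # Step 4: cool down and repeat
    return x

random.seed(0)
x_min = simulated_annealing(lambda x: (x - 3) ** 2, x0=0.0)
```

Early on (high T) worse moves are accepted often, letting the search escape sharp local minima; as T shrinks, it degenerates to greedy descent.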

## Step by Step Explanation

Step 1: Set an initial state x.

Step 2: x' = x - lr * ∇f(x)

Step 3: Always accept the update x'.

Step 4: Repeat Steps 2-3.
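For comparison, the gradient-descent analogue of those steps is deterministic and always accepts the update. A minimal sketch on the same toy objective (the function and learning rate are illustrative assumptions):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """GD analogue of the SA steps: x' = x - lr * grad(x), always accepted."""
    x = x0                         # Step 1: initial state
    for _ in range(steps):
        x = x - lr * grad(x)       # Steps 2-3: move against the gradient, accept
    return x                       # Step 4: repeated until convergence

# f(x) = (x - 3)^2, so grad f(x) = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Unlike simulated annealing, there is no temperature and no chance of accepting a worse point, so GD can get stuck in whatever basin it starts in.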

# Doomsday of Overfitting

1. Data augmentation vs. noisy gradients?

2. SGD alone cannot beat the noisy gradient:
https://openreview.net/pdf?id=rkjZ2Pcxe

## WOW! Mr. Keras!

There is no convenient built-in function for this in TensorFlow or PyTorch.
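Since no framework helper exists, gradient noise can be added by hand. A minimal numpy sketch of one noisy SGD step, assuming the annealed-noise schedule σ_t² = η / (1 + t)^γ from arXiv:1511.06807 (the function name and defaults here are illustrative):

```python
import numpy as np

def noisy_sgd_step(w, grad, t, lr=0.1, eta=0.01, gamma=0.55, rng=None):
    """SGD update with annealed Gaussian gradient noise (arXiv:1511.06807)."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(eta / (1 + t) ** gamma)          # noise decays with step t
    noisy_grad = grad + rng.normal(0.0, sigma, size=np.shape(grad))
    return w - lr * noisy_grad

# Toy run: minimize (w - 3)^2 with noisy gradients
rng = np.random.default_rng(0)
w = np.array([0.0])
for t in range(500):
    g = 2 * (w - 3)                                  # exact gradient
    w = noisy_sgd_step(w, g, t, rng=rng)
```

In a real framework the same idea means perturbing each gradient tensor before the optimizer applies it, e.g. in a custom training loop.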

# With Sharpness?

## Performance - Noise/Dropout

https://arxiv.org/pdf/1511.06807.pdf