Building a deep neural network

&

How can we use it as a density estimator?

Based on :

[1] My tech post, "A neural network wasn't built in a day" (a story about neural networks) (2017)

[2] arXiv:1903.01998, 1909.06296, 2002.07656, 2008.03312; PRL 124, 041102 (2020)

Journal Club - Oct 20, 2020

Content

  • What is a Neural Network (NN) anyway?
    • One neuron
    • One layer of neurons
    • Basic types of neural networks in academic papers
  • A concise summary of current GW ML parameter estimation studies
    • MAP, CVAE, Flow
    • GSN (optional)
  • What is a Neural Network (NN) anyway?

Objective:

  • Input a sample to a function (our NN).
  • Then evaluate how "close" it is to the truth (also called the label).
  • E.g.:
    • Input: a sample \(\{x_1, x_2, \dots, x_n\}\)
    • Output: "Yes or No" (\(\{0 \text{ or } 1\}\)), a number, a sequence, ...
What happens in a neuron? (what a single neuron can do)

\sum_{i} w_{i} x_{i}+b=w_{0} x_{0}+w_{1} x_{1}+\cdots+w_{D-1} x_{D-1}+b
\underbrace{\left[\sum_{i} w_{i} x_{i}+b\right]}_{1 \times 1}=\underbrace{\left[\cdots \quad x_{i} \quad \ldots\right]}_{1 \times D} \cdot \underbrace{\left[\begin{array}{c} \vdots \\ w_{i} \\ \vdots \end{array}\right]}_{D \times 1}+\underbrace{[b]}_{1 \times 1}
  • Initialize a weight vector and a bias scalar randomly in one neuron.
  • Input a sample, then it gives us an output.

Analogy: the sample is "your performance", the neuron is "one judge", and the output is a "score".

What happens in a neuron? (what a single neuron can do)

  • Initialize a weight vector and a bias vector randomly in one neuron.
  • Input some samples, then it gives us some outputs.

"a bunch guys' show"

Objective:

  • Input a sample to a function (our NN).
  • Then evaluate how is it "close" to the truth (also called label)

"scores"

"one judge"

\underbrace{\left[\begin{array}{c} \sum_{i} w_{i} x_{i}+b \\ \vdots \end{array}\right]}_{N \times 1}=\underbrace{\left[\begin{array}{ccc} \cdots & x_{i} & \cdots \\ \vdots & & \end{array}\right]}_{N \times D} \cdot \underbrace{\left[\begin{array}{c} \vdots \\ w_{i} \\ \vdots \end{array}\right]}_{D \times 1}+\underbrace{\left[\begin{array}{c} b \\ \vdots \\ \vdots \end{array}\right]}_{N \times 1}
What happens in a neuron? (what a single neuron can do)

  • Initialize a weight vector and a bias vector randomly in one neuron.
  • Input some samples, then it gives us some outputs.
  • Pass each output through a nonlinear activation function, here ReLU:
f(x)=\max (0, x)


Generalize to one layer of neurons (neurons arranged in a layer)

  • Initialize a weight matrix and a bias vector randomly in one layer.
  • Input one sample, then it gives us an output vector (one entry per neuron).
\underbrace{\left[\begin{array}{ll}\sum_{i} w_{i} x_{i}+b & \cdots\end{array}\right]}_{1 \times 10}=\underbrace{\left[\begin{array}{lll}\cdots & x_{i} & \cdots\end{array}\right]}_{1 \times D} \cdot \underbrace{\left[\begin{array}{cc}\vdots & \\ w_{i} & \cdots \\ \vdots & \end{array}\right]}_{D \times 10}+\underbrace{\left[\begin{array}{ll}b & \cdots\end{array}\right]}_{1 \times 10 \text{ (broadcasting)}}

"score"

"10 judges"

"your performance"

Generalize to one layer of neurons (neurons arranged in a layer)

  • Initialize a weight matrix and a bias vector randomly in one layer.
  • Input some samples, then it gives us an output matrix (one row per sample, one column per neuron).

Analogy: "a bunch of guys' shows" rated by "10 judges" gives a table of "scores".

\underbrace{\left[\begin{array}{cc}\sum_{i} w_{i} x_{i}+b & \cdots \\ \vdots & \end{array}\right]}_{N \times 10}=\underbrace{\left[\begin{array}{ccc}\cdots & x_{i} & \cdots \\ \vdots & & \end{array}\right]}_{N \times D} \cdot \underbrace{\left[\begin{array}{cc}\vdots & \\ w_{i} & \cdots \\ \vdots & \end{array}\right]}_{D \times 10}+\underbrace{\left[\begin{array}{cc}b & \cdots \\ \vdots & \end{array}\right]}_{N \times 10 \text{ (broadcasting)}}
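A sketch of one layer of 10 neurons applied to a batch, matching the shapes in the equation above; the sizes N and D are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, H = 4, 8, 10               # N samples, D features, H = 10 neurons ("judges")
X = rng.normal(size=(N, D))      # batch of samples

W = rng.normal(size=(D, H))      # weight matrix: one column per neuron
b = rng.normal(size=(1, H))      # bias vector: one bias per neuron, broadcast over rows

scores = X @ W + b               # shape (N, H): 10 "scores" for each of the N samples
print(scores.shape)              # (4, 10)
```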
Fully-connected neural layers

In each layer: input → output, where the number of neurons in the layer sets the width of the output.

  • For a fully-connected neural network, the only hyper-parameter we need to fix in advance is the number of weight columns (i.e., the number of neurons) in each layer.
  • Only one word on the evaluation... (no more slides):
    • Loss / cost / error function
    • Gradient descent / back-propagation algorithm

Draw how one data sample flows through the network, and how the shape of the data changes (see the sketch below).
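A sketch of a small fully-connected network in NumPy, showing how one batch flows through the layers and how its shape changes; the layer widths, the loss, and all sizes are illustrative choices (gradient descent / back-propagation would then update Ws and bs):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 32, 16                    # batch of N samples with D features (illustrative)
widths = [D, 10, 10, 1]          # input -> two hidden layers of 10 neurons -> 1 output

Ws = [0.1 * rng.normal(size=(widths[i], widths[i + 1])) for i in range(len(widths) - 1)]
bs = [np.zeros((1, widths[i + 1])) for i in range(len(widths) - 1)]

X = rng.normal(size=(N, D))
y = rng.normal(size=(N, 1))      # the "truth" / labels

# Forward pass: the data shape changes (N, 16) -> (N, 10) -> (N, 10) -> (N, 1)
h = X
for i, (W, b) in enumerate(zip(Ws, bs)):
    h = h @ W + b
    if i < len(Ws) - 1:
        h = np.maximum(0.0, h)   # ReLU on the hidden layers

loss = np.mean((h - y) ** 2)     # a loss / cost / error function (here mean squared error)
print(h.shape, loss)
```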

Convolution is a specialized kind of linear operation. (A convolutional layer is a special case of a fully-connected layer.)

  • Integral form:
s(t)=\int x(\tau) w(t-\tau) d \tau=:(x * w)(t)
\begin{aligned} \left(a_{1} x_{1}+a_{2} x_{2}\right) * w &=a_{1}\left(x_{1} * w\right)+a_{2}\left(x_{2} * w\right) \,&(\text{linearity})\\ (x * w)(t-T) &=x(t-T) * w(t) \,&(\text{time invariance}) \end{aligned}
  • Discrete form:
\begin{aligned} &x[n], n=0,1, \ldots, D-1;\\ &w[n], n=0,1, \ldots, M-1 \end{aligned}
s[n]=\sum_{m=\max (0, n-M+1)}^{\min (n, D-1)} x[m] \cdot w[n-m], \quad n=0,1, \ldots, D+M-2
  • Flip-and-slide form: slide the flipped kernel \(w\) across \(x\).
  • Matrix form (here \(D=5\), \(M=4\)):
\mathbf{s}=\left[\begin{array}{llll} s_{0} & s_{1} & \cdots & s_{7} \end{array}\right]=\left[\begin{array}{llll} x_{0} & x_{1} & \cdots & x_{4} \end{array}\right] \cdot\left[\begin{array}{cccccccc} w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 & 0 & 0 \\ 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 & 0 \\ 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 \\ 0 & 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 \\ 0 & 0 & 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} \end{array}\right]=\mathbf{x} \cdot \mathbf{w}
  • Parameter sharing
  • Sparse interactions
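A sketch comparing the matrix form with the flip-and-slide (discrete) form in NumPy, using the same D = 5 and M = 4 as above; the random inputs are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 5, 4                           # D input samples, M kernel taps (as on the slide)
x = rng.normal(size=D)
w = rng.normal(size=M)

# Matrix form: a (D, D + M - 1) matrix whose row m carries the kernel shifted by m
W = np.zeros((D, D + M - 1))
for m in range(D):
    W[m, m:m + M] = w                 # W[m, n] = w[n - m]

s_matrix = x @ W                      # s[n] = sum_m x[m] * w[n - m]
s_direct = np.convolve(x, w)          # flip-and-slide / discrete convolution, length D + M - 1

print(np.allclose(s_matrix, s_direct))   # True
```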
Neural networks in academic papers (neural networks in the GW literature)

  • PRD 100, 063015 (2019)
  • Mach. Learn.: Sci. Technol. 1, 025014 (2020)
  • arXiv:2003.09995
  • Expert Systems With Applications 151 (2020) 113378

"All of the current GW ML parameter estimation studies are still at the proof-of-principle stage" [2005.03745]

  • A concise summary of current GW ML parameter estimation studies

Real-time regression

  • Huerta's group [PRD, 2018, 97, 044039], [PLB, 2018, 778, 64-70], [1903.01998], [2004.09524]
  • Fan et al. [SCPMA, 62, 969512 (2019)]
  • Carrillo et al. [GRG, 2016, 48, 141], [IJMP, 2017, D27, 1850043]
  • Santos et al. [2003.09995]
  • *Li et al. [2003.13928] (Bayesian neural networks)

Explicit posterior density

  • Chua et al. [PRL, 2020, 124, 041102]
    • Produce Bayesian posteriors using neural networks.
  • Gabbard et al. [1909.06296] (CVAE)
    • Produce samples from the posterior.
    • 256 Hz, 1 s, 4-D, simulated noise.
    • Strong agreement between Bilby and the CVAE.
  • Yamamoto & Tanaka [2002.12095] (CVAE)
    • QNM frequency estimation.
  • Green et al. [2002.07656, 2008.03312] (MAF, CVAE+, flow-based)
    • The best results so far.

 

Suppose we have a posterior distribution \(p_{true}(x|y)\), where \(y\) is the GW data and \(x\) are the corresponding parameters.

  • The aim is to train a neural network to give an approximation \(p(x|y)\) to \(p_{true}(x|y)\).
  • Take the expectation value (over \(y\)) of the cross-entropy (equivalently, the KL divergence up to a constant) between the two distributions as the loss function:
L=-\int d x\, d y \,p_{\mathrm{true}}(x | y)\, p_{\mathrm{true}}(y) \log p(x | y) = \mathbb{E}_{y\sim p_{true}(y)}\mathbb{E}_{x\sim p_{true}(x|y)} [-\log p(x | y) ]

The target \(p_{true}(x|y)\) is fixed, but sampling from it is costly. Using Bayes' theorem, rewrite the loss as an expectation over the prior and the likelihood instead:

L=-\int d x\, d y \,p_{\mathrm{true}}(y | x)\, p_{\mathrm{true}}(x) \log p(x | y) = \mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)} [-\log p(x | y) ]
\approx-\frac{1}{N} \sum_{i=1}^{N} \log p\left(x^{(i)} | y^{(i)}\right), \quad x^{(i)} \sim p_{\mathrm{true}}(x) ,\; y^{(i)} \sim p_{\mathrm{true}}\left(y | x^{(i)}\right) \text{ (sampling from the prior and the likelihood)}
Chua et al. [PRL, 2020, 124, 041102] assume a multivariate normal distribution with weights:

\begin{aligned} p(x | y)=& \frac{\omega}{\sqrt{(2 \pi)^{n}|\operatorname{det} \Sigma(y)|}} \times \\ & \exp \left(-\frac{1}{2} \sum_{i j=1}^{n}\left(x_{i}-\mu_{i}(y)\right) \Sigma_{i j}^{-1}(y)\left(x_{j}-\mu_{j}(y)\right)\right) \end{aligned}

[Diagram: a neural network maps \(y^{(i)}\) to \(\mu^{(i)}, \Sigma^{(i)}, \omega\), which together with \(x^{(i)}\) feed the loss \(L\).]
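A minimal PyTorch sketch of this training scheme, with a diagonal covariance and a single Gaussian for simplicity (Chua et al. use a full covariance and mixture weights ω); the architecture, dimensions, and the stand-in prior/likelihood samples are illustrative assumptions:

```python
import torch
import torch.nn as nn

dim_y, dim_x = 256, 4                         # illustrative sizes for y (data) and x (parameters)

# Network: y -> (mu(y), log sigma(y)) of a Gaussian approximation p(x|y)
net = nn.Sequential(nn.Linear(dim_y, 128), nn.ReLU(), nn.Linear(128, 2 * dim_x))

def loss_fn(y_batch, x_batch):
    mu, log_sigma = net(y_batch).chunk(2, dim=1)
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    # Monte-Carlo estimate of L = E_{x~prior} E_{y~likelihood} [-log p(x|y)]
    return -dist.log_prob(x_batch).sum(dim=1).mean()

# One training step; x and y are stand-ins for draws from the prior and the likelihood
x = torch.rand(512, dim_x)                    # x^(i) ~ p_true(x)
y = torch.randn(512, dim_y)                   # y^(i) ~ p_true(y | x^(i)), e.g. waveform(x) + noise
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
opt.zero_grad()
loss_fn(y, x).backward()
opt.step()
```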


Gabbard et al. [1909.06296] (CVAE)

\log p(x | y)=\mathbb{E}_{q(z | x, y)} \log p(x | y)
\equiv\mathcal{L}_{ELBO} + D_{KL}[q(z|x,y)\,||\,p(z|x,y)]
L \leq \mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)} [-\mathcal{L}_{ELBO} ]
= \mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)}\{-\mathbb{E}_{z\sim q(z|x,y)}\log[p(x|z,y)] +D_{KL}[q(z|x,y)\,||\,p(z|y)] \}

Since \(\mathcal{L}_{ELBO}\) is a lower bound on \(\log p(x|y)\), we minimise this upper bound on \(L\) instead of \(L\) itself.
[Diagram: one network maps \((x^{(i)}, y^{(i)})\) to \(\mu_q^{(i)}, \Sigma_q^{(i)}\), another maps \(y^{(i)}\) to \(\mu_p^{(i)}, \Sigma_p^{(i)}\); latent samples \(z^{(j)}\) drawn from \(q\) are decoded by a third network into \(\mu^{(j)}, \Sigma^{(j)}\).]
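A minimal sketch of the CVAE loss (the expectation of −L_ELBO), with single linear layers standing in for the three networks and diagonal Gaussians throughout; the E1/E2/D names follow the later CVAE slides, while the dimensions and architecture are illustrative assumptions, not those of Gabbard et al.:

```python
import torch
import torch.nn as nn

dim_y, dim_x, dim_z = 256, 5, 8                # illustrative sizes

E1 = nn.Linear(dim_y, 2 * dim_z)               # prior encoder:       y      -> (mu_p, log sig_p)
E2 = nn.Linear(dim_x + dim_y, 2 * dim_z)       # recognition encoder: (x, y) -> (mu_q, log sig_q)
D  = nn.Linear(dim_z + dim_y, 2 * dim_x)       # decoder:             (z, y) -> (mu_x, log sig_x)

def neg_elbo(x, y):
    mu_p, ls_p = E1(y).chunk(2, dim=1)
    mu_q, ls_q = E2(torch.cat([x, y], dim=1)).chunk(2, dim=1)

    z = mu_q + ls_q.exp() * torch.randn_like(mu_q)          # reparameterized z ~ q(z|x,y)

    mu_x, ls_x = D(torch.cat([z, y], dim=1)).chunk(2, dim=1)
    recon = -torch.distributions.Normal(mu_x, ls_x.exp()).log_prob(x).sum(dim=1)

    # Closed-form KL[ q(z|x,y) || p(z|y) ] for diagonal Gaussians
    kl = (ls_p - ls_q + (ls_q.exp() ** 2 + (mu_q - mu_p) ** 2) / (2 * ls_p.exp() ** 2) - 0.5).sum(dim=1)

    return (recon + kl).mean()                               # estimate of E[-L_ELBO]

x = torch.rand(512, dim_x)                     # x ~ prior (stand-in)
y = torch.randn(512, dim_y)                    # y ~ likelihood (stand-in)
print(neg_elbo(x, y))
```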


Green et al. [2002.07656] (MAF, CVAE+)

L=\mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)} \left[-\log \mathcal{N}(0,1)^{n}\left(f^{-1}(x)\right) -\log \left|\operatorname{det} \frac{\partial\left(f_{1}^{-1}, \ldots, f_{n}^{-1}\right)}{\partial\left(x_{1}, \ldots, x_{n}\right)}\right|\right]
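A sketch of this change-of-variables loss using a single conditional affine transform x = μ(y) + exp(s(y))·u with u ~ N(0,1)^n, rather than the masked autoregressive flow of Green et al.; all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

dim_y, dim_x = 256, 5                        # illustrative sizes

cond = nn.Linear(dim_y, 2 * dim_x)           # y -> (mu(y), s(y)) conditioning the transform

def flow_nll(x, y):
    mu, s = cond(y).chunk(2, dim=1)
    u = (x - mu) * torch.exp(-s)             # u = f^{-1}(x)
    base = torch.distributions.Normal(0.0, 1.0)
    log_det = -s.sum(dim=1)                  # log |det d f^{-1} / d x| = -sum_i s_i(y)
    return -(base.log_prob(u).sum(dim=1) + log_det).mean()

x = torch.rand(512, dim_x)                   # x ~ prior (stand-in)
y = torch.randn(512, dim_y)                  # y ~ likelihood (stand-in)
print(flow_nll(x, y))
```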


CVAE (Train)

 

[Training diagram: strain \(Y\) (·, 256) and parameters \(X\) (·, 5) are the inputs; training set \(N=10^6\), batch size 512.
Encoder E2: \((X, Y) \mapsto \vec{\mu}(X,Y), \boldsymbol{\Sigma}(X,Y)\); encoder E1: \(Y \mapsto \vec{\mu}(Y), \boldsymbol{\Sigma}(Y)\) (latent space).
Sample \(z\) from \(\mathcal{N}\left(\vec{\mu}, \boldsymbol{\Sigma}^{2}\right)\), with the penalty term \(KL[\mathcal{N}(\vec{\mu}_X, \boldsymbol{\Sigma}^2_X) \,||\, \mathcal{N}(\vec{\mu}_Y, \boldsymbol{\Sigma}^2_Y)]\).
Decoder D: \((z, Y) \mapsto X'\) (·, 5), output as \([(\mu_{m_1}, \sigma_{m_1}), \ldots]\) over the parameters.]

  • The key is to notice that any distribution in \(d\) dimensions can be generated by taking a set of \(d\) variables that are normally distributed and mapping them through a sufficiently complicated function.

CVAE: conditional variational autoencoder


\log P(X|Y)-\mathcal{D}[Q(z | X,Y) \| P(z | X,Y)]=E_{z \sim Q}[\log P(X | z,Y)]-\mathcal{D}[Q(z | X,Y) \| P(z|Y)]
=:L_{ELBO}(P,Q,Y)

(Here E1 and the decoder D model the \(P\) distributions, \(P(z|Y)\) and \(P(X|z,Y)\); the encoder E2 models \(Q(z|X,Y)\).)

Objective: maximise \(L_{ELBO}\) (the variational lower bound associated with the data point \(X\))

ELBO: Evidence Lower Bound

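The KL term in the diagram has a closed form for diagonal Gaussians; a quick check against PyTorch's built-in kl_divergence (the 8-dimensional latent size follows the shapes on the slide, the numbers are random stand-ins):

```python
import torch
from torch.distributions import Normal, kl_divergence

mu_X, sig_X = torch.randn(8), torch.rand(8) + 0.1      # N(mu_X, Sigma_X^2) from E2
mu_Y, sig_Y = torch.randn(8), torch.rand(8) + 0.1      # N(mu_Y, Sigma_Y^2) from E1

# Per-dimension closed form: log(sig_Y/sig_X) + (sig_X^2 + (mu_X - mu_Y)^2) / (2 sig_Y^2) - 1/2
kl_closed = (torch.log(sig_Y / sig_X)
             + (sig_X ** 2 + (mu_X - mu_Y) ** 2) / (2 * sig_Y ** 2) - 0.5).sum()

kl_torch = kl_divergence(Normal(mu_X, sig_X), Normal(mu_Y, sig_Y)).sum()
print(torch.allclose(kl_closed, kl_torch))             # True
```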

CVAE (Test)

 

[Test diagram: feed only the strain \(Y\) (·, 256) to encoder E1 to get \(\vec{\mu}(Y), \boldsymbol{\Sigma}(Y)\); repeatedly sample \(z\) from \(\mathcal{N}\left(\vec{\mu}, \boldsymbol{\Sigma}^{2}\right)\) and decode each \(z\) with D (conditioned on \(Y\)) into \(X'\) (·, 5), i.e. \([(\mu_{m_1}, \sigma_{m_1}), \ldots]\); the collection of decoded outputs gives posterior samples.]


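A sketch of the test-time procedure: encode the strain with E1, draw many latent samples, and decode each one; the layers and dimensions are the same illustrative stand-ins as before, not the trained networks of Gabbard et al.:

```python
import torch
import torch.nn as nn

dim_y, dim_x, dim_z = 256, 5, 8
E1 = nn.Linear(dim_y, 2 * dim_z)                 # strain -> latent Gaussian (stand-in for trained E1)
D  = nn.Linear(dim_z + dim_y, 2 * dim_x)         # (z, strain) -> parameter Gaussian (stand-in for D)

y = torch.randn(1, dim_y)                        # one observed strain Y
n_samples = 10_000

mu, log_sig = E1(y).chunk(2, dim=1)
z = mu + log_sig.exp() * torch.randn(n_samples, dim_z)          # z ~ N(mu, sigma^2)

mu_x, log_sig_x = D(torch.cat([z, y.expand(n_samples, -1)], dim=1)).chunk(2, dim=1)
x_samples = mu_x + log_sig_x.exp() * torch.randn_like(mu_x)     # posterior samples of X
print(x_samples.shape)                                          # torch.Size([10000, 5])
```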

CVAE

 

 

Drawbacks:

  • KL divergence
  • \(L_{ELBO}\)

Mystery:

  • How could it still work that well?
    (for simple cases only)

The final slide. 💪
