Building a deep neural network
&
How could we use it as a density estimator
Based on :
[1] My tech post, "A neural network wasn't built in a day" (a story about neural networks) (2017)
[2] arXiv:1903.01998, 1909.06296, 2002.07656, 2008.03312; PRL 124, 041102 (2020)
Journal Club - Oct 20, 2020
Content
- What is a Neural Network (NN) anyway?
- One neuron
- One layer of neurons
- Basic types of neural networks in academic papers
- A concise summary of current GW ML parameter estimation studies
- MAP, CVAE, Flow
- GSN (optional)
- What is a Neural Network (NN) anyway?
Objective:
- Input a sample to a function (our NN).
- Then evaluate how "close" it is to the truth (also called the label)
- E.g. (a minimal sketch follows this list):
Yes or no
A number
A sequence
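To make "how close it is to the truth" concrete, here is a minimal sketch (my own toy numbers, not from the talk) of one common loss for each of the three cases: binary cross-entropy for yes/no, squared error for a number, and a mean of element-wise errors for a sequence.

```python
import numpy as np

# Yes or no: a binary label compared with a predicted probability.
y_true, p_pred = 1.0, 0.8
bce = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# A number: compare with the squared error.
x_true, x_pred = 3.5, 3.1
mse = (x_true - x_pred) ** 2

# A sequence: average the element-wise squared errors.
s_true = np.array([0.1, 0.5, 0.9])
s_pred = np.array([0.2, 0.4, 1.0])
seq_mse = np.mean((s_true - s_pred) ** 2)

print(bce, mse, seq_mse)
```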
What happens in a single neuron? (what one neuron can do)
- Initialize a weight vector and a bias scalar randomly in one neuron.
- Input one sample, then it gives us one output: "one judge" watches "your performance" and gives a "score".
- Input a batch of samples, then it gives us a batch of outputs: "one judge" watches "a bunch of guys' show" and gives many "scores".
- Pass the weighted sum through a nonlinear activation, e.g. ReLU.
Generalize to one layer of neurons (neurons arranged in a layer)
- Initialize a weight matrix and a bias vector randomly in one layer.
- Input one sample, then the layer gives us a vector of outputs: "10 judges" each give their own "score" for "your performance" (or for "a bunch of guys' show").
Fully-connected neural layers
- In each layer: an input, an output, and the number of neurons in this layer.
- Draw how one sample of data flows through the network, and how the shape of the data changes (see the sketch below).
- For a fully-connected neural network, the number of weight columns (i.e. neurons) in each layer is the only hyper-parameter we need to fix first.
- Only one word on the evaluation... (no more slides):
  - Loss/cost/error function
  - Gradient descent / backward propagation (backpropagation) algorithm
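A minimal NumPy sketch of a fully-connected network (my own code, with arbitrary layer sizes): the forward pass shows how the shape of the data changes layer by layer, and a single hand-written gradient-descent/backpropagation step updates the weights against an MSE loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# The only hyper-parameters fixed in advance: neurons (weight columns) per layer.
sizes = [16, 32, 8, 1]                    # input dim -> hidden -> hidden -> output
Ws = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

X = rng.normal(size=(512, sizes[0]))      # a batch of samples
y = np.abs(rng.normal(size=(512, 1)))     # toy labels (non-negative, since the output is ReLU)

# Forward pass: watch how the shape of the data changes layer by layer.
a, acts = X, [X]
for W, b in zip(Ws, bs):
    a = relu(a @ W + b)
    acts.append(a)
    print(a.shape)                        # (512, 32) -> (512, 8) -> (512, 1)

loss = np.mean((a - y) ** 2)              # loss/cost/error function (MSE)
print(loss)

# One step of gradient descent via backward propagation.
lr = 1e-2
grad = 2 * (a - y) / len(X)               # dL/d(output)
for i in reversed(range(len(Ws))):
    grad = grad * (acts[i + 1] > 0)       # backprop through ReLU
    gW = acts[i].T @ grad                 # gradient for this layer's weights
    gb = grad.sum(axis=0)                 # gradient for this layer's biases
    grad = grad @ Ws[i].T                 # pass the gradient to the previous layer
    Ws[i] -= lr * gW
    bs[i] -= lr * gb
```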
Convolution is a specialized kind of linear operation. (A convolutional layer is a special case of a fully-connected layer.)
- Flip-and-slide form
- Matrix form
- Integral form
- Discrete form
- Parameter sharing
- Sparse interactions
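A small NumPy check (my own illustration) that the discrete, flip-and-slide convolution can be written in matrix form as a fully-connected layer whose weight matrix is banded (sparse interactions) and reuses the same few kernel values in every row (parameter sharing).

```python
import numpy as np

x = np.arange(8.0)              # input signal
k = np.array([1.0, 0.0, -1.0])  # kernel
n, m = len(x), len(k)

# "Valid" discrete convolution (flip-and-slide form).
y_slide = np.convolve(x, k, mode="valid")

# Matrix form: each output row is the flipped kernel placed on a diagonal band.
W = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    W[i, i:i + m] = k[::-1]
y_matrix = W @ x

print(np.allclose(y_slide, y_matrix))   # True
```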
Neural networks in academic papers (neural networks in the GW literature)
PRD 100, 063015 (2019)
Mach. Learn.: Sci. Technol. 1, 025014 (2020)
arXiv:2003.09995
Expert Systems With Applications 151, 113378 (2020)
"All of the current GW ML parameter estimation studies are still at the proof-of-principle stage" [2005.03745]
- A concise summary of current GW ML parameter estimation studies
Real-time regression
- Huerta's Group [PRD, 2018, 97, 044039], [PLB, 2018, 778, 64-70], [1903.01998], [2004.09524]
- Fan et al. [SCPMA, 62, 969512 (2019)]
- Carrillo et al. [GRG, 2016, 48, 141], [IJMP, 2017, D27, 1850043]
- Santos et al. [2003.09995]
- *Li et al. [2003.13928] (Bayesian neural networks)
Explicit posterior density
- Chua et al. [PRL, 2020, 124, 041102]
- produce Bayesian posteriors using neural networks.
- Gabbard et al. [1909.06296] (CVAE)
- produce samples from the posterior.
- 256 Hz, 1 s, 4-D, simulated noise
- Strong agreement between Bilby and the CVAE.
- Yamamoto & Tanaka [2002.12095] (CVAE)
- QNM (quasi-normal mode) frequency estimation
- Green et al. [2002.07656, 2008.03312] (MAF, CVAE+, flow-based)
- the best results so far
Suppose we have a posterior distribution \(p_{true}(x|y)\). (\(y\) is the GW data, \(x\) is the corresponding parameters)
- The aim is to train a neural network to give an approximation \(p(x|y)\) to \(p_{true}(x|y)\).
- Take the expectation value (over \(y\)) of the cross-entropy (KL divergence) between the two distributions as the loss function:
\[
L = \mathbb{E}_{p_{true}(y)}\!\left[-\int dx\; p_{true}(x|y)\,\log p(x|y)\right]
  = -\int dy\, dx\; p_{true}(y)\, p_{true}(x|y)\,\log p(x|y)
\]
- As written, this requires samples from the fixed \(p_{true}(x|y)\), which is exactly the costly sampling we want to avoid.
- Bayes' theorem, \(p_{true}(y)\,p_{true}(x|y) = p_{true}(x)\,p_{true}(y|x)\), turns it into an expectation over the prior and the likelihood, so the loss can be estimated with cheap forward simulations:
\[
L = -\int dx\, dy\; p_{true}(x)\, p_{true}(y|x)\,\log p(x|y)
  \approx -\frac{1}{N}\sum_{i=1}^{N}\log p(x_i|y_i),\qquad x_i \sim p_{true}(x),\; y_i \sim p_{true}(y|x_i)
\]
Chua et al. [PRL, 2020, 124, 041102] assume a multivariate normal distribution whose weights are given by a neural network (NN).
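A hedged PyTorch sketch of this training loop (my own toy code, not the authors' implementation): parameters \(x\) are drawn from the prior, data \(y\) from a toy simulator standing in for the likelihood, and a network outputs the mean and a diagonal covariance of a normal \(p(x|y)\); averaging \(-\log p(x_i|y_i)\) is the Monte-Carlo estimate of the loss above. The simulator, sizes and diagonal covariance are illustrative assumptions (the paper uses richer distributions).

```python
import torch
import torch.nn as nn

dim_x, dim_y = 5, 256
net = nn.Sequential(nn.Linear(dim_y, 128), nn.ReLU(),
                    nn.Linear(128, 2 * dim_x))        # mean and log-variance per parameter
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def simulate(n):
    """Hypothetical toy simulator: x ~ prior, y ~ likelihood(y|x)."""
    x = torch.rand(n, dim_x)                           # prior samples
    t = torch.linspace(0, 10, dim_y)
    signal = x[:, :1] * torch.sin(t) + x[:, 1:2] * torch.cos(t)
    y = signal + 0.1 * torch.randn(n, dim_y)           # add noise
    return x, y

for step in range(100):
    x, y = simulate(512)
    mu, log_var = net(y).chunk(2, dim=-1)
    q = torch.distributions.Normal(mu, torch.exp(0.5 * log_var))
    loss = -q.log_prob(x).sum(-1).mean()               # Monte-Carlo estimate of the cross-entropy
    opt.zero_grad(); loss.backward(); opt.step()
```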
Gabbard et al. [1909.06296] (CVAE)
Green et al. [2002.07656] (MAF, CVAE+)
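Green et al. model the posterior with normalizing flows such as a masked autoregressive flow (MAF). A minimal 1-D sketch (my own, not their code) of the underlying change-of-variables idea: push normally distributed variables through an invertible map and keep track of the Jacobian.

```python
import torch

# x = f(z) with z ~ N(0, 1);  log p(x) = log N(z; 0, 1) - log |df/dz|
z = torch.randn(10000)
a, b = 0.5, 2.0                                        # illustrative parameters; a flow learns these
x = torch.exp(a * z) + b                               # invertible, "sufficiently complicated" map
log_det = torch.log(torch.abs(a * torch.exp(a * z)))   # |df/dz|
base = torch.distributions.Normal(0.0, 1.0)
log_p_x = base.log_prob(z) - log_det                   # exact density of the transformed samples
print(x.mean(), log_p_x.mean())
```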
CVAE: conditional variational autoencoder
- The key is to notice that any distribution in \(d\) dimensions can be generated by taking a set of \(d\) variables that are normally distributed and mapping them through a sufficiently complicated function.

CVAE (Train)
[Diagram of the training flow: the strain \(Y\) (·, 256) feeds encoder E1; the params \(X\) (·, 5) together with \(Y\) feed encoder E2; a KL term compares the two latent-space distributions; a sample \(z\) is drawn from \(\mathcal{N}\left(\vec{\mu}, \boldsymbol{\Sigma}^{2}\right)\) (FYI: if the latent space has 2 dims, the relevant shapes are (1, 8), (2, 8) and (·, 8)); the decoder D, conditioned on \(Y\), outputs \([(\mu_{m_1}, \sigma_{m_1}), ...]\) for the reconstructed params \(X'\) (·, 5). Training set: \(N = 10^6\); batch size = 512.]
Objective: maximise \(L_{ELBO}\), the variational lower bound associated with the data point \(X\) (ELBO: Evidence Lower Bound).
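A hedged, self-contained PyTorch sketch of how \(L_{ELBO}\) is computed for one training batch (my own toy code and toy data, not the authors' implementation): E1 and E2 are diagonal-Gaussian encoders, D is a diagonal-Gaussian decoder, and the ELBO is the reconstruction log-probability minus the KL term between E2 and E1.

```python
import torch
import torch.nn as nn

dim_x, dim_y, dim_z = 5, 256, 8

def gaussian_head(in_dim, out_dim):
    # small network returning the mean and log-variance of a diagonal Gaussian
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 2 * out_dim))

E1 = gaussian_head(dim_y, dim_z)            # prior encoder, sees the strain only
E2 = gaussian_head(dim_x + dim_y, dim_z)    # recognition encoder, sees params and strain
D  = gaussian_head(dim_z + dim_y, dim_x)    # decoder, outputs (mu, sigma) per parameter

def diag_normal(params):
    mu, log_var = params.chunk(2, dim=-1)
    return torch.distributions.Normal(mu, torch.exp(0.5 * log_var))

x = torch.rand(512, dim_x)                  # a batch of params (toy data here)
y = torch.randn(512, dim_y)                 # the corresponding strain (toy data here)

q1 = diag_normal(E1(y))
q2 = diag_normal(E2(torch.cat([x, y], dim=-1)))
z = q2.rsample()                            # reparameterised latent sample from E2
p_x = diag_normal(D(torch.cat([z, y], dim=-1)))

recon = p_x.log_prob(x).sum(-1)                         # log p(x|z, y)
kl = torch.distributions.kl_divergence(q2, q1).sum(-1)  # KL(E2 || E1)
elbo = (recon - kl).mean()                  # maximise this (i.e. minimise -elbo)
```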
CVAE (Test)
[Diagram of the testing flow: only E1 and D are used. The strain \(Y\) (·, 256) goes through E1, samples \(z\) are drawn from \(\mathcal{N}\left(\vec{\mu}, \boldsymbol{\Sigma}^{2}\right)\), and the decoder D, conditioned on \(Y\), outputs \([(\mu_{m_1}, \sigma_{m_1}), ...]\), from which the posterior samples \(X'\) (·, 5) are drawn.]
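A matching test-time sketch (same toy architecture and assumptions as above, untrained here): repeatedly sampling \(z\) from E1 and decoding gives a cloud of posterior samples for the parameters.

```python
import torch
import torch.nn as nn

dim_x, dim_y, dim_z, n_samples = 5, 256, 8, 5000

E1 = nn.Sequential(nn.Linear(dim_y, 64), nn.ReLU(), nn.Linear(64, 2 * dim_z))
D  = nn.Sequential(nn.Linear(dim_z + dim_y, 64), nn.ReLU(), nn.Linear(64, 2 * dim_x))

y_obs = torch.randn(1, dim_y)                        # one observed strain (toy data)
mu_z, log_var_z = E1(y_obs).chunk(2, dim=-1)
z = mu_z + torch.exp(0.5 * log_var_z) * torch.randn(n_samples, dim_z)

out = D(torch.cat([z, y_obs.expand(n_samples, -1)], dim=-1))
mu_x, log_var_x = out.chunk(2, dim=-1)
x_samples = mu_x + torch.exp(0.5 * log_var_x) * torch.randn(n_samples, dim_x)
print(x_samples.shape)                               # (5000, 5): posterior samples
```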
CVAE
Drawbacks:
- KL divergence
- \(L_{ELBO}\)
Mystery:
- How could it still work that well?
(for simple case only)
The final slide. 💪
ELBO: Evidence Lower Bound
CVAE: conditional variational autoencoder