Building a deep neural network

&

How could we use it as a density estimator

Based on :

[1] My tech post, "A neural network wasn't built in a day" (a story about neural networks) (2017)

[2] 1903.01998, 1909.06296, 2002.07656, 2008.03312; PRL 124, 041102 (2020)

Journal Club - Oct 20, 2020

Content

  • What is a Neural Network (NN) anyway?
    • One neuron
    • One layer of neurons
    • Basic types of neural networks in academic papers
  • A concise summary of current GW ML parameter estimation studies
    • MAP, CVAE, Flow
    • GSN (optional)
  • What is a Neural Network (NN) anyway?

Objective:

  • Input a sample to a function (our NN).
  • Then evaluate how "close" it is to the truth (also called the label)
  • E.g.: input \(\{x_1, x_2, \dots, x_n\}\); output: yes or no (\(\{0 \text{ or } 1\}\)), a number, or a sequence
What happens in a neuron? (what a single neuron can do)

\sum_{i} w_{i} x_{i}+b=w_{0} x_{0}+w_{1} x_{1}+\cdots+w_{D-1} x_{D-1}+b
\underbrace{\left[\sum_{i} w_{i} x_{i}+b\right]}_{1 \times 1}=\underbrace{\left[\cdots \quad x_{i} \quad \ldots\right]}_{1 \times D} \cdot \underbrace{\left[\begin{array}{c} \vdots \\ w_{i} \\ \vdots \end{array}\right]}_{D \times 1}+\underbrace{[b]}_{1 \times 1}
  • Initialize a weight vector and a bias scalar randomly in one neuron.
  • Input a sample, then it gives us an output.


"score"

"your performance"

"one judge"


  • Initialize a weight vector and a bias vector randomly in one neuron.
  • Input \(N\) samples at once; it gives us \(N\) outputs.

"a bunch guys' show"

Objective:

  • Input a sample to a function (our NN).
  • Then evaluate how is it "close" to the truth (also called label)

"scores"

"one judge"

\underbrace{\left[\begin{array}{c} \sum_{i} w_{i} x_{i}+b \\ \vdots \end{array}\right]}_{N \times 1}=\underbrace{\left[\begin{array}{ccc} \cdots & x_{i} & \cdots \\ \vdots & & \end{array}\right]}_{N \times D} \cdot \underbrace{\left[\begin{array}{c} \vdots \\ w_{i} \\ \vdots \end{array}\right]}_{D \times 1}+\underbrace{\left[\begin{array}{c} b \\ \vdots \\ \vdots \end{array}\right]}_{N \times 1}
  • Apply a nonlinear activation function to each output, e.g. the ReLU:
f(x)=\max (0, x)
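Feeding \(N\) samples at once and then applying the ReLU is the same matrix product with the samples stacked row-wise; a hedged sketch (the sizes are again illustrative):

```python
import numpy as np

N, D = 3, 4                   # batch size and input dimension (illustrative)
rng = np.random.default_rng(0)

w = rng.normal(size=(D, 1))   # one neuron's weight vector (D x 1)
b = rng.normal()              # bias, broadcast over the N rows

X = rng.normal(size=(N, D))   # N samples stacked row-wise (N x D)

# (N x 1) = (N x D) . (D x 1) + broadcast bias
scores = X @ w + b

# element-wise ReLU activation: f(x) = max(0, x)
activated = np.maximum(0.0, scores)
print(activated.shape)        # (3, 1)
```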


Generalize to one layer of neurons

  • Initialize a weight matrix and a bias vector (one bias per neuron) randomly in one layer.
  • Input one sample; it gives us one output per neuron (10 outputs here).

\underbrace{\left[\begin{array}{cc} \sum_{i} w_{i} x_{i}+b & \cdots \end{array}\right]}_{1 \times 10}=\underbrace{\left[\begin{array}{ccc} \cdots & x_{i} & \cdots \end{array}\right]}_{1 \times D} \cdot \underbrace{\left[\begin{array}{ccc} \vdots & & \\ w_{i} & \cdots & \cdots \\ \vdots & & \end{array}\right]}_{D \times 10}+\underbrace{\left[\begin{array}{cc} b & \cdots \end{array}\right]}_{1 \times 10}

"score"

"10 judges"

"your performance"

With \(N\) samples, the same layer produces an \(N \times 10\) output (the bias row is broadcast across samples):

\underbrace{\left[\begin{array}{cc} \sum_{i} w_{i} x_{i}+b & \cdots \\ \vdots & \end{array}\right]}_{N \times 10}=\underbrace{\left[\begin{array}{ccc} \cdots & x_{i} & \cdots \\ \vdots & & \end{array}\right]}_{N \times D} \cdot \underbrace{\left[\begin{array}{ccc} \vdots & & \\ w_{i} & \cdots & \cdots \\ \vdots & & \end{array}\right]}_{D \times 10}+ \underbrace{\left[\begin{array}{cc} b & \cdots \\ \vdots & \end{array}\right]}_{N \times 10 \text{ (Broadcasting)}}
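A sketch of one layer of 10 neurons acting on \(N\) samples; the only quantity chosen by hand is the number of columns of \(W\) (the "10 judges"):

```python
import numpy as np

N, D, H = 32, 8, 10            # samples, input dimension, neurons in the layer (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(D, H))    # weight matrix (D x 10)
b = rng.normal(size=(1, H))    # bias row vector, broadcast over the N rows

X = rng.normal(size=(N, D))    # N samples (N x D)

# (N x 10) = (N x D) . (D x 10) + (1 x 10)   [broadcasting]
out = np.maximum(0.0, X @ W + b)   # affine map followed by ReLU
print(out.shape)               # (32, 10)
```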

Fully-connected neural layers


In each layer: input → output, where the number of neurons in the layer sets the output width. Draw how one sample of data flows through the network, and how the shape of the data changes.

  • For a fully-connected neural network, the number of weight columns (i.e. the number of neurons) in each layer is the only hyperparameter we need to fix in advance.
  • Only one word for the evaluation (no more slides):
    • Loss/cost/error function
    • Gradient descent / backpropagation algorithm (a minimal training-step sketch follows below)
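A minimal PyTorch sketch of the full picture: stacked fully-connected layers, a loss function, and one gradient-descent/backpropagation step. The layer widths and the MSE loss are illustrative choices, not those of any of the cited papers:

```python
import torch
import torch.nn as nn

# A small fully-connected network: only the layer widths are fixed by hand.
model = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),   # (N, 256) -> (N, 128)
    nn.Linear(128, 64), nn.ReLU(),    # (N, 128) -> (N, 64)
    nn.Linear(64, 10),                # (N, 64)  -> (N, 10)
)

loss_fn = nn.MSELoss()                               # loss/cost/error function
opt = torch.optim.SGD(model.parameters(), lr=1e-3)   # gradient descent

X = torch.randn(512, 256)   # a batch of samples
Y = torch.randn(512, 10)    # their labels ("the truth")

pred = model(X)             # forward pass: the shape changes layer by layer
loss = loss_fn(pred, Y)     # how "close" is the output to the truth?
opt.zero_grad()
loss.backward()             # backward propagation of gradients
opt.step()                  # one gradient-descent update of all weights and biases
```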


Convolution is a specialized kind of linear operation. (A convolutional layer is a special case of a fully-connected layer.)

  • Integral form:
s(t)=\int x(\tau) w(t-\tau) d \tau=:(x * w)(t)
\begin{aligned} \left(a_{1} x_{1}+a_{2} x_{2}\right) * w &=a_{1}\left(x_{1} * w\right)+a_{2}\left(x_{2} * w\right) \quad &(\text{linearity})\\ (x * w)(t-T) &=x(t-T) * w(t) \quad &(\text{time invariance}) \end{aligned}
  • Discrete form, with \(x[n], n=0,1,\ldots,D-1\) and \(w[n], n=0,1,\ldots,M-1\):
s[n]=\sum_{m=\max (0, n-M+1)}^{\min (n, D-1)} x[m] \cdot w[n-m], \quad n=0,1, \ldots, D+M-2
  • Flip-and-slide form written as a matrix product, e.g. \(D=5, M=4\):
\mathbf{s}=\left[\begin{array}{llll} s_{0} & s_{1} & \cdots & s_{7} \end{array}\right]=\left[\begin{array}{llll} x_{0} & x_{1} & \cdots & x_{4} \end{array}\right] \cdot\left[\begin{array}{cccccccc} w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 & 0 & 0 \\ 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 & 0 \\ 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 \\ 0 & 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 \\ 0 & 0 & 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} \end{array}\right]=\mathbf{x} \cdot \mathbf{w}
  • Parameter sharing and sparse interactions are built into the structure of this weight matrix (checked numerically below).
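A numpy check of the matrix form above (\(D=5, M=4\)): building the banded weight matrix row by row reproduces np.convolve, which makes the parameter sharing (every row reuses the same \(w\)) and the sparse interactions (the zeros) explicit:

```python
import numpy as np

D, M = 5, 4
rng = np.random.default_rng(0)
x = rng.normal(size=D)        # x[0..D-1]
w = rng.normal(size=M)        # w[0..M-1]

# Banded (D x (D+M-1)) matrix: row m holds w shifted right by m positions.
W = np.zeros((D, D + M - 1))
for m in range(D):
    W[m, m:m + M] = w         # parameter sharing: the same w in every row

s_matrix = x @ W              # matrix form of the convolution
s_direct = np.convolve(x, w)  # flip-and-slide ("full") convolution

print(np.allclose(s_matrix, s_direct))   # True
```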

Neural networks in academic papers (neural networks in the GW literature)

PRD. 100, 063015 (2019)

Mach. Learn.: Sci. Technol. 1 025014 (2020)

2003.09995

Expert Systems With Applications 151 (2020) 113378

"All of the current GW ML parameter estimation studies are still at the proof-of-principle stage" [2005.03745]

  • A concise summary of current GW ML parameter estimation studies

Real-time regression

  • Huerta's group [PRD, 2018, 97, 044039], [PLB, 2018, 778, 64-70], [1903.01998], [2004.09524]
  • Fan et al. [SCPMA, 62, 969512 (2019)]
  • Carrillo et al. [GRG, 2016, 48, 141], [IJMP, 2017, D27, 1850043]
  • Santos et al. [2003.09995]
  • *Li et al. [2003.13928] (Bayesian neural networks)

Explicit posterior density

  • Chua et al. [PRL, 2020, 124, 041102]
    • produce Bayesian posteriors using neural networks.
  • Gabbard et al. [1909.06296] (CVAE)
    • produce samples from the posterior.
    • 256 Hz, 1 s, 4-D parameter space, simulated noise
    • Strong agreement between Bilby and the CVAE.
  • Yamamoto & Tanaka [2002.12095]​ (CVAE)
    • QNM frequency estimation
  • Green et al. [2002.07656, 2008.03312] (MAF, CVAE+, flow-based)
    • the best results so far

 

Suppose we have a posterior distribution \(p_{true}(x|y)\), where \(y\) is the GW data and \(x\) are the corresponding parameters.

  • The aim is to train a neural network to give an approximation \(p(x|y)\) to \(p_{true}(x|y)\).
  • Take the expectation value (over \(y\)) of the cross-entropy (KL divergence) between the two distributions as the loss function:
L=-\int d x d y \,p_{\mathrm{true}}(x | y) p_{\mathrm{true}}(y) \log p(x | y) = \mathbb{E}_{y\sim p_{true}(y)}\mathbb{E}_{x\sim p_{true}(x|y)} [-\log p(x | y) ]

The true distributions are fixed, but estimating this expectation would require costly sampling from the true posterior \(p_{true}(x|y)\).

Using Bayes' theorem, the same loss can be rewritten as an expectation over the prior and the likelihood:

L=-\int d x d y \,p_{\mathrm{true}}(y | x) p_{\mathrm{true}}(x) \log p(x | y) = \mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)} [-\log p(x | y) ]

\approx-\frac{1}{N} \sum_{i=1}^{N} \log p\left(x^{(i)} | y^{(i)}\right), \quad \text{sampling from the prior and the likelihood: } x^{(i)} \sim p_{\mathrm{true}}(x),\ y^{(i)} \sim p_{\mathrm{true}}\left(y | x^{(i)}\right)
Chua et al. [PRL, 2020, 124, 041102] assume a multivariate normal distribution with weights:

\begin{aligned} p(x | y)=& \frac{\omega}{\sqrt{(2 \pi)^{n}|\operatorname{det} \Sigma(y)|}} \times \\ & \exp \left(-\frac{1}{2} \sum_{i j=1}^{n}\left(x_{i}-\mu_{i}(y)\right) \Sigma_{i j}^{-1}(y)\left(x_{j}-\mu_{j}(y)\right)\right) \end{aligned}

In the diagram, a NN maps the data \(y^{(i)}\) to the distribution parameters \(\mu^{(i)}, \Sigma^{(i)}, \omega\), and the loss \(L\) is evaluated at the corresponding training label \(x^{(i)}\).
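A hedged PyTorch sketch of this recipe, not the actual architecture of Chua et al.: a network maps \(y\) to the mean and (for brevity) a diagonal covariance of \(p(x|y)\), dropping the weight \(\omega\), and the loss is the Monte Carlo estimate above evaluated on \((x^{(i)}, y^{(i)})\) pairs. The sizes and placeholder data are illustrative; in practice \(x \sim p_{true}(x)\) comes from the prior and \(y \sim p_{true}(y|x)\) from a waveform-plus-noise simulator:

```python
import torch
import torch.nn as nn

n_params, n_data = 4, 256      # illustrative sizes (e.g. 4 parameters, 256 strain samples)

class GaussianPosteriorNet(nn.Module):
    """Map data y to (mu(y), log sigma(y)) of a diagonal-Gaussian approximation p(x|y)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_data, 128), nn.ReLU())
        self.mu = nn.Linear(128, n_params)
        self.log_sigma = nn.Linear(128, n_params)

    def forward(self, y):
        h = self.body(y)
        return self.mu(h), self.log_sigma(h)

def loss_fn(net, x, y):
    # L ~ -(1/N) sum_i log p(x^(i) | y^(i)) for a diagonal Gaussian p(x|y)
    mu, log_sigma = net(y)
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    return -dist.log_prob(x).sum(dim=1).mean()

x = torch.rand(512, n_params)   # placeholder for draws from the prior p_true(x)
y = torch.randn(512, n_data)    # placeholder for simulated data y ~ p_true(y|x)
print(loss_fn(GaussianPosteriorNet(), x, y))
```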


Gabbard et al. [1909.06296] (CVAE)

\log p(x | y)=\mathbb{E}_{q(z | x, y)} \log p(x | y)
\equiv\mathcal{L}_{ELBO} + D_{KL}[q(z|x,y)||p(z|x,y)]
L \leq \mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)} [-\mathcal{L}_{ELBO} ]
= \mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)}\{-\mathbb{E}_{z\sim q(z|x,y)}\log[p(x|z,y)] +D_{KL}[q(z|x,y)||p(z|y)] \}
In the diagram, one NN (the recognition encoder) maps \((x^{(i)}, y^{(i)})\) to \((\mu_q^{(i)}, \Sigma_q^{(i)})\), a second NN (the prior encoder) maps \(y^{(i)}\) to \((\mu_p^{(i)}, \Sigma_p^{(i)})\), and a decoder NN maps a latent sample \(z^{(j)}\) together with \(y^{(i)}\) to \((\mu^{(j)}, \Sigma^{(j)})\).


Green et al. [2002.07656] (MAF, CVAE+)

L=\mathbb{E}_{x\sim p_{true}(x)}\mathbb{E}_{y\sim p_{true}(y|x)} \left[-\log \mathcal{N}(0,1)^{n}\left(f_{y}^{-1}(x)\right) -\log \left|\operatorname{det} \frac{\partial\left(f_{y,1}^{-1}, \ldots, f_{y,n}^{-1}\right)}{\partial\left(x_{1}, \ldots, x_{n}\right)}\right|\right]

Here \(f_y\) is an invertible map, conditioned on the data \(y\), from an \(n\)-dimensional standard normal base distribution to the parameter space, so \(p(x|y)\) follows from the change-of-variables formula.
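A hedged sketch of the change-of-variables loss with the simplest possible conditional flow, a single affine map \(f_y(u) = \mu(y) + \sigma(y)\,u\). Green et al. use far more expressive masked autoregressive / coupling flows; every name and size below is illustrative:

```python
import math
import torch
import torch.nn as nn

n_params, n_data = 4, 256

class ConditionalAffineFlow(nn.Module):
    """f_y(u) = mu(y) + sigma(y) * u, with u ~ N(0,1)^n as the base distribution."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_data, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * n_params))

    def inverse_and_logdet(self, x, y):
        mu, log_sigma = self.net(y).chunk(2, dim=1)
        u = (x - mu) * torch.exp(-log_sigma)   # f_y^{-1}(x)
        log_det = -log_sigma.sum(dim=1)        # log |det d f_y^{-1} / d x|
        return u, log_det

def flow_loss(flow, x, y):
    u, log_det = flow.inverse_and_logdet(x, y)
    log_base = -0.5 * (u ** 2 + math.log(2 * math.pi)).sum(dim=1)   # log N(0,1)^n (u)
    return (-log_base - log_det).mean()

x = torch.rand(512, n_params)   # placeholder for x ~ p_true(x)
y = torch.randn(512, n_data)    # placeholder for y ~ p_true(y|x)
print(flow_loss(ConditionalAffineFlow(), x, y))
```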


CVAE (Train)

  • Inputs: strain \(Y\) (·, 256) and source parameters \(X\) (·, 5); training set \(N=10^6\), batch size 512.
  • Encoder E2 (recognition network): \((X, Y) \mapsto \vec{\mu}(X,Y), \boldsymbol{\Sigma}(X,Y)\).
  • Encoder E1 (prior network): \(Y \mapsto \vec{\mu}(Y), \boldsymbol{\Sigma}(Y)\).
  • KL term between the two latent Gaussians: \(KL[\mathcal{N}(\vec{\mu}_X, \boldsymbol{\Sigma}^2_X) \| \mathcal{N}(\vec{\mu}_Y, \boldsymbol{\Sigma}^2_Y)]\).
  • Sample \(z\) from \(\mathcal{N}(\vec{\mu}, \boldsymbol{\Sigma}^{2})\) in the (·, 8) latent space.
  • Decoder D: \((z, Y) \mapsto X'\) (·, 5), output as \([(\mu_{m_1}, \sigma_{m_1}), \ldots]\).

  •  The key is to notice that any distribution in \(d\) dimensions can be generated by taking a set of \(d\) variables that are normally distributed and mapping them through a sufficiently complicated function .

CVAE: conditional variational autoencoder

\log P(X|Y)-\mathcal{D}[Q(z | X,Y) \| P(z | X,Y)]=E_{z \sim Q}[\log P(X | z,Y)]-\mathcal{D}[Q(z | X,Y) \| P(z|Y)]
=:L_{ELBO}(P,Q,Y)

Here \(Q(z|X,Y)\) is the recognition encoder (E2), \(P(z|Y)\) the prior encoder (E1), and \(P(X|z,Y)\) the decoder (D).

Objective: maximise \(L_{ELBO}\) (the variational lower bound associated with the data point \(X\))

ELBO: Evidence Lower Bound
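A hedged, minimal PyTorch sketch of the training loss implied by the diagram: E2 plays the role of \(Q(z|X,Y)\), E1 of \(P(z|Y)\) and D of \(P(X|z,Y)\). The dimensions follow the shapes quoted on the slide (X: 5, Y: 256, latent: 8), but the small fully-connected bodies are illustrative, not the architecture of Gabbard et al.:

```python
import torch
import torch.nn as nn

n_x, n_y, n_z = 5, 256, 8       # parameter, strain, and latent dimensions from the slide

def gaussian_head(n_in, n_out):
    # A tiny network whose output is split into (mu, log sigma) of a diagonal Gaussian.
    return nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, 2 * n_out))

E2 = gaussian_head(n_x + n_y, n_z)   # Q(z | X, Y): recognition encoder
E1 = gaussian_head(n_y, n_z)         # P(z | Y):    prior encoder
D = gaussian_head(n_z + n_y, n_x)    # P(X | z, Y): decoder

def as_normal(params):
    mu, log_sigma = params.chunk(2, dim=1)
    return torch.distributions.Normal(mu, log_sigma.exp())

def neg_elbo(X, Y):
    q = as_normal(E2(torch.cat([X, Y], dim=1)))    # Q(z|X,Y)
    p = as_normal(E1(Y))                           # P(z|Y)
    z = q.rsample()                                # reparameterized sample z ~ Q
    px = as_normal(D(torch.cat([z, Y], dim=1)))    # P(X|z,Y)
    recon = px.log_prob(X).sum(dim=1)              # one-sample estimate of E_{z~Q} log P(X|z,Y)
    kl = torch.distributions.kl_divergence(q, p).sum(dim=1)
    return (kl - recon).mean()                     # minimising this maximises L_ELBO

X = torch.rand(512, n_x)        # parameters (batch size 512, as on the slide)
Y = torch.randn(512, n_y)       # strain (placeholder data)
print(neg_elbo(X, Y))
```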

CVAE (Test)

  • At test time only the prior encoder E1 and the decoder D are used.
  • E1: strain \(Y\) (·, 256) \(\mapsto \vec{\mu}(Y), \boldsymbol{\Sigma}(Y)\).
  • Sample \(z\) from \(\mathcal{N}(\vec{\mu}, \boldsymbol{\Sigma}^{2})\).
  • Decoder D: \((z, Y) \mapsto X'\) (·, 5), i.e. \([(\mu_{m_1}, \sigma_{m_1}), \ldots]\).
  • Repeating the latent sampling and decoding yields samples from the approximate posterior over the parameters (a sampling sketch follows below).
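Continuing the same sketch, test-time sampling would use only E1 and D; the helper below assumes trained networks with the gaussian_head/as_normal conventions from the training sketch above:

```python
import torch

def as_normal(params):
    mu, log_sigma = params.chunk(2, dim=1)
    return torch.distributions.Normal(mu, log_sigma.exp())

def sample_posterior(E1, D, y, num_samples=5000):
    """Draw posterior samples of the parameters for one strain y of shape (1, n_y)."""
    with torch.no_grad():
        p = as_normal(E1(y))                              # prior encoder: P(z|Y)
        z = p.sample((num_samples,)).squeeze(1)           # (num_samples, n_z)
        y_rep = y.expand(num_samples, -1)                 # repeat the conditioning strain
        px = as_normal(D(torch.cat([z, y_rep], dim=1)))   # decoder: P(X|z,Y)
        return px.sample()                                # (num_samples, n_x) parameter draws
```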

CVAE

 

 

Drawbacks:

  • KL divergence
  • \(L_{ELBO}\)

Mystery:

  • How could it still work that well?
    (for simple cases only)

The final slide. 💪


Building a deep neural network and How could we use it as a density estimator

By He Wang


Abstract: First, I will talk about some basic concepts of deep neural networks, which I hope will help clear up misunderstandings and rumors about how a neural network works. Then, based on these concepts, I will briefly review the current GW ML parameter estimation studies (1903.01998, 1909.06296, PRL 124, 041102 (2020), 2002.07656, 2008.03312; selected), especially how they build up a neural network to estimate the posterior distribution. The respective drawbacks and mysteries of these works are also mentioned.
