GW Data Analysis & Deep Learning: Advanced

2023 Summer School on GW @TianQin

He Wang (王赫)

2023/08/22

ICTP-AP, UCAS

Deep Generative Model

GAN

Transformer

Flow

# GWDA: GAN

Generative Adversarial Networks

  • McGinn, J, C Messenger, M J Williams, and I S Heng. “Generalised Gravitational Wave Burst Generation with Generative Adversarial Networks.” Classical and Quantum Gravity 38, no. 15 (June 30, 2021): 155005. 

  • Lopez, Melissa, Vincent Boudart, Kerwin Buijsman, Amit Reza, and Sarah Caudill. “Simulating Transient Noise Bursts in LIGO with Generative Adversarial Networks.” arXiv:2203.06494, March 12, 2022.

  • Lopez, Melissa, Vincent Boudart, Stefano Schmidt, and Sarah Caudill. “Simulating Transient Noise Bursts in LIGO with Gengli.” arXiv:2205.09204, May 18, 2022.

  • Yan, Jianqi, Alex P Leung, and C Y Hui. “On Improving the Performance of Glitch Classification for Gravitational Wave Detection by Using Generative Adversarial Networks.” Monthly Notices of the Royal Astronomical Society, July 27, 2022, stac1996.

  • Dooney, Tom, Stefano Bromuri, and Lyana Curier. “DVGAN: Stabilize Wasserstein GAN Training for Time-Domain Gravitational Wave Physics.” arXiv:2209.13592, September 29, 2022.

  • Powell, Jade, Ling Sun, Katinka Gereb, Paul D Lasky, and Markus Dollmann. “Generating Transient Noise Artefacts in Gravitational-Wave Detector Data with Generative Adversarial Networks.” Classical and Quantum Gravity 40, no. 3 (January 13, 2023): 035006.

  • Jadhav, Shreejit, Mihir Shrivastava, and Sanjit Mitra. “Towards a Robust and Reliable Deep Learning Approach for Detection of Compact Binary Mergers in Gravitational Wave Data.” arXiv:2306.11797, June 20, 2023.


# GWDA: GAN

Generative Adversarial Networks

  • Ian Goodfellow first proposed the concept of the GAN at NIPS 2014.

  • Model

  • Objective function:

\min _G \max _D V(G, D)=\mathbb{E}_{x \sim p_{\text {data }}}[\log D(x)]+\mathbb{E}_{z \sim p_z}[\log (1-D(G(z)))]
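A minimal PyTorch sketch of this objective (toy network sizes and data, not from the slides): in practice the min-max value function is implemented as two coupled losses.

```python
import torch
import torch.nn as nn

# Illustrative toy networks; sizes are placeholders.
latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

x_real = torch.randn(32, data_dim)    # stand-in for samples from p_data
z = torch.randn(32, latent_dim)       # z ~ p_z
x_fake = G(z)

# Discriminator side of V(G, D): maximize E[log D(x)] + E[log(1 - D(G(z)))]
loss_D = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake.detach())).mean())

# Generator side: minimize E[log(1 - D(G(z)))]  (the "non-saturating" variant
# maximizes log D(G(z)) instead, which trains better in practice)
loss_G = torch.log(1 - D(x_fake)).mean()
```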

# GWDA: GAN

Generative Adversarial Networks

Image colorization

Image super-resolution

Face generation

Cartoon image generation

# GWDA: GAN

Generative Adversarial Networks

  • Generative models

# GWDA: GAN

Generative Adversarial Networks

  • GAN training logic

# GWDA: GAN

Generative Adversarial Networks

  • GAN training logic, Step 1: update the discriminator with the generator fixed

# GWDA: GAN

Generative Adversarial Networks

  • GAN training logic, Step 2: update the generator with the discriminator fixed

# GWDA: GAN

Generative Adversarial Networks

  • GAN training logic: iterate Steps 1 and 2 alternately (sketched below)
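Putting Steps 1 and 2 together, a minimal alternating training loop might look like this (toy networks and data assumed, not from the slides):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 64
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    x_real = torch.randn(batch, data_dim) + 3.0   # toy "real" data
    x_fake = G(torch.randn(batch, latent_dim))

    # Step 1: update D with G frozen -- push D(x_real) -> 1, D(x_fake) -> 0.
    loss_D = bce(D(x_real), torch.ones(batch, 1)) + \
             bce(D(x_fake.detach()), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: update G with D frozen -- push D(G(z)) -> 1 (non-saturating loss).
    loss_G = bce(D(x_fake), torch.ones(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```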

# GWDA: GAN

Generative Adversarial Networks

  • Conditional GAN (CGAN)

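A minimal sketch of the CGAN idea (toy sizes assumed; in the GW papers above the condition can be, e.g., signal parameters or a glitch class): both G and D receive the condition as an extra input.

```python
import torch
import torch.nn as nn

latent_dim, cond_dim, data_dim = 16, 4, 2
# The condition is concatenated to the inputs of both networks.
G = nn.Sequential(nn.Linear(latent_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

z = torch.randn(32, latent_dim)
c = torch.randn(32, cond_dim)                 # condition, e.g. source parameters
x_fake = G(torch.cat([z, c], dim=-1))         # G(z | c)
score = D(torch.cat([x_fake, c], dim=-1))     # D(x | c)
```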

# GWDA: GAN

Generative Adversarial Networks

  • Generative Adversarial Network (GAN)

# GWDA: GAN

Generative Adversarial Networks

  • Generator

# GWDA: GAN

Generative Adversarial Networks

  • Discriminator

# GWDA: GAN

Generative Adversarial Networks

  • GAN

# GWDA: GAN

Generative Adversarial Networks

  • Adversarial training

# GWDA: Flow

Flow Model

  • Green, Stephen Roland, and Jonathan Gair. “Complete Parameter Inference for GW150914 Using Deep Learning.” Machine Learning: Science and Technology 2, no. 3 (June 16, 2021): 03LT01.

  • Dax, Maximilian, Stephen R. Green, Jonathan Gair, Jakob H. Macke, Alessandra Buonanno, and Bernhard Schölkopf. “Real-Time Gravitational Wave Science with Neural Posterior Estimation.” Physical Review Letters 127, no. 24 (December 2021): 241103.

  • Shen, Hongyu, E A Huerta, Eamonn O’Shea, Prayush Kumar, and Zhizhen Zhao. “Statistically-Informed Deep Learning for Gravitational Wave Parameter Estimation.” Machine Learning: Science and Technology 3, no. 1 (November 30, 2021): 015007.

  • Khan, Asad, E.A. Huerta, and Prayush Kumar. “AI and Extreme Scale Computing to Learn and Infer the Physics of Higher Order Gravitational Wave Modes of Quasi-Circular, Spinning, Non-Precessing Black Hole Mergers.” Physics Letters B 835 (December 10, 2022): 137505.

  • Williams, Michael J., John Veitch, and Chris Messenger. “Nested Sampling with Normalizing Flows for Gravitational-Wave Inference.” Physical Review D 103, no. 10 (May 2021): 103006. 

  • Cheung, Damon H. T., Kaze W. K. Wong, Otto A. Hannuksela, Tjonnie G. F. Li, and Shirley Ho. “Testing the Robustness of Simulation-Based Gravitational-Wave Population Inference.” arXiv:2112.06707, December 2021.

  • Karamanis, Minas, Florian Beutler, John A. Peacock, David Nabergoj, and Uros Seljak. “Accelerating Astronomical and Cosmological Inference with Preconditioned Monte Carlo.” arXiv:2207.05652, July 12, 2022.

  • Chatterjee, Chayan, and Linqing Wen. “Pre-Merger Sky Localization of Gravitational Waves from Binary Neutron Star Mergers Using Deep Learning.” arXiv:2301.03558, December 30, 2022.

  • Langendorff, Jurriaan, Alex Kolmus, Justin Janquart, and Chris Van Den Broeck. “Normalizing Flows as an Avenue to Studying Overlapping Gravitational Wave Signals.” Physical Review Letters 130, no. 17 (April 2023): 171402.

  • ...


# GWDA: Flow

Flow Model

Assuming \(x=f(z)\), with \(z, x \in \mathbb{R}^p\), \(z \sim P_z(z)\), \(x \sim P_x(x)\), and \(f\) continuous and invertible:

\begin{aligned} & \because \int_z P_z(z) d z=1=\int_x P_x(x) d x \\ & \therefore\left|P_z(z) \cdot d z\right|=\left|P_x(x) d x\right| \\ & \therefore P_x(x)=\left|\frac{d z}{d x}\right| \cdot P_z(z) \\ & \because x=f(z), f \text { is invertible} \\ & \therefore z=f^{-1}(x) \\ & \therefore P_x(x)=\left|\frac{\partial f^{-1}(x)}{\partial x}\right| \cdot P_z(z) \end{aligned}

[Figure: an invertible map \(T\) transforms the base density \(p_{\mathrm{z}}(\mathbf{z})\) into the target density \(p_{\mathrm{y}}(\mathbf{y})\); \(T^{-1}\) maps back.]
The main idea of flow-based modeling is to express \(\mathbf{y}\in\mathbb{R}^D\) as a transformation \(T\) of a real vector \(\mathbf{z}\in\mathbb{R}^D\) sampled from \(p_{\mathrm{z}}(\mathbf{z})\):

\mathbf{y}=T(\mathbf{z}) \quad \text { where } \quad \mathbf{z} \sim p_{\mathrm{z}}(\mathbf{z})

Note: The invertible and differentiable transformation \(T\) and the base distribution \(p_{\mathrm{z}}(\mathbf{z})\) can have parameters \(\{\boldsymbol{\phi}, \boldsymbol{\psi}\}\) of their own, i.e. \(T_\boldsymbol{\phi} \) and \(p_{\mathrm{z},\boldsymbol{\psi}}(\mathbf{z})\).

Change of Variables:

p_{\mathrm{y}}(\mathbf{y})=p_{\mathrm{z}}\left(T^{-1}(\mathbf{y})\right)\left|\operatorname{det} J_{T^{-1}}(\mathbf{y})\right|

Equivalently,

p_{\mathrm{y}}(\mathbf{y})=p_{\mathrm{z}}(\mathbf{z})\left|\operatorname{det} J_{T}(\mathbf{z})\right|^{-1} \quad \text { where } \quad \mathbf{z}=T^{-1}(\mathbf{y}).

The Jacobian \(J_{T}(\mathbf{z})\) is the \(D \times D\) matrix of all partial derivatives of \(T\), given by:

J_{T}(\mathbf{z})=\left[\begin{array}{ccc} \frac{\partial T_{1}}{\partial \mathrm{z}_{1}} & \cdots & \frac{\partial T_{1}}{\partial \mathrm{z}_{D}} \\ \vdots & \ddots & \vdots \\ \frac{\partial T_{D}}{\partial \mathrm{z}_{1}} & \cdots & \frac{\partial T_{D}}{\partial \mathrm{z}_{D}} \end{array}\right]

(Based on arXiv:1912.02762)
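As a quick numerical sketch of the change-of-variables formula (a hypothetical 2-D affine flow, not from the slides), assuming PyTorch:

```python
import torch

# Base density p_z = N(0, I) in D = 2 dimensions (toy example).
D = 2
base = torch.distributions.MultivariateNormal(torch.zeros(D), torch.eye(D))

# An invertible affine transformation T(z) = A z + b, so J_T = A.
A = torch.tensor([[2.0, 0.3], [0.0, 0.5]])
b = torch.tensor([1.0, -1.0])

def log_prob_y(y):
    """log p_y(y) = log p_z(T^{-1}(y)) + log |det J_{T^{-1}}(y)|."""
    z = torch.linalg.solve(A, (y - b).unsqueeze(-1)).squeeze(-1)  # z = T^{-1}(y)
    return base.log_prob(z) - torch.logdet(A)  # |det J_{T^{-1}}| = 1 / |det J_T|

y = A @ torch.randn(D) + b                     # a sample y = T(z)
print(log_prob_y(y))
```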

# GWDA: Flow

Flow Model

(Based on arXiv:1912.02762)

  • Data: target data \(\mathbf{y}\in\mathbb{R}^{15}\) with condition data \(\mathbf{x}\).
  • Task:
    • Fitting a flow-based model \(p_{\mathrm{y}}(\mathbf{y} ; \boldsymbol{\theta})\) to a target distribution \(p_{\mathrm{y}}^{*}(\mathbf{y})\)
    • by minimizing KL divergence with respect to the model’s parameters \(\boldsymbol{\theta}=\{\boldsymbol{\phi}, \boldsymbol{\psi}\}\),
    • where \(\boldsymbol{\phi}\) are the parameters of \(T\) and \(\boldsymbol{\psi}\) are the parameters of \(p_{\mathrm{z}}(\mathbf{z})=\mathcal{N}(0,\mathbb{I})\).
  • Loss function:

\begin{aligned} \mathcal{L}(\boldsymbol{\theta}) &=D_{\mathrm{KL}}\left[p_{\mathrm{y}}^{*}(\mathbf{y}) \| p_{\mathrm{y}}(\mathbf{y} ; \boldsymbol{\theta})\right] \\ &=-\mathbb{E}_{p_{\mathbf{y}}^{*}(\mathbf{y})}\left[\log p_{\mathbf{y}}(\mathbf{y} ; \boldsymbol{\theta})\right]+\text { const. } \\ &=-\mathbb{E}_{p_{\mathbf{y}}^{*}(\mathbf{y})}\left[\log p_{\mathrm{z}}\left(T^{-1}(\mathbf{y} ; \boldsymbol{\phi}) ; \boldsymbol{\psi}\right)+\log \left|\operatorname{det} J_{T^{-1}}(\mathbf{y} ; \boldsymbol{\phi})\right|\right]+\mathrm{const} . \end{aligned}

    where \(\text{const.}=\mathbb{E}_{p_{\mathbf{y}}^{*}(\mathbf{y})}\left[\log p_{\mathbf{y}}^{*}(\mathbf{y})\right]\) does not depend on \(\boldsymbol{\theta}\).

  • Assuming we have a set of samples \(\left\{\mathbf{y}_{n}\right\}_{n=1}^{N}\sim p_{\mathrm{y}}^{*}(\mathbf{y})\),

\mathcal{L}(\boldsymbol{\theta}) \approx-\frac{1}{N} \sum_{n=1}^{N}\left[\log p_{\mathrm{z}}\left(T^{-1}\left(\mathbf{y}_{n} ; \boldsymbol{\phi}\right) ; \boldsymbol{\psi}\right)+\log \left|\operatorname{det} J_{T^{-1}}\left(\mathbf{y}_{n} ; \boldsymbol{\phi}\right)\right|\right]+\mathrm{const.}

    Minimizing the above Monte Carlo approximation of the KL divergence is equivalent to fitting the flow-based model to the samples \(\left\{\mathbf{y}_{n}\right\}_{n=1}^{N}\) by maximum likelihood estimation.
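A minimal sketch of this maximum-likelihood training using a single affine coupling layer in plain PyTorch (toy target distribution and sizes assumed; DINGO-style analyses stack many such layers, e.g. the RQ-NSF below):

```python
import torch
import torch.nn as nn

D = 2            # toy dimension; 15 for the BBH parameter space above
half = D // 2
# s, t networks of one affine coupling layer (real flows stack many layers,
# permuting dimensions between them so every component gets transformed).
net = nn.Sequential(nn.Linear(half, 32), nn.ReLU(), nn.Linear(32, 2 * half))

def inverse_and_logdet(y):
    """z = T^{-1}(y) and log|det J_{T^{-1}}(y)| for y2 = z2 * exp(s(y1)) + t(y1)."""
    y1, y2 = y[:, :half], y[:, half:]
    s, t = net(y1).chunk(2, dim=-1)
    z2 = (y2 - t) * torch.exp(-s)
    return torch.cat([y1, z2], dim=-1), -s.sum(dim=-1)

base = torch.distributions.Normal(0.0, 1.0)      # p_z = N(0, I)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(500):
    y = torch.randn(128, D) * torch.tensor([0.5, 2.0]) + 1.0  # samples from p*_y
    z, log_det = inverse_and_logdet(y)
    # L(theta) ~ -(1/N) sum_n [log p_z(T^{-1}(y_n)) + log|det J_{T^{-1}}(y_n)|]
    loss = -(base.log_prob(z).sum(dim=-1) + log_det).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```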

# GWDA: Flow

Flow Model

Rational Quadratic Neural Spline Flows (RQ-NSF)

# GWDA: Flow

Green, Stephen Roland, and Jonathan Gair. “Complete Parameter Inference for GW150914 Using Deep Learning.” Machine Learning: Science and Technology 2, no. 3 (June 16, 2021): 03LT01. 

[Figure: the flow maps the base density \(p_{\mathrm{z}}(\mathbf{z})\) to the target density via \(T\) and back via \(T^{-1}\).]

# GWDA: Flow

Flow Model

  • The advance of normalizing-flow models in GW inference:

    • 2002.07656: 5D toy model [1] (PRD)

    • 2008.03312: 15D binary black hole inference [1] (MLST)

    • 2106.12594: Amortized inference and group-equivariant neural posterior estimation [2] (PRL)

    • 2111.13139: Group-equivariant neural posterior estimation [2]

    • 2210.05686: Importance sampling [2]

    • 2211.08801: Noise forecasting [2]

 

  1. https://github.com/stephengreen/lfi-gw  (2020)

  2. https://github.com/dingo-gw/dingo   (2023.03)


# GWDA: Flow

Neural Posterior Estimation with guaranteed exact coverage: the ringdown of GW150914

Normalizing Flows as an Avenue to Studying Overlapping Gravitational Wave Signals

LIGO-P2300197

Rapid neutron star equation of state inference with Normalising Flows

# GWDA: Transformer

Transformer

  • Khan, Asad, E. A. Huerta, and Huihuo Zheng. “Interpretable AI Forecasting for Numerical Relativity Waveforms of Quasicircular, Spinning, Nonprecessing Binary Black Hole Mergers.” Physical Review D 105, no. 2 (January 2022): 024024. 

  • Jiang, Letian, and Yuan Luo. “Convolutional Transformer for Fast and Accurate Gravitational Wave Detection.” In 2022 26th International Conference on Pattern Recognition (ICPR), 46–53, 2022.

  • Ren, Zhixiang, He Wang, Yue Zhou, Zong-Kuan Guo, and Zhoujian Cao. “Intelligent Noise Suppression for Gravitational Wave Observational Data.” arXiv:2212.14283, December 29, 2022.


# GWDA: Transformer

Transformer

  • Background

# GWDA: Transformer

Transformer

Vanilla Transformer: attention

The attention mechanism performs a biased selection over values (sensory inputs) via attention pooling, which involves queries (volitional cues) and keys (nonvolitional cues).

# GWDA: Transformer

Transformer

Vanilla Transformer: attention

For simplicity, consider the following regression problem: given a dataset of input-output pairs \(\left\{\left(x_{1}, y_{1}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}\), how do we learn \(f\) to predict the output \(\hat{y}=f(x)\) for any new input \(x\)? As a toy example, let the data be generated by

y_{i}=2 \sin \left(x_{i}\right)+x_{i}^{0.8}+\epsilon

Case 1: Average Pooling

f(x)=\frac{1}{n} \sum_{i=1}^{n} y_{i}

Case 2: Nonparametric Attention Pooling

\begin{aligned} f(x) &=\sum_{i=1}^{n} \alpha\left(x, x_{i}\right) y_{i} \\ &=\sum_{i=1}^{n} \operatorname{softmax}\left(-\frac{1}{2}\left(x-x_{i}\right)^{2}\right) y_{i} \end{aligned}

Case 3: Parametric Attention Pooling

\begin{aligned} f(x) &=\sum_{i=1}^{n} \alpha_\omega\left(x, x_{i}\right) y_{i} \\ &=\sum_{i=1}^{n} \operatorname{softmax}\left(-\frac{1}{2}\left(x-x_{i}\right)^{2}\omega^2\right) y_{i} \end{aligned}

In general, the attention pooling formula is

f(x)=\sum_{i=1}^n \alpha\left(x, x_i\right) y_i
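The three cases can be written in a few lines of PyTorch (a sketch of this textbook example; the toy data follow the generating function above):

```python
import torch

n = 50
x_train, _ = torch.sort(torch.rand(n) * 5)
y_train = 2 * torch.sin(x_train) + x_train**0.8 + 0.5 * torch.randn(n)
x_test = torch.linspace(0, 5, 100)

# Case 1: average pooling -- every query gets the same prediction.
y_avg = y_train.mean().repeat(100)

# Case 2: nonparametric attention pooling (Gaussian-kernel weights).
attn = torch.softmax(-0.5 * (x_test[:, None] - x_train[None, :])**2, dim=1)
y_nw = attn @ y_train

# Case 3: parametric pooling -- a learnable bandwidth w sharpens the kernel.
w = torch.nn.Parameter(torch.ones(1))
attn_p = torch.softmax(-0.5 * ((x_test[:, None] - x_train[None, :]) * w)**2, dim=1)
y_param = attn_p @ y_train   # train w by minimizing MSE on (x_train, y_train)
```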

# GWDA: Transformer

Transformer

Vanilla Transformer: attention

With \(a\) denoting the attention scoring function, the output of attention pooling is computed as a weighted sum of the values. Because the attention weights form a probability distribution, this weighted sum is essentially a weighted average.

  • Mathematically, suppose we have a query \(q \in \mathbb{R}^{q}\) and \(m\) key-value pairs \(\left(k_{1}, v_{1}\right), \ldots,\left(k_{m}, v_{m}\right)\), where \(k_{i} \in \mathbb{R}^{k}\) and \(v_{i} \in \mathbb{R}^{v}\). Attention pooling \(f\) is instantiated as a weighted sum of the values:

f\left(q,\left(k_{1}, v_{1}\right), \ldots,\left(k_{m}, v_{m}\right)\right)=\sum_{i=1}^{m} \alpha\left(q, k_{i}\right) v_{i} \in \mathbb{R}^{v}

    where the attention weight (a scalar) for the query \(q\) and key \(k_{i}\) is computed by a softmax over the attention scoring function \(a\), which maps two vectors to a scalar:

\alpha\left(q, k_{i}\right)=\operatorname{softmax}\left(a\left(q, k_{i}\right)\right)=\frac{\exp \left(a\left(q, k_{i}\right)\right)}{\sum_{j=1}^{m} \exp \left(a\left(q, k_{j}\right)\right)} \in \mathbb{R}

  • Scaled Dot-Product Attention

    For queries \(Q \in \mathbb{R}^{n \times d}\), keys \(K \in \mathbb{R}^{m \times d}\), and values \(V \in \mathbb{R}^{m \times v}\):

a(Q, K)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) \in \mathbb{R}^{n \times m}

f\left(Q, K, V\right)= a(Q, K) V \in \mathbb{R}^{n \times v}
Shapes: Q/K/V ~ [batch_size, len_tokens, dim_features]. For example, with Q: [5, 15, 10], K: [5, 13, 10], and V: [5, 13, 11], the score matrix a(Q, K) is [5, 15, 13] and the output f(Q, K, V) is [5, 15, 11].
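A direct PyTorch sketch using the shapes in the diagram above (batch 5, n = 15 queries, m = 13 key-value pairs):

```python
import math
import torch

Q = torch.randn(5, 15, 10)   # [batch_size, n queries, d]
K = torch.randn(5, 13, 10)   # [batch_size, m keys, d]
V = torch.randn(5, 13, 11)   # [batch_size, m values, v]

scores = Q @ K.transpose(1, 2) / math.sqrt(Q.shape[-1])  # a(Q, K): [5, 15, 13]
attn = torch.softmax(scores, dim=-1)                     # each row sums to 1
out = attn @ V                                           # f(Q, K, V): [5, 15, 11]
print(out.shape)  # torch.Size([5, 15, 11])
```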

# GWDA: Transformer

Transformer

Vanilla Transformer: attention

Multi-Head Attention

Rather than applying a single attention function, the Transformer uses multi-head attention: each head computes attention independently, and the outputs of all heads are concatenated to form the final output of the multi-head attention module:

\begin{aligned} \text { MultiHeadAttn }(Q, K, V) &=\text { Concat }\left(\text { head }_{1}, \cdots, \text { head }_{H}\right) \mathbf{W}^{O} \\ \text { where head }_{i} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}

Multi-head attention combines knowledge of the same attention pooling via different representation subspaces of queries, keys, and values.

Shapes: the inputs Q, K, V ~ [batch_size, len_tokens, dim_features] are split across heads into [batch_size * num_heads, len_tokens, dim_features / num_heads]; after concatenation the output is again [batch_size, len_tokens, dim_features].
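A sketch of the head-splitting behind those shapes (sizes assumed for illustration; in practice one would use nn.MultiheadAttention directly):

```python
import math
import torch
import torch.nn as nn

batch, length, dim, heads = 5, 15, 12, 3
W_q, W_k, W_v, W_o = (nn.Linear(dim, dim) for _ in range(4))
X = torch.randn(batch, length, dim)   # here Q = K = V = X (self-attention case)

def split_heads(t):  # [batch, len, dim] -> [batch*heads, len, dim/heads]
    return t.view(batch, length, heads, dim // heads).transpose(1, 2) \
            .reshape(batch * heads, length, dim // heads)

q, k, v = split_heads(W_q(X)), split_heads(W_k(X)), split_heads(W_v(X))
attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(dim // heads), dim=-1)
heads_out = attn @ v                          # [batch*heads, len, dim/heads]
out = W_o(heads_out.reshape(batch, heads, length, dim // heads)
                   .transpose(1, 2).reshape(batch, length, dim))
print(out.shape)  # torch.Size([5, 15, 12]) -- shape preserved
```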

Self-Attention

In the Transformer encoder, we set \(Q=K=V\):

\text { Attention }(\mathrm{Q}, \mathrm{K}, \mathrm{V})=\operatorname{softmax}\left(\frac{\mathrm{QK}^{\top}}{\sqrt{D_{k}}}\right) \mathrm{V}=\mathrm{AV}

# GWDA: Transformer

Transformer

Vanilla Transformer: Embedding & Encoding

Embedding: token ids [batch_size, len_tokens] are mapped to dense vectors [batch_size, len_tokens, dim_features].

Positional Encoding: a position-dependent signal of the same shape [batch_size, len_tokens, dim_features] is added, so the model can use token order.
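A sketch of the standard sinusoidal positional encoding from Vaswani et al. (sizes are illustrative):

```python
import torch

def positional_encoding(len_tokens, dim_features):
    """PE[p, 2i] = sin(p / 10000^(2i/d)), PE[p, 2i+1] = cos(p / 10000^(2i/d))."""
    pos = torch.arange(len_tokens, dtype=torch.float32)[:, None]
    i = torch.arange(0, dim_features, 2, dtype=torch.float32)[None, :]
    angles = pos / torch.pow(10000.0, i / dim_features)
    pe = torch.zeros(len_tokens, dim_features)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

tokens = torch.randint(0, 1000, (8, 20))       # [batch_size, len_tokens]
emb = torch.nn.Embedding(1000, 64)(tokens)     # -> [batch, len, dim_features]
x = emb + positional_encoding(20, 64)          # broadcast over the batch
```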

# GWDA: Transformer

Transformer

Vanilla Transformer: Modularity

Feed-Forward + Add & Norm

Each encoder sublayer (multi-head attention, which takes the K, V, Q inputs, and the position-wise feed-forward network) is wrapped with a residual connection followed by layer normalization (Add & Norm), preserving the shape [batch_size, len_tokens, dim_features] throughout.
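A sketch of one encoder layer showing the Add & Norm wrapping (built on PyTorch's nn.MultiheadAttention; hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

dim, heads, dim_ff = 64, 4, 256
attn = nn.MultiheadAttention(dim, heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(dim, dim_ff), nn.ReLU(), nn.Linear(dim_ff, dim))
norm1, norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

x = torch.randn(8, 20, dim)             # [batch_size, len_tokens, dim_features]

# Sublayer 1: self-attention (Q = K = V = x) + residual Add & Norm.
attn_out, _ = attn(x, x, x)
x = norm1(x + attn_out)

# Sublayer 2: position-wise feed-forward + residual Add & Norm.
x = norm2(x + ffn(x))                   # shape preserved: [8, 20, 64]
```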

# GWDA: Transformer

Transformer

Vanilla Transformer: train for translation task

Train Stage

The encoder maps source tokens [batch_size, len_tokens1] to features [batch_size, len_tokens1, dim_features1]; these serve as the K and V inputs to the decoder's cross-attention, while the decoder's self-attention over the target tokens [batch_size, len_tokens2] provides Q ([batch_size, len_tokens2, dim_features2]). The decoder output is projected to logits of shape [batch_size, len_tokens2, vocab_size].
# GWDA: Transformer

Transformer

Vanilla Transformer: test for translation task

Test Stage

At test time the batch size is 1 and decoding is autoregressive: the encoder processes the source tokens [1, len_tokens1] once to produce [1, len_tokens1, dim_features1] (the K and V inputs), and the decoder is run repeatedly, each step producing logits [1, 1, vocab_size] for the next token, so the generated token sequence grows as [1, 1], [1, 2], ...
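A sketch of that greedy autoregressive loop (hypothetical token ids and sizes; a real decoder would also apply a causal mask during training and stop at an end-of-sequence token):

```python
import torch
import torch.nn as nn

vocab_size, dim, max_len, BOS = 1000, 64, 10, 1
model = nn.Transformer(d_model=dim, batch_first=True)
embed = nn.Embedding(vocab_size, dim)
to_logits = nn.Linear(dim, vocab_size)

src = embed(torch.randint(0, vocab_size, (1, 15)))   # encoded source sentence
ys = torch.tensor([[BOS]])                           # [1, 1], grows each step

for _ in range(max_len):
    out = model(src, embed(ys))                      # decoder sees tokens so far
    next_id = to_logits(out[:, -1]).argmax(-1, keepdim=True)  # greedy pick
    ys = torch.cat([ys, next_id], dim=1)             # [1, 1] -> [1, 2] -> ...
```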

# GWDA: Insights

Insights

  • AI serves as a valuable tool in gravitational wave astronomy:
    (Big data & Computational Complexity)
    • Enhancing data analysis,
    • Noise reduction, and
    • Parameter estimation.
    • It streamlines the research process and allows scientists to focus on the most relevant information.
  • Beyond a Tool: AI transcends its role as a mere tool by enabling scientific discovery in GW astronomy.
    • Characterization of GW signals involves
      • Exploring beyond the scope of GR,
      • Enabling real-time inference
    • Tests of GR
      • Tighter parameter constraints (smaller variances)
      • Guaranteed exact coverage
    • "Curse of Dimensionality" in inference
      • Overlapping signal
      • Hierarchical Bayesian Analysis
    • ...
[Figure: results for GW170817, GW190412, GW190814; PRD 101, 10 (2020) 104003.]

# GWDA: Insights
arXiv:2305.18528 (ICML 2023)

# GWDA: Insights
Combining inferences from multiple sources

# GWDA: Insights

AI for Science: GW Astronomy

# GWDA: AI4Sci
  • Exploring the importance of understanding how AI models make predictions in scientific research.
    • The critical role of generative models
    • Quantifying uncertainty: a key aspect
    • Fostering controllable and reliable models

[Figure: Bayes vs. AI, with text-to-image examples. Credit: 李宏毅]


# GWDA: ML

GWDA: Machine Learning

Gravitational Wave Data Analysis and Machine Learning

@TianQin GWML Tutorial 3

By He Wang

He Wang. (2023). Can you find the GW signals?. Kaggle. https://kaggle.com/competitions/can-you-find-the-gw-signals. (Summer School on Gravitational Waves) [Repo: https://github.com/iphysresearch/2023gwml4tianqin]