Deep neural networks &
Gravitational-wave signal recognization

He Wang (王赫)

Institute of Theoretical Physics, CAS

Beijing Normal University

on behalf of the KAGRA collaboration

hewang@mail.bnu.edu.cn / hewang@itp.ac.cn

ML session of a semester @ TsingHua Univ. 2022.07.04

Content

  • Introduction
  • Challenge & Opportunity
  • Attempts on stimulated/real data
  • Matched-filtering Convolutional Neural Network (MFCNN)
  • Configuration & Search Strategy
  • Results: GW events in GWTC-1 & GWTC-2
  • Trends & Outlook

Gravitational-wave Astronomy

  • Advanced LIGO observing since 2015 (two detectors in US), joined by Virgo (Italy) in 2017 and KAGRA (Japan) in Japan in 2020.
  • Best GW hunter in the range 10Hz-10kHz.
  • 100 years from Einstein's prediction to the first LIGO-Virgo detection (GW150914) announcement in 2016.
  • General relativity: "Spacetime tells matter how to move, and matter tells spacetime how to curve."
  • Time-varying quadrupolar mass distributions lead to propagating ripples in spacetime: GWs.

LIGO Hanford (H1)

KAGRA

LIGO Livingston (L1)

Noise power spectral density (one-sided)

  • Matched-filtering Technique (prior-free Neyman-Pearson Framework):
    • It is an optimal linear filter for weak signals buried in the Gaussian and stationary noise \(n(t)\).
    • Works by correlating a known signal model \(h(t)\) (template) with the data.
    • Starting with data: \(d(t) = h(t) + n(t)\).
    • Defining the matched-filtering SNR \(\rho(t)\):

 

 

            where

  • The 4-D search parameter space in the first observation run covered by a template bank to circular binaries for which the spin of the systems is aligned (or antialigned) with the orbital angular momentum of the binary.
  • ~250,000 template waveforms are used for O1 (computationally expansive).

The template that best matches GW150914 event

\langle h|h \rangle = 4\int^\infty_0\frac{\tilde{h}(f)\tilde{h}^*(f)}{S_n(f)}df
\langle d|h \rangle (t) = 4\int^\infty_0\frac{\tilde{d}(f)\tilde{h}^*(f)}{S_n(f)}e^{2\pi ift}df
\rho^2(t)\equiv\frac{1}{\langle h|h \rangle}|\langle d|h \rangle(t)|^2

Challenges in GW Data Analysis

From: LIGO-G2102497

Binary detection rates

  • O3 ~ 1/5 days
  • O4 ~ 1/2 days
  • O5 ~ 3 / day

Simulated Event Stream for a one year duration O4 run

Challenges in GW Data Analysis

  • The current Matched filtering technique is computationally expansive.
  • More GW events are coming...
 
 
  • It's necessary to speed up...
  • It has the potential to discover more...

Opportunities in GW Data Analysis

 

Proof-of-principle studies

Production search studies

Milestones

The number of papers on ML applications to GW data has grown rapidly in recent years (See Cuoco, et al Mach. Learn.: Sci. Technol (2020) for a review or Survey4GWML (https://iphysresearch.github.io/Survey4GWML/)

  • When machine & deep learning meets GW astronomy:
    • Covering more parameter-space (Interpolation)
    • Automatic generalization to new sources (Extrapolation)
    • Resilience to real non-Gaussian noise  (Robustness)

    • Acceleration of existing pipelines (Speed, <0.1ms)

  • Task: Whether or not a given noisy data contain a GW signal?  (classification problem)

 
 

Stimulated background noises

  • CNNs structure performs as well as the standard matched filter method in extracting BBH signals under non-ideal conditions.

Past attempts on stimulated noise

Classification

Feature extraction

Convolutional Neural Network (ConvNet or CNN)

  • Fine-tune the Convolutional Neural Network

 

  • A glimpse of model interpretability using data visualization.

 

  • Identify what kind of feature is learned.

\(\rightarrow\) Deeper means better. But no more than ~3 layer.  Marginal!

Visualization for the high-dimensional feature maps of learned network in layers for bi-class using t-SNE.

the 1st layer

the 2nd layer

the 3rd layer

the last layer

\(\rightarrow\)High sensitivity on the merge part of GW waveform

\(\rightarrow\)Extracted features play a decisive role.

\text{Noise} > 0
\text{Noise} = 0

Occlusion Sensitivity

Past attempts on stimulated noise

Classification

Feature extraction

Convolutional Neural Network (ConvNet or CNN)

  • Fine-tune the Convolutional Neural Network

 

  • A glimpse of model interpretability using data visualization.

 

  • Identify what kind of feature is learned.

\(\rightarrow\) Deeper means better. But no more than ~3 layer.  Marginal!

\(\rightarrow\)High sensitivity on the merge part of GW waveform

\(\rightarrow\)Extracted features play a decisive role.

\text{Noise} > 0
\text{Noise} = 0

Occlusion Sensitivity

Effect of the number of the convolutional layers on signal recognizing accuracy.

Fine-tune Convolutional Neural Network

Visualization of the top activation on average at the \(3\)rd layer projected back to time domain using the deconvolutional network approach

The top activated

The top activated

Attempts on Real LIGO Noise

  • CNNs always works pretty good on stimulated noises.
  • However, when on real noises from LIGO, this approach does not work that well.
  • It's too sensitive against the background + hard to find GW events)

A specific design of the architecture is needed.

GW150914

GW151226

GW151012

Classification

Feature extraction

Convolutional Neural Network (ConvNet or CNN)

 MFCNN
 MFCNN
 MFCNN

Result on Real LIGO Noise

  • CNNs always works pretty good on stimulated noises.
  • However, when on real noises from LIGO, this approach does not work that well.
  • It's too sensitive against the background + hard to find GW events)

Classification

Feature extraction

Convolutional Neural Network (ConvNet or CNN)

A specific design of the architecture is needed.

GW150914

GW151226

GW151012

Motivation

  • With the closely related concepts between the templates and kernels , we attempt to address a question of:

Matched-filtering (cross-correlation with the templates) can be regarded as a convolutional layer with a set of predefined kernels.

>> Is it matched-filtering ?
>> Wait, It can be matched-filtering!
  • In practice, we use matched filters as an essential component of feature extraction in the first part of the CNN for GW detection.

Classification

Feature extraction

Convolutional Neural Network (ConvNet or CNN)

Motivation

(1/2) Review: artificial neural networks (linear part)

  • One sample vs one neural unit
\sum_{i} w_{i} x_{i}+b=w_{1} x_{1}+\cdots+w_{D} x_{D}+b
\underbrace{\left[\sum_{i} w_{i} x_{i}+b\right]}_{1 \times 1}=\underbrace{\left[\begin{array}{lll} \cdots & x_{i} & \cdots \end{array}\right]}_{1 \times D} \cdot \underbrace{\left[\begin{array}{c} \vdots \\ w_{i} \\ \vdots \end{array}\right]}_{D \times 1}+\underbrace{[b]}_{1 \times 1}
  • \(N\) samples vs one neural unit
\underbrace{\left[\begin{array}{c} \sum_{j} w_{j} x_{i j}+b_{j} \\ \vdots \\ \end{array}\right]}_{N \times 1}=\underbrace{\left[\begin{array}{ccc} \cdots & x_{i j} & \cdots \\ & \vdots \end{array}\right]}_{N \times D} \cdot \underbrace{\left[\begin{array}{c} \vdots \\ w_{j} \\ \vdots \end{array}\right]}_{D\times1}+\underbrace{\left[\begin{array}{c} \vdots \\ b_{j} \\ \vdots \end{array}\right]}_{N\times1}
\underbrace{\left[\begin{array}{lll}\sum_{i} w_{i j} x_{i}+b &\cdots &\end{array} \right]}_{D \times M}=\underbrace{\left[\begin{array}{lll} \cdots & x_{i} & \cdots \end{array}\right]}_{1 \times D} \cdot \underbrace{\left[\begin{array}{lll} \vdots \\ \cdots & w_{i j} & \cdots \\ \vdots \end{array}\right]}_{D \times M} +\underbrace{\left[\begin{array}{lll}b &\cdots &\end{array} \right]}_{\underset{\text{Broadcasting}}{1\times M}}
  • one samples vs \(M\) neural units (one layer)
  • \(N\) samples vs \(M\) neural units (one layer)
\underbrace{\begin{bmatrix} & \vdots& \\ \cdots& \hat{x}_{ik} &\cdots \\ & \vdots& \end{bmatrix}}_{N\times M} = f\left( \underbrace{\begin{bmatrix} & \vdots& \\ \cdots& x_{ij} &\cdots \\ & \vdots & \end{bmatrix}}_{N\times D} \cdot \underbrace{\begin{bmatrix} & \vdots& \\ \cdots& w_{jk} & \cdots\\ & \vdots & \end{bmatrix}}_{D\times M} + \underbrace{\begin{bmatrix} & \vdots & \\ \cdots & b_i & \cdots\\ & \vdots & \end{bmatrix}}_{\underset{\text{Broadcasting}}{N\times M}} \right)
\text{where }f\sim \text{non-linear operation}

Motivation

(2/2) Review: the many facets of convolution operation

\mathbf{s}=\left[\begin{array}{llll} s_{0} & s_{1} & \cdots & s_{7} \end{array}\right]=\left[\begin{array}{llll} x_{0} & x_{1} & \cdots & x_{4} \end{array}\right] \cdot\left[\begin{array}{cccccccc} w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 & 0 & 0 \\ 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 & 0 \\ 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 & 0 \\ 0 & 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} & 0 \\ 0 & 0 & 0 & 0 & w_{0} & w_{1} & w_{2} & w_{3} \end{array}\right]=\mathbf{x} \cdot \mathbf{w}
s(t)=\int x(\tau) w(t-\tau) d \tau=:(x * w)(t)
\begin{aligned} \left(a_{1} x_{1}+a_{1} x_{2}\right) * w &=a_{1}\left(x_{1} * w\right)+a_{2}\left(x_{2} * w\right) \,&(\text{linearity})\\ (x * w)(t-T) &=x(t-T) * w(t) \,&(\text{time invariance}) \end{aligned}
s[n]=\sum_{m=\max (0, n-D)}^{\min (n, M)} x[m] \cdot w[n-m], n=0,1, \ldots, D+M
\begin{aligned} &x[n], n=0,1, \ldots, D-1;\\ &w[n], n=0,1, \ldots, M-1 \end{aligned}
  • Flip-and-slide Form
  • Integral Form
  • Discrete Form
  • Matrix Form :
D=5, M=4
  • It corresponds to a convolutional layer in deep learning with one kernel (channel);
  • kernel size (K)  of 4; padding (P) is 3; stride (S) is 1.

Time domain

Frequency domain

\langle h|h \rangle = 4\int^\infty_0\frac{\tilde{h}(f)\tilde{h}^*(f)}{S_n(f)}df
\langle d|h \rangle (t) = 4\int^\infty_0\frac{\tilde{d}(f)\tilde{h}^*(f)}{S_n(f)}e^{2\pi ift}df
\int\tilde{x}_1(f) \cdot \tilde{x}_2(f) e^{2\pi ift}df= x_1(t)*x_2(t)
\int\tilde{x}_1(f) \cdot \tilde{x}^*_2(f) e^{2\pi ift}df= x_1(t)\star x_2(t)

Deep Learning Framework

x_1(t)*x_2^*(-t) = x_1(t)\star x_2(t)

Matched-filtering CNNs (MFCNN)

  • Transform matched-filtering method from frequency domain to time domain.
  • The square of matched-filtering SNR for a given data \(d(t) = n(t)+h(t)\):
\rho^2(t)\equiv\frac{1}{\langle h|h \rangle}|\langle d|h \rangle(t)|^2

\(S_n(|f|)\) is the one-sided average PSD of \(d(t)\)

(whitening)

where

Time domain

Frequency domain

(normalizing)

(matched-filtering)

\langle h|h \rangle = 4\int^\infty_0\frac{\tilde{h}(f)\tilde{h}^*(f)}{S_n(f)}df
\langle d|h \rangle (t) = 4\int^\infty_0\frac{\tilde{d}(f)\tilde{h}^*(f)}{S_n(f)}e^{2\pi ift}df
\langle h|h \rangle \sim [\bar{h}(t) \ast \bar{h}(-t)]|_{t=0}
\langle d|h \rangle (t) \sim \,\bar{d}(t)\ast\bar{h}(-t)
\bar{S_n}(t)=\int^{+\infty}_{-\infty}S_n^{-1/2}(f)e^{2\pi ift}df
\left\{\begin{matrix} \bar{d}(t) = d(t) * \bar{S}_n(t) \\ \bar{h}(t) = h(t) * \bar{S}_n(t) \end{matrix}\right.
\int\tilde{x}_1(f) \cdot \tilde{x}_2(f) e^{2\pi ift}df= x_1(t)*x_2(t)
\int\tilde{x}_1(f) \cdot \tilde{x}^*_2(f) e^{2\pi ift}df= x_1(t)\star x_2(t)

Deep Learning Framework

x_1(t)*x_2^*(-t) = x_1(t)\star x_2(t)

Matched-filtering CNNs (MFCNN)

  • Transform matched-filtering method from frequency domain to time domain.
  • The square of matched-filtering SNR for a given data \(d(t) = n(t)+h(t)\):
\rho^2(t)\equiv\frac{1}{\langle h|h \rangle}|\langle d|h \rangle(t)|^2

\(S_n(|f|)\) is the one-sided average PSD of \(d(t)\)

(whitening)

where

Time domain

Frequency domain

(normalizing)

(matched-filtering)

\langle h|h \rangle = 4\int^\infty_0\frac{\tilde{h}(f)\tilde{h}^*(f)}{S_n(f)}df
\langle d|h \rangle (t) = 4\int^\infty_0\frac{\tilde{d}(f)\tilde{h}^*(f)}{S_n(f)}e^{2\pi ift}df
\langle h|h \rangle \sim [\bar{h}(t) \ast \bar{h}(-t)]|_{t=0}
\langle d|h \rangle (t) \sim \,\bar{d}(t)\ast\bar{h}(-t)
\bar{S_n}(t)=\int^{+\infty}_{-\infty}S_n^{-1/2}(f)e^{2\pi ift}df
\left\{\begin{matrix} \bar{d}(t) = d(t) * \bar{S}_n(t) \\ \bar{h}(t) = h(t) * \bar{S}_n(t) \end{matrix}\right.
\int\tilde{x}_1(f) \cdot \tilde{x}_2(f) e^{2\pi ift}df= x_1(t)*x_2(t)
x_1(t)*x_2^*(-t) = x_1(t)\star x_2(t)
\int\tilde{x}_1(f) \cdot \tilde{x}^*_2(f) e^{2\pi ift}df= x_1(t)\star x_2(t)
  • In the 1-D convolution (\(*\)) on Apache MXNet, given input data with shape [batch size, channel, length] :
output[n, i, :] = \sum^{channel}_{j=0} input[n,j,:] \ast weight[i,j,:]

FYI:       \(N_\ast = \lfloor(N-K+2P)/S\rfloor+1\)

(A schematic illustration for a unit of convolution layer)

Deep Learning Framework

Matched-filtering CNNs (MFCNN)

  • Transform matched-filtering method from frequency domain to time domain.
  • The square of matched-filtering SNR for a given data \(d(t) = n(t)+h(t)\):
\rho^2(t)\equiv\frac{1}{\langle h|h \rangle}|\langle d|h \rangle(t)|^2

\(S_n(|f|)\) is the one-sided average PSD of \(d(t)\)

(whitening)

where

Time domain

Frequency domain

(normalizing)

(matched-filtering)

\langle h|h \rangle = 4\int^\infty_0\frac{\tilde{h}(f)\tilde{h}^*(f)}{S_n(f)}df
\langle d|h \rangle (t) = 4\int^\infty_0\frac{\tilde{d}(f)\tilde{h}^*(f)}{S_n(f)}e^{2\pi ift}df
\langle h|h \rangle \sim [\bar{h}(t) \ast \bar{h}(-t)]|_{t=0}
\langle d|h \rangle (t) \sim \,\bar{d}(t)\ast\bar{h}(-t)
\bar{S_n}(t)=\int^{+\infty}_{-\infty}S_n^{-1/2}(f)e^{2\pi ift}df
\left\{\begin{matrix} \bar{d}(t) = d(t) * \bar{S}_n(t) \\ \bar{h}(t) = h(t) * \bar{S}_n(t) \end{matrix}\right.
\int\tilde{x}_1(f) \cdot \tilde{x}_2(f) e^{2\pi ift}df= x_1(t)*x_2(t)
\int\tilde{x}_1(f) \cdot \tilde{x}^*_2(f) e^{2\pi ift}df= x_1(t)\star x_2(t)

Deep Learning Framework

modulo-N circular convolution

x_1(t)*x_2^*(-t) = x_1(t)\star x_2(t)

Matched-filtering CNNs (MFCNN)

  • Transform matched-filtering method from frequency domain to time domain.
  • The square of matched-filtering SNR for a given data \(d(t) = n(t)+h(t)\):
\rho^2(t)\equiv\frac{1}{\langle h|h \rangle}|\langle d|h \rangle(t)|^2
  • The structure of MFCNN:

Input

Output

C_0 = \mathop{\arg\max}_{C}\rho[1,C,N] \,,\\ N_0 = \mathop{\arg\max}_{N} \,\langle d \mid h\rangle[1,C_0,N]
  • In the meanwhile, we can obtain the optimal time \(N_0\) (relative to the input) of feature response of matching by recording the location of the maxima value corresponding to the optimal template \(C_0\).

Matched-filtering CNNs (MFCNN)

Input

Output

C_0 = \mathop{\arg\max}_{C}\rho[1,C,N] \,,\\ N_0 = \mathop{\arg\max}_{N} \,\langle d \mid h\rangle[1,C_0,N]
  • In the meanwhile, we can obtain the optimal time \(N_0\) (relative to the input) of feature response of matching by recording the location of the maxima value corresponding to the optimal template \(C_0\).
import mxnet as mx
from mxnet import nd, gluon
from loguru import logger

def MFCNN(fs, T, C, ctx, template_block, margin, learning_rate=0.003):
    logger.success('Loading MFCNN network!')
    net = gluon.nn.Sequential()         
    with net.name_scope():
        net.add(MatchedFilteringLayer(mod=fs*T, fs=fs,
                                      template_H1=template_block[:,:1],
                                      template_L1=template_block[:,-1:]))
        net.add(CutHybridLayer(margin = margin))
        net.add(Conv2D(channels=16, kernel_size=(1, 3), activation='relu'))
        net.add(MaxPool2D(pool_size=(1, 4), strides=2))
        net.add(Conv2D(channels=32, kernel_size=(1, 3), activation='relu'))    
        net.add(MaxPool2D(pool_size=(1, 4), strides=2))
        net.add(Flatten())
        net.add(Dense(32))
        net.add(Activation('relu'))
        net.add(Dense(2))
	# Initialize parameters of all layers
    net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx, force_reinit=True)
    return net
1 sec duration
35 templates used
  • The structure of MFCNN:

Matched-filtering CNNs (MFCNN)

Training Configuration and Search Methodology

  • We use SEOBNRE model [Cao et al. (2017), Liu et al. (2020)] to generate waveform, we only consider circular, spinless binary black holes.
  • The background noises for training/testing are sampled from a small closed set (33*4096s) in the first observation run (O1) in the absence of the segments (4096s) containing the first 3 GW events.
  • Every 5 seconds segment as input of our MF-CNN with a step size of 1 second.
  • The model can scan the whole range of the input segment and output a probability score.
  • In the ideal case, with a GW signal hiding in somewhere, there should be 5 adjacent predictions for it with respect to a threshold.

FYI: sampling rate = 4096Hz

templates waveforms (train/test)
Number 35 1610
Length (sec) 1 5
equal mass

input

Results: GW events in GWTC-1

  • Recovering the three GW events in O1.

Results: GW events in GWTC-1

  • Recovering the three GW events in O1.
  • Recovering the three GW events in O1.
  • Recovering all GW events in O2, even including GW170817 event.

Results: GW events in GWTC-1

  • Recovering the three GW events in O1.
  • Recovering all GW events in O2, even including GW170817 event.

Results: GW events in GWTC-1

  • Recovering the three GW events in O1.
  • Recovering all GW events in O2, even including GW170817 event.
  • (Some results on statistical significance on O1 data &​ glitches identification have skipped here)

 

  • Consider 33 events (with H1 detector data) in O3a (39 events in total), 4 of them are misclassified.
GW170817
GW190814
GW190412
  • Selected events:

Results: GW events in GWTC-1 & GWTC-2

GW170817
GW190814
GW190412

Summary & Lesson Learned

  • Some benefits from MFCNN architecture for GW detection:
  • Recovering all GW events in GWTC-1 and 29 of 33 GW events in GWTC-2 (39 events in total).
  • We realized that:
    • GW templates can be used as likely known features for signal recognition.
    • Generalization of both matched-filtering and neural networks.
    • Linear filters (i.e. matched-filtering) in signal processing can be rewritten as neural layers (i.e. CNNs).
  • Recovering the three GW events in O1.
  • Recovering all GW events in O2, even including GW170817 event.
  • (Some results on statistical significance on O1 data &​ glitches identification have skipped here)

 

  • Consider 33 events (with H1 detector data) in O3a (39 events in total), 4 of them are misclassified.
  • Selected events:

Results: GW events in GWTC-1 & GWTC-2

Trends & Outlook

 

Proof-of-principle studies

Production search studies

Current paradigm:

  1. Change your input data:  Time series vs time-frequency domain
  2. Pre-feature-extraction to lower the bar
  3. Use more powerful model (“Give Me”):  ResNet/Wavenet/LSTM...
 
 

More related works, see 2005.03745 or Survey4GWML (https://iphysresearch.github.io/Survey4GWML/)

Drawbacks:

  • The output of neural network (Softmax function) is not an optimal and reasonable detection statistic.
  • "The negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. And the correctly classified examples will contribute little to the overall training cost."
    —— I. Goodfellow, Y. Bengio, A. Courville. Deep Learning.
 
 

Softmax function

Score

Pred.

Noise

Noise + Signal

Pred.

Possible ways to resolve the problem:

  • Bridge the gap between ML and MF (MFCNN, 2104.03961)
  • Fast posterior density estimation with ML (He et al. Big Data Mining and Analytics 5, no. 1 (2022): 53–63.)
  • ...
 
 

Detection of early inspiral of GW

 

for _ in range(num_of_audiences):
    print('Thank you for your attention! 🙏')

Trends & Outlook

 

Proof-of-principle studies

Production search studies

Current paradigm:

  1. Change your input data:  Time series vs time-frequency domain
  2. Pre-feature-extraction to lower the bar
  3. Use more powerful model (“Give Me”):  ResNet/Wavenet/LSTM...
 
 

More related works, see 2005.03745 or Survey4GWML (https://iphysresearch.github.io/Survey4GWML/)

Drawbacks:

  • The output of neural network (Softmax function) is not an optimal and reasonable detection statistic.
  • "The negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. And the correctly classified examples will contribute little to the overall training cost."
    —— I. Goodfellow, Y. Bengio, A. Courville. Deep Learning.
 
 

Softmax function

Score

Pred.

Noise

Noise + Signal

Pred.

Possible ways to resolve the problem:

  • Bridge the gap between ML and MF (MFCNN, 2104.03961)
  • Fast posterior density estimation with ML (in preparation)
  • ...
 
 

Detection of early inspiral of GW