Deep neural networks and GW signal recognition

He Wang (王赫)

ITP-CAS, Webinar, Aug 13th, 2020

Based on: PhD thesis (HTML); 10.1103/PhysRevD.101.104003

Title:
Deep neural networks and GW signal recognition

Abstract:
Deep learning is a neural-network-inspired pattern recognition technique that can be as effective as conventional signal processing, and it has shown considerable potential for identifying gravitational-wave (GW) signals. In this talk, I will first review related work on the detection and characterization of GW signals, together with some fundamental probabilistic theory of machine learning. I will then present our recent paper (DOI: 10.1103/PhysRevD.101.104003) on the matched-filtering convolutional neural network (MFCNN) we proposed for GW recognition and on its generalization properties when identifying gravitational waves. Finally, I will briefly cover some ongoing work and plans.


Content

  • GW astronomy & data analysis
    • Challenges and Opportunities
  • How machine learning works for GW detection
    • A little bit of theory on ML
    • Past attempts on simulated/real LIGO noise
  • How MFCNN works for GW detection
    • Matched filtering SNR vs Convolution
    • Search results on the O1/O2 data
  • Ongoing works & future plans
🤝
 

Observational Experiment

Theoretical Modeling

Data Analysis

GW astronomy & data analysis


GW Event Detections

(Timeline figure of detections and catalogs)

  • O1: GW150914, GW151226, GW151012 (formerly LVT151012)
  • O2: GW170729, GW170809, GW170818, GW170823, ...
  • Additional candidates: GW170121, GW170304, GW170721, (GW151205)
  • O3: ...
  • Catalogs: GWTC-1 (2019), 1-OGC (2019); GWTC-2 (?), 2-OGC (2020); ...

Anomalous non-Gaussian transients, known as glitches

Lack of GW templates

Inadequate matched-filtering method

A threshold is placed on the SNR loss when building the template bank: the discreteness of the bank costs at most 3% of the SNR.

\text{SNR}=2\left[\int_{0}^{\infty} \frac{|\widetilde{h}(f)|^{2}}{S_{n}(f)} d f\right]^{1 / 2}

Noise power spectral density
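As a reference point for the formula above, here is a minimal numpy sketch of the optimal-SNR integral; the template array `h_td`, the sampling step `delta_t`, and the PSD array `psd` (sampled on the matching one-sided frequency grid) are assumed inputs, not part of the original slides.

```python
import numpy as np

def optimal_snr(h_td, psd, delta_t):
    """SNR = 2 * [ int_0^inf |h~(f)|^2 / S_n(f) df ]^{1/2}, evaluated as a discrete sum.

    `psd` is the one-sided noise PSD S_n(f) sampled on the rfft frequency grid of `h_td`
    (assumed strictly positive there, e.g. restricted to the detector band).
    """
    n = len(h_td)
    h_f = np.fft.rfft(h_td) * delta_t            # discrete approximation of the Fourier transform
    delta_f = 1.0 / (n * delta_t)
    return 2.0 * np.sqrt(np.sum(np.abs(h_f) ** 2 / psd) * delta_f)
```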

Matched-filtering technique:

The optimal detection technique for known signal templates in Gaussian, stationary detector noise.

credits G. Guidi

GW Detection: Challenges and Opportunities

Real-time / low-latency analysis on raw big data

Anomalous non-Gaussian transients, known as glitches

Lack of GW templates

Inadequate matched-filtering method

The 4-D search parameter space in O1 covered by the template bank is restricted to circular binaries for which the spins of the components are aligned (or antialigned) with the orbital angular momentum of the binary.

 

~250,000 template waveforms are used.

 

The template that best matches GW150914

GW Detection: Challenges and Opportunities

Real-time / low-latency analysis on raw big data

Anomalous non-Gaussian transients, known as glitches

Lack of GW templates

Inadequate matched-filtering method

How many "trash" events?

LIGO L1 and H1 trigger rates during O1

A 'blip' glitch

GW Detection: Challenges and Opportunities

Real-time / low-latency analysis on raw big data

Anomalous non-Gaussian transients, known as glitches

Lack of GW templates


Inadequate matched-filtering method

A new era of multi-messenger astronomy

GW170817: Very long inspiral "chirp" (>100 s) firmly detected by the LIGO-Virgo network.

 

GRB 170817A: 1.74\(\pm\)0.05 s later, a weak short gamma-ray burst observed by Fermi (also detected by INTEGRAL).

First LIGO-Virgo alert 27 minutes later.

 

GW Detection: Challenges and Opportunities

Anomalous non-Gaussian transients, known as glitches

Lack of GW templates

Inadequate matched-filtering method

Covering more parameter-space (interpolation)

 
 

Automatic generalization to new sources (extrapolation)

Resilience to real non-Gaussian noise  (Robustness)

Acceleration of existing pipelines

(Speed, <0.1ms)

...

 

 

Why Machine Learning ?

 

Proof-of-principle studies

Production search studies

Milestones

Real-time / low-latency analysis of the raw big data

For more related works, see Survey4GWML (https://iphysresearch.github.io/Survey4GWML/)

GW Detection: Challenges and Opportunities

How machine learning works for GW detection

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

  • What is "learning" by definition?  (data-driven algorithm)

Tom M. Mitchell (1997)

  • Tasks:
 

Whether or not a given noisy data segment contains a GW signal (classification problem)

  • Measure:
 

The accuracy of classification: the number of correct predictions divided by the total number of predictions (supervised learning)

  • Experience:
 

Characteristic features of the GW waveform (representation learning)

Sampling rate = 4096 Hz

For a GW sample (1 sec): \(n=4096\)

How machine learning works for GW detection

Dataset containing \(N\) examples sampled from the true but unknown data-generating distribution \(p_{\text {data }}(\mathbf{x})\):

\mathbf{X}=\left\{\mathbf{x}^{(i)} \mid i=1,2, \cdots, N\right\}, \quad \mathbf{x} \in \mathbb{R}^{n}

with corresponding ground-truth labels:

\mathbf{Y}=\left\{\mathbf{y}^{(i)} \mid i=1,2, \cdots, N\right\}, \quad \mathbf{y} \in \{0, 1\}

A machine learning model is nothing but a map \(f\) from samples to labels,

\mathbf{X} \longrightarrow \hat{\mathbf{Y}}=f(\mathbf{X} ; \Theta), \qquad \mathbf{x} \longmapsto \hat{\mathbf{y}}=f(\mathbf{x} ; \Theta) \in \mathbb{R},

where \(\Theta\) denotes the parameters of the model and the outputs are the predicted labels

\hat{\mathbf{Y}}=\left\{\hat{\mathbf{y}}^{(i)} \mid i=1,2, \cdots, N\right\}, \quad 0\le \hat{\mathbf{y}} \le 1,

described by \(p_{\text {model }}(\hat{\mathbf{y}} \mid \mathbf{x} ; \boldsymbol{\Theta})\), a parametric family of probability distributions over the same space indexed by \(\Theta\).


Objective:

  • For each sample, \(\hat{\mathbf{y}}^{(i)} \rightarrow \mathbf{y}^{(i)}\).
  • Find the best \(\Theta\) such that \(p_{\text {model }}(\mathbf{x} ; \boldsymbol{\Theta}) \rightarrow p_{\text {data }}(\mathbf{x})\).

How machine learning works for GW detection

For a classification problem, we always use the maximum likelihood estimator for \(\Theta\),

\begin{aligned} \boldsymbol{\theta}_{\mathrm{ML}} &= \arg \max _{\boldsymbol{\theta}} p_{\text {model }}(\mathbf{Y} \mid \mathbf{X} ; \boldsymbol{\theta}) \\ &=\arg \max _{\boldsymbol{\theta}} \prod_{i=1}^{N} p_{\text {model }}\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)} ; \boldsymbol{\theta}\right) \\ &=\arg \max _{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p_{\text {model }}\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)} ; \boldsymbol{\theta}\right) \\ &=\arg \max _{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text {data }}} \log p_{\text {model }}(\mathbf{y} \mid \mathbf{x} ; \boldsymbol{\theta}), \end{aligned}

to construct the cost function \(J(\Theta)\) (also called the loss function or error function):

\boldsymbol{J}(\boldsymbol{\theta})=-\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text {data }}} \log p_{\text {model }}(\mathbf{y} \mid \mathbf{x} ; \boldsymbol{\theta}), \qquad \boldsymbol{\theta}_{\mathrm{ML}}=\arg \min _{\boldsymbol{\theta}} \boldsymbol{J}(\boldsymbol{\theta})

FYI: Minimizing the KL divergence corresponds exactly to minimizing the cross-entropy (the negative log-likelihood of a Bernoulli/softmax distribution) between the two distributions.
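For the binary (signal vs. noise) case, this cost function is just the binary cross-entropy. A minimal numpy sketch, with `y_true` and `y_pred` as placeholder label/score arrays:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """J(theta) = -E[log p_model(y | x; theta)] for a Bernoulli model with mean y_pred."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)        # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```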

Extra: ABC of Machine Learning

(Schematic of a unit of our model/network)

  • Input: a sequence \(\mathbf{x} = \{a_1, a_2, \dots, a_n\}\)
  • Map / Algorithm: our model/network, \(\mathbf{y} = f(\mathbf{w}\cdot\mathbf{x}+\mathbf{b})\)
  • Output: a number \(\mathbf{y} \in(0 ,1)\), i.e. "Yes or No"
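To make the schematic concrete, here is a minimal sketch of such a map as a single sigmoid neuron; the weights, bias, and random strain sample are illustrative placeholders, not values from the talk.

```python
import numpy as np

def f(x, w, b):
    """y = sigmoid(w . x + b): squash a whole input sequence into a score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)          # a 1-second sample at 4096 Hz, so n = 4096
w = rng.standard_normal(4096) * 1e-2   # placeholder weights (learned in practice)
score = f(x, w, 0.0)                   # "does this segment contain a GW signal?"
```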

Past attempts on simulated noise (1/3)

Classification

Feature extraction

Convolutional neural network (ConvNet or CNN)

  • Deeper means better?  No more than 3 layers.

Marginal!

Visualization of the high-dimensional feature maps learned by the network in each layer for the binary classification task, using t-SNE.

Fine-tune Convolutional Neural Network

\mathbf{y} = f(\mathbf{w}\cdot\mathbf{x}+\mathbf{b})

Past attempts on simulated noise (2/3)

Classification

Feature extraction

Convolutional neural network (ConvNet or CNN)

  • Deeper means better?  No more than 3 layers.
  • A glimpse of model interpretability using visualization.

Extracted features play a decisive role.

\text{Noise} = 0
\text{Noise} > 0

Visualization of the top activation on average at the \(3\)rd layer, projected back to the time domain using the deconvolutional network approach.

\mathbf{y} = f(\mathbf{w}\cdot\mathbf{x}+\mathbf{b})

Marginal!

(The same visualization is shown for inputs with \(\text{SNR} = \infty\), \(\text{SNR} = 1\), and \(\text{SNR} = 0\).)

Past attempts on simulated noise (3/3)

Classification

Feature extraction

Convolutional neural network (ConvNet or CNN)

  • Deeper means better?  No more than 3 layers.
  • A glimpse of model interpretability using visualization.

Extracted features play a decisive role.

Occlusion Sensitivity

  • Identify what kind of feature is learned.

High sensitivity to the peak features of GW.

\mathbf{y} = f(\mathbf{w}\cdot\mathbf{x}+\mathbf{b})

Marginal!
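As a sketch of how such an occlusion-sensitivity map can be produced (the `model` callable and the patch size are illustrative assumptions, not the exact setup from the thesis):

```python
import numpy as np

def occlusion_sensitivity(model, x, patch=256):
    """Zero out successive patches of the input and record how much the score drops.

    `model` is any trained classifier mapping a 1-D strain array to a score in (0, 1);
    large drops mark the parts of the waveform (e.g. the merger peak) the network relies on.
    """
    base = model(x)
    drops = np.zeros(len(x) // patch)
    for i in range(len(drops)):
        x_occ = x.copy()
        x_occ[i * patch:(i + 1) * patch] = 0.0      # occlude one patch
        drops[i] = base - model(x_occ)
    return drops
```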


Past attempts on simulated/real LIGO noise

Classification

Feature extraction

Convolutional neural network (ConvNet or CNN)

  • However, when applied to real noise from LIGO, this approach does not work as well.

(too sensitive to the background + hard to find the GW events)

A specific design of the architecture is needed [as in Timothy D. Gebhard et al. (2019)].


 MFCNN


Our Motivation

  • Given the closely related concepts of templates and kernels, we attempt to address the question:

Matched-filtering (cross-correlation with the templates) can be regarded as a convolutional layer with a set of predefined kernels.

>> Is it matched-filtering?
>> Wait, it can be matched-filtering!

Classification

Feature extraction

Convolutional neural network (ConvNet or CNN)

  • In practice, we use matched filters as an essential component of feature extraction in the first part of the CNN for GW detection.
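To illustrate this idea, here is a minimal PyTorch sketch of matched filtering as a conv1d layer whose kernels are a fixed template bank; the random "templates", their number, and their length are stand-in values, not the configuration used in the paper.

```python
import torch
import torch.nn.functional as F

C, K, N = 8, 4096, 5 * 4096                   # number of templates, template length, input length
templates = torch.randn(C, 1, K)              # stand-ins for whitened template waveforms
templates = templates / templates.norm(dim=-1, keepdim=True)   # crude <h|h> normalization

d = torch.randn(1, 1, N)                      # whitened input strain: [batch, channel, length]

# conv1d in deep learning frameworks is a cross-correlation, so one pass with the
# predefined (non-trainable) template kernels gives the matched-filter responses.
response = F.conv1d(d, templates, padding=K - 1)   # shape [1, C, N + K - 1]
rho_like = response.abs()                          # SNR-like time series, one per template
```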

Matched-filtering in time domain

  • The square of the matched-filtering SNR for given data \(d(t) = n(t)+h(t)\):

\rho^2(t)\equiv\frac{1}{\langle h|h \rangle}|\langle d|h \rangle(t)|^2

Frequency domain:

\langle h|h \rangle = 4\int^\infty_0\frac{\tilde{h}(f)\tilde{h}^*(f)}{S_n(f)}df \qquad \text{(normalizing)}

\langle d|h \rangle (t) = 4\int^\infty_0\frac{\tilde{d}(f)\tilde{h}^*(f)}{S_n(f)}e^{2\pi ift}df \qquad \text{(matched-filtering)}

Time domain:

\langle h|h \rangle \sim [\bar{h}(t) \ast \bar{h}(-t)]|_{t=0}

\langle d|h \rangle (t) \sim \,\bar{d}(t)\ast\bar{h}(-t)

where (whitening)

\bar{S}_n(t)=\int^{+\infty}_{-\infty}S_n^{-1/2}(f)e^{2\pi ift}df \,, \qquad \left\{\begin{matrix} \bar{d}(t) = d(t) * \bar{S}_n(t) \\ \bar{h}(t) = h(t) * \bar{S}_n(t) \end{matrix}\right.

and \(S_n(|f|)\) is the one-sided average PSD of \(d(t)\).

Convolution (\(*\)) vs. cross-correlation (\(\star\)):

\int\tilde{x}_1(f) \cdot \tilde{x}_2(f) e^{2\pi ift}df= x_1(t)*x_2(t)

x_1(t)*x_2^*(-t) = x_1(t)\star x_2(t)

\int\tilde{x}_1(f) \cdot \tilde{x}^*_2(f) e^{2\pi ift}df= x_1(t)\star x_2(t)

In a deep learning framework, the convolution is a modulo-N circular convolution.
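A minimal numpy sketch of these relations (whiten, then cross-correlate); overall constant factors are dropped and the PSD array is assumed to be sampled on the rfft grid of the data, so only the shape of \(\rho(t)\) is meaningful here:

```python
import numpy as np

def whiten(x, psd):
    """x_bar(t): apply the S_n^{-1/2}(f) filter, i.e. divide by sqrt(PSD) in the frequency domain."""
    return np.fft.irfft(np.fft.rfft(x) / np.sqrt(psd), len(x))

def mf_rho(d, h, psd):
    """Schematic rho(t) = |<d|h>(t)| / sqrt(<h|h>) from whitened cross-correlations."""
    d_bar, h_bar = whiten(d, psd), whiten(h, psd)
    d_h = np.correlate(d_bar, h_bar, mode="same")   # <d|h>(t) ~ cross-correlation of whitened series
    h_h = np.dot(h_bar, h_bar)                      # <h|h> ~ zero-lag autocorrelation of h_bar
    return np.abs(d_h) / np.sqrt(h_h)
```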

Matched-filtering in time domain

  • In the 1-D convolution (\(*\)), given input data with shape [batch size, channel, length] :
output[n, i, :] = \sum^{channel}_{j=0} input[n,j,:] \ast weight[i,j,:]

FYI: the output length is \(N_\ast = \lfloor(N-K+2P)/S\rfloor+1\), where \(N\) is the input length, \(K\) the kernel size, \(P\) the padding, and \(S\) the stride.

(A schematic illustration for a unit of convolution layer)
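A quick sanity check of this shape rule with an arbitrary conv1d unit (the channel counts and sizes here are illustrative only):

```python
import torch
import torch.nn as nn

N, K, P, S = 5 * 4096, 4096, 0, 1                 # input length, kernel size, padding, stride
conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=K, stride=S, padding=P)
x = torch.randn(8, 1, N)                          # [batch size, channel, length]
out = conv(x)
assert out.shape[-1] == (N - K + 2 * P) // S + 1  # N_* = floor((N - K + 2P)/S) + 1
```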


Matched-filtering Convolutional Neural Network (MFCNN)

Input

Output

  • The structure of MFCNN:
C_0 = \mathop{\arg\max}_{C}\rho[1,C,N] \,,\\ N_0 = \mathop{\arg\max}_{N} \,\langle d \mid h\rangle[1,C_0,N]
  • Meanwhile, we can obtain the optimal matching time \(N_0\) (relative to the input) by recording the location of the maximum response corresponding to the optimal template \(C_0\).
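A short sketch of this readout, assuming an SNR-like response array \(\rho\) of shape [1, C, N] (the shapes and random values below are placeholders):

```python
import torch

C, N = 35, 16385
rho = torch.randn(1, C, N).abs()              # placeholder matched-filter responses, [1, C, N]

peak_per_template, _ = rho.max(dim=-1)        # best response of each template over time
C0 = int(peak_per_template.argmax(dim=-1))    # C_0 = argmax_C rho[1, C, N]
N0 = int(rho[0, C0].argmax())                 # N_0 = argmax_N <d|h>[1, C_0, N]
```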

Training Configuration and Search Methodology

  • The background noise for training/testing is sampled from a fixed set (33 × 4096 s) of data from the first observing run (O1), excluding the 4096 s segments that contain the first three GW events.

FYI: sampling rate = 4096Hz

  • We use the SEOBNRE model [Cao et al. (2017)] to generate waveforms; we consider only circular, spinless binary black holes.
                 Template      Waveform (train/test)
  Number         35            1610
  Length (s)     1             5
  Mass ratio     equal mass
  • The total mass and mass ratio of the training/test data and of the templates are shown below. The 11 GW events from O1 and O2 are also shown.
 
  • Sensitivity estimation: ROC curve (true positive rate vs. false alarm rate)
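A minimal sketch of how such a ROC curve can be traced out from the network's scores on signal and noise samples (the score arrays are assumed inputs):

```python
import numpy as np

def roc_curve(scores_signal, scores_noise, thresholds=np.linspace(0.0, 1.0, 101)):
    """Sweep the decision threshold and return (false alarm rate, true positive rate) pairs."""
    far = np.array([(scores_noise > t).mean() for t in thresholds])
    tpr = np.array([(scores_signal > t).mean() for t in thresholds])
    return far, tpr
```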

Training Configuration and Search Methodology

  • Every 5-second segment is taken as input to our MFCNN, with a step size of 1 second.
  • The model scans the whole range of the input segment and outputs a probability score.
  • In the ideal case, with a GW signal hidden somewhere in the data, there should be 5 adjacent predictions above the threshold (see the sketch below).
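A minimal sketch of this sliding-window search, with a hypothetical trained `model` standing in for the MFCNN:

```python
import numpy as np

def sliding_segments(strain, fs=4096, window_sec=5, step_sec=1):
    """Cut the strain into overlapping 5 s segments with a 1 s step."""
    window, step = window_sec * fs, step_sec * fs
    starts = range(0, len(strain) - window + 1, step)
    return np.stack([strain[s:s + window] for s in starts])

# Hypothetical usage: one probability score per segment.
# probs = np.array([model(seg) for seg in sliding_segments(strain)])
```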



Search Results on the Real LIGO Recordings

  • Recovering three GW events in O1.
  • Recovering all GW events in O2, even including the GW170817 event.
  • Our MFCNN can also clearly mark the newly reported GW190412 and GW190814 events in O3a.
  • Statistical significance on O1
    • Count a group of adjacent predictions as one "trigger block".
    • For pure background (non-Gaussian), a monotonically decreasing trend with the number of adjacent predictions should be observed.
    • In the ideal case, with a GW signal hidden somewhere in the data, there should be 5 adjacent predictions above the threshold; accordingly, a bump at 5 adjacent predictions appears in the O1 trigger-block histogram.
  • Glitch identification
    • According to the GravitySpy dataset, there are 7368 glitches in the O1 data, and only 812 of them fall in our trigger set, i.e. about 90% of the known instrumental glitches are distinguishable.
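A sketch of this trigger-block bookkeeping over the per-segment scores (the threshold value and the score array are placeholders):

```python
import numpy as np

def trigger_blocks(probs, threshold=0.5):
    """Group adjacent above-threshold predictions into 'trigger blocks'; return block lengths."""
    blocks, run = [], 0
    for above in probs > threshold:
        if above:
            run += 1
        elif run:
            blocks.append(run)
            run = 0
    if run:
        blocks.append(run)
    return blocks

# np.bincount(trigger_blocks(probs)) gives the histogram of block lengths, where a real
# 5 s signal scanned with a 1 s step should contribute to the bump at 5.
```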
 

Conclusions

 
  • Some benefits from MF-CNN architecture:

    • Simple configuration for GW data generation and almost no data pre-processing.

    • It works on a non-stationary background.
    • Easy parallel deployment; multiple detectors can benefit a lot from this design.

    • Efficient searching with a fixed window.
  • The main understanding of the algorithm:
    • GW templates can be used as likely features for matching.
    • Generalization of both matched-filtering and neural networks.
    • Matched-filtering can be rewritten as a convolutional neural layer.
 
  • Parameter estimation (the current "holy grail" of machine learning for GWs).

  • Machine learning search for continuous GWs \(\rightarrow\) space-based detector.

😐
 
 
 
  • An improved MFCNN for:

    • higher sensitivity,

    • lower false-alarm rate (FAR; a metric for its estimation is urgently needed),

    • and more kinds of GW sources beyond CBC.

 
 

Ongoing works & future plans

 

Dr. Chris Messenger (University of Glasgow):

"For me, it seems completely obvious that all data analysis will be ML in 5-10 years."

arXiv:2008.03312

Deep neural networks and GW signal recognition

By He Wang

Deep neural networks and GW signal recognition

ITP-CAS, Webinar, Aug 13th, 2020
