UNSUPERVISED FEATURE CONSTRUCTION AND KNOWLEDGE EXTRACTION FROM GENOME-WIDE ASSAYS OF BREAST CANCER WITH DENOISING AUTOENCODERS

Weiyuan

2018.02.01

Outline

  • Research problem
  • Background knowledge
  • Key challenge
  • Experiment
  • Conclusion

Outline

  • Research problem
  • Background knowledge
  • Key challenge
  • Experiment
  • Conclusion

Research problem

Emerging Techniques

  • Modern genomic technologies enable researchers to acquire vast amounts of data.
  • Artificial Neural Networks (ANNs) offer new ways to perform feature extraction.

Old methods

  • Traditional methods focus on known unknowns, e.g. discovering a new disease-related gene, but they are not well suited to discovering new biological processes.


Emerging Techniques vs. Old Methods

Applying an ANN (a denoising autoencoder, dA) to a large dataset can provide new insights.

Outline

  • Research problem
  • Background knowledge
  • Key challenge
  • Experiment
  • Conclusion

Background Knowledge

Before we start ...

Autoencoder

Kudos to image author(s)

Autoencoder

Kudos to image author(s)


Autoencoder

Kudos to image author(s)

Nonlinearity
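
As a reminder of where the nonlinearity enters, here is the standard autoencoder formulation (notation follows the common convention for autoencoders, not anything specific to this deck):

\[
y = s(Wx + b), \qquad z = s(W'y + b'), \qquad L(x, z) = \lVert x - z \rVert^2,
\]

where \(x\) is the input expression vector, \(s\) is a nonlinear activation such as the sigmoid, \(y\) is the hidden representation (the constructed features), and \(z\) is the reconstruction of \(x\).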

dA - Denoising Autoencoder

Kudos to image author(s)

dA - Denoising Autoencoder

Kudos to image author(s)

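
Below is a minimal sketch of a denoising autoencoder in Python/NumPy: a single sigmoid hidden layer with tied weights, masking noise on the input, and squared-error reconstruction. All hyperparameters, shapes, and names are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """Single-hidden-layer dA with tied weights and masking noise (sketch)."""

    def __init__(self, n_visible, n_hidden, corruption=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.corruption = corruption

    def corrupt(self, X):
        # Masking noise: randomly zero out a fraction of the input values.
        mask = self.rng.random(X.shape) >= self.corruption
        return X * mask

    def encode(self, X):
        return sigmoid(X @ self.W + self.b_hid)       # hidden features

    def decode(self, H):
        return sigmoid(H @ self.W.T + self.b_vis)     # reconstruction

    def fit(self, X, epochs=50, lr=0.1):
        for _ in range(epochs):
            Xc = self.corrupt(X)
            H = self.encode(Xc)
            Z = self.decode(H)
            # Backprop of the squared-error loss through the sigmoids.
            dZ = (Z - X) * Z * (1.0 - Z)
            dH = (dZ @ self.W) * H * (1.0 - H)
            gW = Xc.T @ dH + dZ.T @ H                 # tied-weight gradient
            self.W -= lr * gW / len(X)
            self.b_hid -= lr * dH.mean(axis=0)
            self.b_vis -= lr * dZ.mean(axis=0)
        return self

# Toy usage: 100 samples x 20 genes, 8 hidden nodes.
X = np.random.default_rng(1).random((100, 20))
features = DenoisingAutoencoder(20, 8).fit(X).encode(X)   # per-sample features
```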

Outline

  • Research problem
  • Background knowledge
  • Key challenge
  • Experiment
  • Conclusion

Key Challenge

Doctor: You need to take a pound of broccoli stalk because node 64726 in our network said so.

Something has gone wrong!

Interpretation is crucial

  • Linking features to sample characteristics
  • Linking features to transcription factors
  • Linking features to patient survival*
  • Linking features to biological pathways

Interpretation is crucial

  • Linking features to sample characteristics

Sample Characteristics

  • ER status (ER: estrogen receptor; estrogen is the primary female sex hormone)
  • Molecular subtypes
  • ...

How

  • For each feature \(f_i\), i.e. the output of hidden node \(n_i\), compute its output range \([lower_i, upper_i]\) on the training dataset.
  • Split the range \([lower_i, upper_i]\) with 10 equally spaced thresholds \(T_i = \{t_i^1, t_i^2, \ldots, t_i^{10}\}\).
  • Binarize the feature \(f_i\) into \(\{true, false\}\) at each threshold \(t_i^j \in T_i\).
  • Select the features (and thresholds) with the most predictive power for the target characteristic (a sketch follows below).
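
A sketch of this thresholding procedure in Python/NumPy. Function and variable names are illustrative assumptions, not from the paper, and predictive power is scored here with a simple balanced accuracy against a binary label such as ER status.

```python
import numpy as np

def candidate_thresholds(feature, n=10):
    """10 equally spaced thresholds spanning the feature's training range."""
    lower, upper = feature.min(), feature.max()
    return np.linspace(lower, upper, n)

def balanced_accuracy(pred, label):
    sensitivity = pred[label == 1].mean()
    specificity = 1.0 - pred[label == 0].mean()
    return 0.5 * (sensitivity + specificity)

def best_split(feature, label):
    """Best threshold and its score for one hidden-node feature."""
    scores = []
    for t in candidate_thresholds(feature):
        binarized = feature > t            # True/False per sample
        score = max(balanced_accuracy(binarized, label),
                    balanced_accuracy(~binarized, label))  # allow either polarity
        scores.append((score, t))
    return max(scores)

# features: samples x hidden nodes (dA activities); er_status: 0/1 per sample (toy data).
features = np.random.default_rng(0).random((200, 100))
er_status = np.random.default_rng(1).integers(0, 2, 200)
ranked = sorted(range(features.shape[1]),
                key=lambda i: best_split(features[:, i], er_status)[0],
                reverse=True)
print("most predictive nodes:", ranked[:5])
```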

Interpretation is crucial

  • Linking features to sample characteristics
  • Linking features to transcription factors
  • Linking features to patient survival
  • Linking features to biological pathways

Interpretation is crucial

  • Linking features to patient survival

The output features behave bimodally 

Cut off at 0.5 for each feature

Binarize each feature at this cutoff and evaluate its predictive power for patient survival (a sketch follows below).
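
A sketch of the survival link, assuming per-sample follow-up times and event indicators are available. It uses the log-rank test from the lifelines package to compare the two groups defined by the 0.5 cutoff; the package choice and all names are my assumptions, not the paper's.

```python
import numpy as np
from lifelines.statistics import logrank_test

def survival_association(feature, time, event, cutoff=0.5):
    """Log-rank p-value for high vs. low activity of one dA feature."""
    high = feature > cutoff                      # binarize at 0.5
    result = logrank_test(time[high], time[~high],
                          event_observed_A=event[high],
                          event_observed_B=event[~high])
    return result.p_value

# Toy data: dA feature activity, follow-up time (months), death indicator.
rng = np.random.default_rng(0)
feature = rng.random(300)
time = rng.exponential(60, 300)
event = rng.integers(0, 2, 300)
print(survival_association(feature, time, event))
```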

Interpretation is much important

  • Linking features to sample characteristics
  • Linking feature to transcription factors
  • Linking feature to patient survival
  • Linking feature to biological pathways

Outline

  • Research problem
  • Background knowledge
  • Key challenge
  • Experiment
  • Conclusion

Datasets

  • METABRIC (2/3) - discovery: \(2136 \times 2520\)
  • METABRIC (1/3) - validation: \(2136 \times 2520\)
  • TCGA - evaluation: \(547 \times 2520\)
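
A rough sketch of how such a discovery/validation split over samples might be produced; a toy matrix stands in for the real expression data, and the shapes, orientation, and names are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.random((300, 50))          # toy stand-in: samples x genes expression matrix

idx = rng.permutation(expr.shape[0])  # shuffle sample indices
n_disc = (2 * len(idx)) // 3          # 2/3 discovery, remaining 1/3 validation
discovery, validation = expr[idx[:n_disc]], expr[idx[n_disc:]]
print(discovery.shape, validation.shape)
```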

Figure: performance of hidden nodes in classifying tumor vs. normal samples.

Figure: performance of hidden nodes in classifying ER+ vs. ER- samples.

Outline

  • Research problem
  • Background knowledge
  • Key challenge
  • Experiment
  • Conclusion

Conclusion

  • Features automatically generated by the dA have good predictive power.
  • Unsupervised models can fill gaps left by current methods.
  • Future work may apply supervised learning on top of these automatically generated features to improve accuracy.


By Weiyüen Wu