Daniel Himmelstein
Head of Data Integration at Related Sciences. Digital craftsman of the biodata revolution.
https://doi.org/cgwt
Why L1000 Platform?
Scope of resource
In total, we generated 1,319,138 L1000 profiles from 42,080 perturbagens (
corresponding to 25,200 biological entities (
for a total of 473,647 signatures (consolidating replicates)
Which Genes
Figure S1A: Evaluating dimensionality reduction. Simulation showing the mean percentage (+/− SEM) of 33 benchmark connections recovered from an inferred CMap pilot dataset as a function of number of landmark genes used to train the inference model, indicating that around 1000 landmarks were sufficient to recover 82% of expected connections
Figure 2C: Signature generation and data levels.
(I) Raw bead count and fluorescence intensity measured by Luminex scanners.
(II) Deconvoluted data to assign expression levels to two transcripts measured on the same bead.
(IIIa) Normalization to adjust for non-biological variation.
(IIIb) Inferred expression of 12,328 genes from measurement of 978 landmarks.
(IV) Differential expression values.
(V) Signatures representing collapse of replicate profiles.
Figure 2C: Validation of L1000 probes using shRNA knockdown. MCF7 and PC3 cells transduced with shRNAs targeting 955 landmark genes. Differential expression values (z-scores) were computed for each landmark, and the percentile rank of expression z-scores in the experiment in which it was targeted relative to all other experiments was computed. 841/955 genes (88%) rank in the top 1% of all experiments and 907/955 (95%) rank in the top 5%. ... Middle panel: z-score distribution from all targeted (orange) and non-targeted (white) genes. Distribution from the targeted set is significantly lower than non-targeted (p value < 10−16).
See however https://github.com/dhimmel/lincs/issues/4: imputed target genes are not differentially-expressed in the expected direction
As CSNK1A1 was among the 3,799 genes subjected to shRNA-mediated knockdown, we used the CMap to generate a signature of CSNK1A1 loss of function (LoF). We then queried all compounds in the database against this signature to identify perturbations that phenocopied CSNK1A1 loss. One unannotated compound, BRD-1868, showed strong connectivity to CSNK1A1 knockdown in two cell types. This suggested that BRD-1868 might function as a novel CSNK1A1 inhibitor.
Figure 6B: Discovery of novel CSNK1A1 inhibitor.
https://doi.org/ch4b
there is one sample, the starboard crew vent, that appears distinct from all of the other samples...
The three most abundant families in the starboard crew vent sample are Bacteroidaceae, Ruminococcaceae, and Verrumicrobiaceae (comprising 60.1% of all sequences)
See https://doi.org/ch4r
https://github.com/emreg00/repurpose
The Gist:
Existing studies often assume that the drugs that are in the test set will also appear in the training set, a rather counter-intuitive assumption as, in practice, one is often interested in predicting truly novel drug-disease associations (i.e. for drugs that have no known indications previously). We challenge this assumption by evaluating the effect of having training and test sets in which none of the drugs in one overlaps with the drugs in the other. Accordingly, we implement a disjoint cross validation fold generation method that ensures that the drug-disease pairs are split such that none of the drugs in the training set appear in the test set
Fig. 4. Prediction accuracy (AUC) when each similarity feature used individually in disjoint cross validation. Error bars show standard deviation of AUC over ten runs of ten-fold cross validation.
Luo et al. use an independent set of drug-disease associations, yet, 95% of the drugs in the independent set are also in the original data set (109 out of 115).
On the other hand, Gottlieb et al. create the folds such that 10% of the drugs are hidden instead of 10% of the drug-disease pairs, but they do not ensure that the drugs used to train the model are disjoint from the drugs in the test set.
Previous Approaches
See https://doi.org/ch44
Proceedings of the VLDB Endowment, November 2017
Snorkel Architecture
Declarative Labeling Functions: Snorkel includes a library of declarative operators that encode the most common weak supervision function types, based on our experience with users over the last year. These functions capture a range of common forms of weak supervision, for example:
this approach relies on a selection threshold hyperparameter ϵ which induces a tradeoff space between predictive performance and computational cost.
By Daniel Himmelstein
This presentation is released under a CC BY 4.0 License.
Head of Data Integration at Related Sciences. Digital craftsman of the biodata revolution.