Project Cognoma

Datathon Hack Night

July 12, 2016

Industrious, 230 S Broad Street
6:00–8:30 PM

What does Project Cognoma do?

Image derived from (CC BY)

genes / features / predictors / x

tumors / samples / observations

mutation / status / outcome / response / y

Use Case from Cancer Biologist

  1. Robert's lab is interested in maintenance of mitochondria in cancer.
  2. Interested in developing a therapeutic to inhibit mitophagy. This is likely to have low toxicity.
  3. A priori genelist - PARK2, PINK1, NIX, BNIP3, DRP1, PARL

Goal: Find "Hidden Responders"

- Pathology samples

- Cell lines

- PDX Models

Cares about classifier performance:

Time + $$$$

Project Cognoma:


  1. "Putting machine learning in the hands of cancer biologists"
    • Our tagline and ultimate goal
  2. Project Cognoma is a community project
    • We all benefit from the collective expertise to build a superior product
    • Open science!
  3. Everyone learns something new
    • We will be using cutting edge tech
    • Encourage cross-talk between groups and open collaboration

Current Progress on GitHub

Structure of this Datathon

  1. Break into groups
  2. Groups are assigned topics
  3. Feel free to redefine groups and add new topics
  4. Report on GitHub at the end
  5. Short (~3 minute) report to the group to close the night

Tonight: Breakout Session

  1. Data
  2. Machine Learning
  3. Backend group
    • Django
    • Task Service
  4. Frontend group
    • JavaScript webapp
  5. Design
  6. Community & Management

Data Group

  1. What is the ideal format of the data?
  2. What processing is necessary to get the data into a tidy format?
  3. Can we begin to explore the data?
    • What does it look like?
    • Can we perform some exploratory analyses?
  4. How should we prepare and preprocess the data? Do we need to do normalization?
  5. What is the licensing of the TCGA data we're using? Can we release data as CC0?
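One possible starting point for the format and normalization questions, as a minimal sketch only: long-format records are pivoted into a samples-by-genes matrix, and each gene column is then standardized. The record layout, the toy expression values, and the helper names `pivot` and `zscore_columns` are assumptions for illustration, not the project's actual data or pipeline. The sketch also assumes every sample has a value for every gene.

```python
from statistics import mean, pstdev

# Hypothetical long-format records: (sample_id, gene, expression value).
# The values are made up for illustration.
records = [
    ("sample_1", "PINK1", 2.0),
    ("sample_1", "PARK2", 5.0),
    ("sample_2", "PINK1", 4.0),
    ("sample_2", "PARK2", 5.0),
    ("sample_3", "PINK1", 6.0),
    ("sample_3", "PARK2", 8.0),
]


def pivot(records):
    """Return (samples, genes, matrix): one row per sample, one column per gene."""
    values = {(sample, gene): value for sample, gene, value in records}
    samples = sorted({sample for sample, _, _ in records})
    genes = sorted({gene for _, gene, _ in records})
    matrix = [[values[(sample, gene)] for gene in genes] for sample in samples]
    return samples, genes, matrix


def zscore_columns(matrix):
    """Standardize each gene (column) to mean 0 and standard deviation 1."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sigma = mean(column), pstdev(column)
        # A constant column has sigma == 0; map it to zeros instead of dividing by zero.
        scaled_columns.append([(v - mu) / sigma if sigma else 0.0 for v in column])
    # Transpose back so rows are samples again.
    return [list(row) for row in zip(*scaled_columns)]


samples, genes, matrix = pivot(records)
normalized = zscore_columns(matrix)
```

Whether z-scoring is the right normalization for the real data is exactly the open question in item 4; the sketch only shows where such a step would sit.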

Machine Learning Group

  1. We need supervised machine learning algorithms for binary outcome data. There will be between 10,000 and 30,000 features (genes). There will be between 100 and 11,000 samples. The status will likely be highly unbalanced (e.g. 100 positives & 3000 negatives).
  2. What are possible algorithms (regression, SVM, etc)?
  3. How should we address overfitting?
  4. Should we only report cross-validated performance?
  5. Is scikit-learn a good package to start with?
  6. Can we start drafting the design of the algorithm chooser?
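For the questions on imbalance and cross-validated performance, one common option is stratified k-fold splitting, which keeps the positive/negative ratio roughly equal in every fold. The block below is a standard-library sketch of that idea, not a chosen design; `stratified_folds` is a hypothetical helper. If the group adopts scikit-learn, its `StratifiedKFold` class in `sklearn.model_selection` serves the same purpose.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds with a similar class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy labels matching the imbalance example above: 100 positives, 3000 negatives.
labels = [1] * 100 + [0] * 3000
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold serves once as the held-out set while the other four are used for training, so every reported score comes from samples the model did not see during fitting.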

Backend Group

  1. Familiarize yourselves with prior work
  2. What technologies should the backend use?
  3. Refine the architecture
  4. What technologies would be good to use from a pedagogical perspective, e.g. Docker?

Frontend Group

  1. What technologies should the frontend use?
  2. How should we design the JavaScript webapp?
  3. What are the best ways to coordinate frontend development?
  4. How can we make a GUI query builder for Hetionet so researchers can identify a set of genes even if they don't know Cypher?
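To make the query-builder idea concrete, here is a rough sketch of what could sit behind such a GUI: the user picks a node type and a name from form controls, and a function assembles a parameterized Cypher query. The function name `build_gene_query` is hypothetical, and the labels and relationship types in the allow-list (`Gene`, `Pathway`, `Disease`, `PARTICIPATES_GpPW`, `ASSOCIATES_DaG`) are assumptions that should be checked against the actual Hetionet schema. Python is used here only for illustration; the same logic could live in the JavaScript webapp.

```python
def build_gene_query(node_label, node_name, relationship):
    """Build a parameterized Cypher query for genes linked to one named node."""
    # Labels and relationship types cannot be passed as Cypher parameters,
    # so they are checked against an allow-list before being interpolated.
    # These pairs are assumed, not verified against the Hetionet schema.
    allowed = {
        ("Pathway", "PARTICIPATES_GpPW"),
        ("Disease", "ASSOCIATES_DaG"),
    }
    if (node_label, relationship) not in allowed:
        raise ValueError(f"unsupported selection: {node_label}, {relationship}")
    query = (
        f"MATCH (g:Gene)-[:{relationship}]-(n:{node_label}) "
        "WHERE n.name = $name "
        "RETURN DISTINCT g.name AS gene ORDER BY gene"
    )
    # The user-supplied name travels as a query parameter, never as query text.
    return query, {"name": node_name}


# Hypothetical selection made through the GUI.
query, params = build_gene_query("Pathway", "example pathway", "PARTICIPATES_GpPW")
```

Passing the name as `$name` keeps free-text input out of the query string, which matters once researchers can type arbitrary search terms.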

Design Group

  1. Can we create a project logo? Including a favicon?
  2. Do we need accounts? Are there login-free methods of query preservation? See this issue.

Community & Management

  1. How can we ensure lots of contributions from lots of contributors?
  2. What is currently making it difficult to contribute or join the project?
  3. What is the best way to coordinate and manage the community?
  4. How do we make sure everyone learns something?

Happy Hacking!

See these slides at:

Cognoma Datathon Meetup on July 12, 2016

By Daniel Himmelstein

Outline and tasks for the Cognoma Datathon Meetup on July 12, 2016, held at Industrious in Philadelphia. This presentation is released under CC0 unless otherwise noted.
