Exposing ToxCast as a friendly OpenTox API

 

by Daniel Bachler (Douglas Connect)

daniel@douglasconnect.com

 


Talk outline

  • What problems does it solve?
  • Demo
  • Tools used & lessons learned
  • Next steps

 

How we organize data in general

  • Encyclopedias - great for many things, but
  • slow to look up just one thing
  • updating means republishing everything
  • every language is a new set of books

Downsides of data as zip files

  • Even simple code needs non-trivial parsing (ToxCast: 20 CSV files). This slows down development, e.g. of machine learning models.
  • Parsing code must be re-implemented in every language
  • Data acquisition often manual (not automatically reproducible)
  • Annoying to find overlapping compounds tested in several databases (implement N parsers, harmonize, ...)
  • Updating means republishing everything

ToxCast as an OpenTox API

  • JSON over HTTP (REST)
  • Can be accessed with a browser over the internet
  • Or consumed from workflow tools, machine learning software, ...
  • Has rich filtering: query only the data you need and get it right away
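
As a rough sketch of what consuming such an API from code could look like, the Python snippet below builds a filtered query URL and defines a helper that would fetch the JSON result. The base URL, the `compounds` path and the use of `casn` as a filter parameter are placeholders chosen for illustration; the real endpoint and parameter names may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL; replace with the address of the real service.
BASE_URL = "https://toxcast.example.org/api"


def build_query_url(resource, **filters):
    """Return the URL for `resource`, with `filters` as query parameters."""
    url = f"{BASE_URL}/{resource}"
    if filters:
        # Sort the filters so the same query always yields the same URL.
        url += "?" + urlencode(sorted(filters.items()))
    return url


def fetch_json(url):
    """Fetch `url` over HTTP and decode the JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


# Ask only for the compound with one CAS number (assumed filter name).
url = build_query_url("compounds", casn="80-05-7")
print(url)
# With a real service behind BASE_URL: compounds = fetch_json(url)
```

Because the whole query lives in the URL, the same address can be pasted into a browser or handed to a workflow tool.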

Let's take a look

What about a nice data browser?

(work in progress)

Use Cases

  • Get data into KNIME directly from the API (DEMO!)
  • Query compounds in ToxCast from code (DEMO!)
  • Find compounds common in two data sources (DEMO!)
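
The third use case can be sketched in a few lines of Python once both sources return JSON records. This is only an illustration: it assumes both APIs expose the CAS number in a `casn` field (the talk shows this field for ToxCast only), and the records below are hand-written placeholders, not actual API responses.

```python
def index_by_cas(records, cas_field):
    """Map CAS number -> record, skipping records without a CAS number."""
    return {r[cas_field]: r for r in records if r.get(cas_field)}


def common_compounds(source_a, source_b, field_a="casn", field_b="casn"):
    """Return the sorted CAS numbers that occur in both sources."""
    a = index_by_cas(source_a, field_a)
    b = index_by_cas(source_b, field_b)
    return sorted(a.keys() & b.keys())


# Illustrative records only (the chid values are made up).
toxcast = [
    {"chid": 1, "chnm": "Bisphenol A", "casn": "80-05-7"},
    {"chid": 2, "chnm": "Formaldehyde", "casn": "50-00-0"},
    {"chid": 3, "chnm": "Unnamed mixture", "casn": ""},
]
toxrefdb = [
    {"chnm": "Bisphenol A", "casn": "80-05-7"},
    {"chnm": "Caffeine", "casn": "58-08-2"},
]

print(common_compounds(toxcast, toxrefdb))  # ['80-05-7']
```

Records with an empty CAS number are dropped before matching, since the ToxCast definition allows an empty string there.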

Let's recap

Advantages of data APIs

  • Instant access to data / metadata (No unzipping, parsing...)
  • Works with any programming language / workflow tool
  • Works over standard internet protocols (passes firewalls)
  • Same code can be used to expose data publicly or within an institution

Behind the scenes: Zip file ⇨ API

  1. Write OpenAPI/Swagger definition
  2. Write small importer to download official zip, store it in datastore (currently Elasticsearch for ToxCast and ToxRefDB)
  3. Generate scaffold for API with Swagger tools (in our case Python Flask)
  4. Implement API (ToxCast & ToxRef: about 150 LOC of Python)
  5. Write Dockerfiles & Kubernetes descriptions for easy deployment and sharing
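
The parsing half of step 2 might look roughly like the sketch below, which uses only the Python standard library. It is not the importer used for ToxCast: the download and the Elasticsearch indexing are left out, and the file name and columns in the demo archive are made up so the example runs without network access.

```python
import csv
import io
import zipfile


def read_csv_records(zip_bytes):
    """Yield (file name, row dict) for every CSV row in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore READMEs and other non-CSV files
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    yield name, dict(row)


def make_demo_zip():
    """Build a tiny in-memory zip so the sketch runs without a download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("compounds.csv", "chnm,casn\nBisphenol A,80-05-7\n")
        archive.writestr("README.txt", "not a csv")
    return buffer.getvalue()


records = list(read_csv_records(make_demo_zip()))
print(records)
```

Each yielded row is a plain dict, so it can be serialized as a JSON document for whichever datastore is used.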

Ontologies

  • Use the x- extension syntax in the Swagger definition to annotate JSON result fields with ontology terms. This will help with searching for and matching compatible data and modelling services
definitions:
  Compound:
    type: object
    properties:
      chid:
        type: integer
        description: "Internal identifier of compounds within ToxCast"
      chnm:
        type: string
        description: "Chemical name (as stored in ToxCast)"
      casn:
        type: string
        description: "CAS Number as stored in ToxCast. Can be an empty string if no valid CAS number is stored in ToxCast."
        x-ontology: http://edamontology.org/data_3102
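
To illustrate how a client or discovery service might use these annotations, the Python sketch below walks a definition and collects every field that carries an `x-ontology` term. The definition is the one above, written as a Python dict with descriptions omitted; the helper `ontology_terms` is a hypothetical name, not part of any Swagger tooling.

```python
# The Compound definition from the slide as a Python dict
# (descriptions omitted for brevity).
definitions = {
    "Compound": {
        "type": "object",
        "properties": {
            "chid": {"type": "integer"},
            "chnm": {"type": "string"},
            "casn": {
                "type": "string",
                "x-ontology": "http://edamontology.org/data_3102",
            },
        },
    }
}


def ontology_terms(definitions):
    """Map "Definition.field" -> ontology term for every annotated field."""
    terms = {}
    for def_name, definition in definitions.items():
        for field, schema in definition.get("properties", {}).items():
            if "x-ontology" in schema:
                terms[f"{def_name}.{field}"] = schema["x-ontology"]
    return terms


print(ontology_terms(definitions))
# {'Compound.casn': 'http://edamontology.org/data_3102'}
```

Two services whose fields map to the same term could then be treated as candidates for matching.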

We also realized

  • Planning for interactive discovery is crucial
  • Precise configuration via URLs (data filtering, ...) is great for passing on to other services

What is next for us

  • Finish ToxRefDB, ToxCast, Open TG-GATEs APIs
  • Collect feedback, iterate and improve
  • Build on and extend integration with CPSign modelling service as a case study
  • Build a web interface that allows non-programmers to build APIs from CSV files
  • Build a central discovery service so that compatible data sources, modelling services and utilities can be found and matched automatically across the internet

We could use your help :-)

  • Play with it!
  • Let us know what else you need
  • If you have the know-how, implement another data source as a compatible API (we will gladly help!)
  • Tell others about it - standards work best when many people know about them

Let's make data access easy!

Thank you!
