Galaxy Community Conference 2014

Johns Hopkins, Baltimore, USA

Conference report by Peter van Heusden <pvh@sanbi.ac.za>

GCC2014: Social Media? Yes!

  • #GCC2014 website links to blogs and a Storify of the Twitter stream
  • Online spaces are key to Galaxy 'management':
    • Galaxy user support: Galaxy Biostar
    • Last remaining email holdout: Galaxy-Dev mailing list (for how long?)

GCC2014: Hackathon

  • 40 people
  • Large pile of snacks, coffee and drinks
  • Two (+) days
  • Outcomes:
    • Most active 'idea' cards on Trello board:
      1. Docker: wrapping Galaxy tools or Galaxy itself in a Docker container
      2. API: BioBlend Python wrapper around the Galaxy API
  • Great interaction with fellow developers (core/non-core)
  • Architecture hacking:
    • Sub-workflows
    • Collections as Galaxy tool inputs (and outputs)
    • Thoughts on federated Galaxy (GalaxyFarm)

    GCC2014: GALAXY ARCHITECTURE IDEAS

    • The current Galaxy architecture still assumes a user clicking 'submit' on a job
    • At scale, this isn't a good model:
      • Workflow runner: workflows are run as jobs, so the fastest turnaround time is the time to execute one job (typically at least 30 seconds on a cluster)
      • Galaxy front-end / back-end splitting
        • Front-end increasingly written in JavaScript with Backbone, Underscore and Handlebars: more responsive UI
        • Pulsar (formerly LWR) can run on the compute resource with no shared filesystem, and communicates with the front-end using RabbitMQ (used for Galaxy Main)
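
To make the turnaround point concrete, here is a rough sketch. It assumes every workflow step is dispatched as a separate cluster job and that each job costs at least 30 seconds, so the cost accumulates along the longest chain of dependent steps. The workflow shape and step names are hypothetical; only the 30-second figure comes from the notes above.

```python
from functools import lru_cache

# Lower bound on per-job turnaround on a cluster, taken from the notes above.
JOB_OVERHEAD_SECONDS = 30

# Hypothetical workflow: each step maps to the steps it depends on.
WORKFLOW = {
    "fastqc": [],
    "trim": [],
    "align": ["trim"],
    "call_variants": ["align"],
    "annotate": ["call_variants"],
    "report": ["fastqc", "annotate"],
}


def critical_path_length(workflow):
    """Return the number of steps on the longest dependency chain."""

    @lru_cache(maxsize=None)
    def depth(step):
        deps = workflow[step]
        return 1 + max((depth(d) for d in deps), default=0)

    return max(depth(step) for step in workflow)


def minimum_turnaround(workflow, overhead=JOB_OVERHEAD_SECONDS):
    """Per-job overhead alone, assuming unlimited parallel job slots."""
    return critical_path_length(workflow) * overhead


print(minimum_turnaround(WORKFLOW))  # 5 chained steps * 30 s = 150
```

Even with unlimited cluster slots, this six-step workflow cannot finish in under 150 seconds, because five of its steps must run one after another.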

    BCBIO-NEXTGEN ARCHITECTURE

    • NGS analysis automated (but not by Galaxy)
    • Disk usage scaling:
      • Piping analysis steps together saves disk I/O
      • Avoid storing intermediate data: it can be 6x the size of the final data
      • bcbio-nextgen scaling driven by the need to process 1500 WGS samples
    • Parallelisation:
      • Split variant calling by callable region
      • "Smart" parallelism driven by knowledge of underlying data
    • Configurability is the enemy of scaling
      • Supporting lots of scenarios makes optimisation hard
    • bcbio-nextgen now available as Docker image and runnable as Galaxy tool
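
The idea of splitting work by callable region can be sketched as follows. This is a minimal illustration, not bcbio-nextgen's actual implementation: the `split_regions` helper, the chunk size and the coordinates are all hypothetical.

```python
from typing import List, Tuple

# (chromosome, start, end) with a half-open interval [start, end).
Region = Tuple[str, int, int]


def split_regions(regions: List[Region], max_size: int) -> List[Region]:
    """Split each callable region into chunks of at most max_size bases."""
    chunks = []
    for chrom, start, end in regions:
        pos = start
        while pos < end:
            chunk_end = min(pos + max_size, end)
            chunks.append((chrom, pos, chunk_end))
            pos = chunk_end
    return chunks


# Hypothetical callable regions (made-up coordinates).
callable_regions = [("chr1", 0, 2500), ("chr2", 100, 1100)]

for chunk in split_regions(callable_regions, max_size=1000):
    print(chunk)
```

Each chunk never crosses a region boundary, so it can be handed to an independent variant-calling worker and the results merged afterwards.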

    STATE OF THE GALAXY 1

    • Developments in the toolshed
    • Investigations in new ways of job running
    • Galaxy Biostar: Galaxy Q&A
    • New work on visualisations
    • Data managers: tools for handling reference data
    • Data collections (and parameter sweeps)

    GALAXY TOOLSHED, 2014 EDITION

    • Toolshed is replacing hand-editing of tool_conf.xml
    • Recommended to run local toolshed (SANBI does)
      • Tool development happens in Mercurial repos
      • Galaxy server tool updates don't require restarts
    • Tools can have dependencies:
      • On other tool repositories (other tools or data types)
      • On packages (a package manager for Galaxy?)
        • Huge role for Docker here in future
    • Galaxy tools can be moved between repositories as .tar.gz
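
Repository dependencies imply an installation order: a repository can only be installed once everything it depends on is in place. A minimal sketch of that ordering using Python's standard `graphlib` module is shown below. The repository names are hypothetical, and this is not a description of how the toolshed itself resolves dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical repositories: each maps to the repositories it depends on.
dependencies = {
    "bwa_wrapper": {"package_bwa", "package_samtools"},
    "package_bwa": set(),
    "package_samtools": {"package_zlib"},
    "package_zlib": set(),
}


def install_order(deps):
    """Return an order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())


order = install_order(dependencies)
print(order)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is a convenient place to report a broken dependency definition.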

    NEW WAYS OF JOB RUNNING

    1. Pulsar (ex-LWR) (John Chilton): job-execution environment for when you don't have a shared filesystem
      • Appears to Galaxy as just another job runner
      • Future direction: workflow engine needed!
    2. Galaxy on a resource-limited cluster (Nikolay Vazov, UiO):
      • Integration with Gold Allocation Manager
      • Essentially a fork of Galaxy; required substantial modification to the code base
      • Exposes resource limits to Galaxy users, integrates with resource accounting
    Galaxy & the users

    • Galaxy-User mailing list discontinued
    • Replaced by Galaxy Biostar site
      • Allows easier searching for commonly answered questions
      • Support still appears to be driven by Galaxy Team
    • Galaxy usability: 
      • Trello cards open calling for extensions to track user interaction with Galaxy
      • Difficult to work seriously on the topic without detailed Human-Computer Interaction studies
    • Galaxy User BoF: 
      • What is a User? Given the large number of technical users, technical questions tend to dominate "how do I?" space

    VISUALISATION

    • Galaxy Charts allows in-browser visualisation of tabular data
      • supports area, bar, box, line, pie charts and scatter plots
    • Work ongoing to support visualisations installable as tools
    • Not only Trackster, now also:
      • PhyloViz: phylogeny tree viewer
      • Circster: Circos visualisations
      • Sweepster: integrates parameter sweep exploration of genomic regions with Trackster
    • Documentation is still patchy

    DATA MANAGERS

    • Galaxy install has tool-data/ directory that configures shared tool data
      • e.g. available genome builds
    • Data managers allow configuring this from the web interface
      • In the Admin panel (for site-admins only)
      • Run in similar way to normal tools, but output of run isn't in History, but rather visible via Admin panel
      • Can be installed from a toolshed
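
As an illustration of the kind of shared tool data involved, the sketch below reads a small table of genome builds. It assumes the data is stored as tab-separated rows with `#` comment lines; the column names, paths and the `read_data_table` helper are hypothetical and not taken from any particular Galaxy install.

```python
import csv
import io

# Hypothetical contents of a tab-separated table listing genome builds.
SAMPLE_TABLE = (
    "# value\tdbkey\tname\tpath\n"
    "hg19\thg19\tHuman (hg19)\t/data/genomes/hg19/hg19.fa\n"
    "\n"
    "mm10\tmm10\tMouse (mm10)\t/data/genomes/mm10/mm10.fa\n"
)

# Assumed column layout for this example.
COLUMNS = ["value", "dbkey", "name", "path"]


def read_data_table(handle, columns):
    """Parse tab-separated rows into dicts, skipping blanks and comments."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        rows.append(dict(zip(columns, fields)))
    return rows


builds = read_data_table(io.StringIO(SAMPLE_TABLE), COLUMNS)
print([build["dbkey"] for build in builds])  # ['hg19', 'mm10']
```

A data manager takes over the job of maintaining entries like these, so that a site admin no longer edits them by hand.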

    DATASET COLLECTIONS

    • Galaxy core now supports dataset collection types:
      • List
      • Pair (forward / reverse)
    • Allow operation on a collection:
      • List can be used as tool input
    • Tools that need paired reads can take single pair param
    • Support in the code (and via the API) is currently further along than support in the UI
    • Much work needed
      • Filtering on collections, collections as tool outputs, workflow re-run / resume
      • Parameter sets (like Sweepster, but more general)
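
The two collection types and the "list as tool input" idea can be sketched in a few lines. This models the concept only and is not Galaxy's internal data model: `Pair`, `map_over` and `fake_aligner` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    """A forward / reverse pair of datasets (file names stand in for datasets)."""
    forward: str
    reverse: str


def map_over(collection: List[Pair], tool: Callable[[Pair], str]) -> List[str]:
    """Run a tool once per list element, keeping the element order."""
    return [tool(element) for element in collection]


def fake_aligner(pair: Pair) -> str:
    """Stand-in for a tool that takes a single pair parameter."""
    return f"aligned({pair.forward},{pair.reverse})"


# A list of pairs: one element per sample.
samples = [
    Pair("s1_R1.fastq", "s1_R2.fastq"),
    Pair("s2_R1.fastq", "s2_R2.fastq"),
]

print(map_over(samples, fake_aligner))
```

The tool itself only knows about a single pair; the list handling lives outside it, which is what lets one tool definition serve both a single sample and a whole batch.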

    CONCLUSION

    • Large-scale Galaxy is clearly on the roadmap for the future
      • New workflow scheduler and execution engine
      • Continue and extend work on collection types
      • Develop UI elements for working at scale
    • Galaxy can do many quite different things
      • Sometimes by forking Galaxy code (but then you lose community)
      • Not all code paths are equal: frequently used code paths get bugs noticed and fixed earlier
    • Non-Galaxy approaches continue to proliferate: YABI, bcbio-nextgen, Arvados, ADAM
      • Lots to learn from

    SANBI GALAXY PLANS

    • SANBI committed to working on making Galaxy more capable of expressing the workflows we already use
    • Combination of bug fixing and internals enhancement
    • One post-doc, one developer (hopefully two soon)
    • Key goals align with those of parts of the Galaxy community
      • Workflows as first-class objects
      • Data collections everywhere
      • Workflows scheduled as workflows
      • Fine grained execution tracking
      • Data handling improvements (is piping possible?)
      • New on-disk storage types (Apache Avro?)
    • Continue collaboration with Pipelines group
