Galaxy Community Conference 2014
Johns Hopkins, Baltimore, USA
Conference report by
Peter van Heusden <pvh@sanbi.ac.za>
GCC2014: Social Media? YES!
- #GCC2014 website links to blogs and a Storify of the Twitter stream
- Online spaces key to Galaxy 'management':
- IRC: #galaxyproject on irc.freenode.net
- Trello boards for Galaxy development and hackathon
- Galaxy user support: Galaxy Biostar
- Last remaining email holdout:
- Galaxy-Dev mailing list (for how long?)
GCC2014: Hackathon
- 40 people
- Large pile of snacks, coffee and drinks
- Two (+) days
- Outcomes:
- Most active 'idea' cards on Trello board:
- Docker: wrapping Galaxy tools or Galaxy itself in Docker container
- API: BioBlend Python wrapper around API
- Sub-workflows
- Collections as Galaxy tool inputs (and outputs)
- Thoughts on federated Galaxy (GalaxyFarm)
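A minimal sketch of what driving Galaxy through BioBlend can look like. `GalaxyInstance` and `histories.get_histories()` are BioBlend calls; the helper names `connect` and `history_names`, the server URL and the API key are placeholders assumed for illustration.

```python
def connect(url, api_key):
    """Return a BioBlend client for a Galaxy server (requires `pip install bioblend`)."""
    # Imported lazily so that history_names() below can be used without BioBlend installed.
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def history_names(gi):
    """Return the names of the histories visible to the API key's user."""
    return [history["name"] for history in gi.histories.get_histories()]


# Example use against a real server (placeholder URL and key):
#   gi = connect("https://galaxy.example.org", "YOUR_API_KEY")
#   print(history_names(gi))
```

Keeping the connection step separate from the query makes the query easy to exercise against a stub object, without a running Galaxy server.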
GCC2014: GALAXY ARCHITECTURE IDEAS
- Current Galaxy architecture still assumes a user clicking 'submit' on each job
- At scale, this isn't a good model:
- Workflow runner: workflows are run as jobs, therefore the fastest turnaround time is the time to execute a job (typically at least 30 seconds on a cluster)
- Galaxy front-end / back-end splitting
- Front-end increasingly written in JavaScript with Backbone, Underscore and Handlebars: more responsive UI
- Pulsar (formerly LWR) can run on the compute resource without a shared filesystem and communicates with the front-end using RabbitMQ (used for Galaxy Main)
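A back-of-the-envelope illustration of why per-job turnaround matters at scale. It assumes each workflow step is dispatched as its own cluster job, that steps run one after another, and that every job pays the roughly 30 second minimum quoted above; the step times are invented.

```python
def min_workflow_turnaround(step_seconds, job_overhead_s=30.0):
    """Lower bound on wall time when every step pays the job overhead."""
    return sum(job_overhead_s + t for t in step_seconds)


# Four quick steps with 5 seconds of real work in total (made-up numbers).
steps = [1.0, 2.0, 0.5, 1.5]
total = min_workflow_turnaround(steps)
print(total)  # 125.0 seconds, of which 120 are scheduling overhead
```

Under these assumptions the overhead dominates short workflows, which is the argument for scheduling the workflow as a unit.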
BCBIO-NEXTGEN ARCHITECTURE
- NGS analysis automated (but not by Galaxy)
- Disk usage scaling:
- Pipelining analysis steps together saves disk I/O
- Avoid storing intermediate data: 6x size of final data
- bcbio-nextgen scaling driven by need to process 1500 WGS
- Parallelisation:
- Split variant calling by callable region
- "Smart" parallelism driven by knowledge of underlying data
- Configurability is the enemy of scaling
- Supporting lots of scenarios makes optimisation hard
- bcbio-nextgen now available as Docker image and runnable as Galaxy tool
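The idea of splitting variant calling by callable region can be sketched as below. This is not bcbio-nextgen code: the depth threshold, the toy depth vector and the placeholder `call_variants` function are assumptions for illustration, and threads stand in for cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def callable_regions(depths, min_depth=1):
    """Return half-open (start, end) intervals where coverage >= min_depth."""
    regions, start = [], None
    for pos, depth in enumerate(depths):
        if depth >= min_depth and start is None:
            start = pos                      # a callable region opens here
        elif depth < min_depth and start is not None:
            regions.append((start, pos))     # region closes before this position
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


def call_variants(region):
    """Placeholder for a real variant caller: reports the region and its length."""
    start, end = region
    return (start, end, end - start)


depths = [0, 0, 5, 7, 6, 0, 0, 0, 4, 4]      # toy per-base coverage
regions = callable_regions(depths)
with ThreadPoolExecutor() as pool:
    # Each callable region is processed independently of the others.
    results = list(pool.map(call_variants, regions))
print(results)
```

The split points come from the data itself (stretches with no coverage), which is the sense in which the parallelism is "smart" rather than a fixed chunk size.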
STATE OF THE GALAXY 1
- Developments in the toolshed
- Investigations in new ways of job running
- Galaxy/biostar: Galaxy Q&A
- New work on visualisations
- Data managers: tools for handling reference data
- Data collections (and parameter sweeps)
GALAXY TOOLSHED, 2014 EDITION
- Toolshed is replacing hand-editing of tool_conf.xml
- Recommended to run local toolshed (SANBI does)
- Tool development happens in Mercurial repos
- Galaxy server tool updates don't require restarts
- Tools can have dependencies:
- On other tool repositories (other tools or data types)
- On packages (a package manager for Galaxy?)
- Huge role for Docker here in future
- Galaxy tools can be moved between repositories as .tar.gz
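Repository and package dependencies imply an installation order. A minimal sketch of resolving one with Python's standard `graphlib` (Python 3.9+) follows; the repository names are hypothetical and this is not how the toolshed itself is implemented.

```python
from graphlib import TopologicalSorter

# Each key depends on the repositories in its set (hypothetical names).
deps = {
    "bwa_wrapper": {"package_bwa", "datatypes_fastq"},
    "package_bwa": {"package_zlib"},
    "datatypes_fastq": set(),
    "package_zlib": set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if two repositories depend on each other, which is the case an installer would have to reject.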
NEW WAYS OF JOB RUNNING
- Pulsar (ex-LWR) (John Chilton): job-execution environment for when you don't have a shared filesystem
- Appears to Galaxy as just another job runner
- Future direction: workflow engine needed!
- Integration with Gold Allocation Manager
- Essentially fork of Galaxy, required substantial modification to the code base
- Exposes resource limits to Galaxy users, integrates with resource accounting
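The Pulsar pattern described above (no shared filesystem, so inputs and outputs travel with the messages) can be sketched with in-process queues. This is a toy model, not Pulsar's protocol: `queue.Queue` stands in for a message broker, and the message fields and function names are assumptions.

```python
import json
import queue


def submit(job_queue, job_id, command, inputs):
    """Front-end side: ship the command plus input contents in one message."""
    job_queue.put(json.dumps({"id": job_id, "command": command, "inputs": inputs}))


def run_next(job_queue, result_queue):
    """Compute side: take one message, 'run' it, send the outputs back."""
    msg = json.loads(job_queue.get())
    # Stand-in for real execution: count the lines of each staged input.
    outputs = {name: len(data.splitlines()) for name, data in msg["inputs"].items()}
    result_queue.put(json.dumps({"id": msg["id"], "outputs": outputs}))


jobs, results = queue.Queue(), queue.Queue()
submit(jobs, "job-1", "wc -l reads.txt", {"reads.txt": "a\nb\nc\n"})
run_next(jobs, results)
reply = json.loads(results.get())
print(reply)
```

Because both sides only see messages, the front-end can treat the remote resource as just another job runner.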
Galaxy & the users
- Galaxy-User mailing list discontinued
- Replaced by Galaxy Biostar site
- Allows easier searching for commonly answered questions
- Support still appears to be driven by Galaxy Team
- Galaxy usability:
- Trello cards open calling for extensions to track user interaction with Galaxy
- Difficult to work seriously on this topic without detailed human-computer interaction studies
- Galaxy User BoF:
- What is a User? Given the large number of technical users, technical questions tend to dominate "how do I?" space
VISUALISATION
- Galaxy Charts allows in-browser visualisation of tabular data
- supports area, bar, box, line, pie charts and scatter plots
- Work ongoing to support visualisations installable as tools
- Not only Trackster, now also:
- PhyloViz: phylogeny tree viewer
- Circster: Circos-style visualisations
- Sweepster: integrates parameter sweep exploration of genomic regions with Trackster
- Documentation is still patchy
DATA MANAGERS
- Galaxy install has tool-data/ directory that configures shared tool data
- e.g. available genome builds
- Data managers allow configuring this from the web interface
- In the Admin panel (for site-admins only)
- Run in a similar way to normal tools, but the output of a run appears in the Admin panel rather than in the History
- Can be installed from a toolshed
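For context, the shared tool data that data managers maintain is commonly held as tab-separated `.loc` files with `#` comment lines. A minimal reader is sketched below; the column names and the sample entry are assumptions, since the real column layout is defined per data table in the Galaxy configuration.

```python
def parse_loc(text, columns):
    """Parse tab-separated .loc content into a list of dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and comments
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows


# Hypothetical entry for one genome build.
sample = "# value\tdbkey\tname\tpath\nhg19\thg19\tHuman (hg19)\t/data/hg19/hg19.fa\n"
builds = parse_loc(sample, ["value", "dbkey", "name", "path"])
print(builds[0]["name"])
```

A data manager automates adding rows like this one, so that site admins no longer edit the files by hand.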
DATASET COLLECTIONS
- Galaxy core now supports dataset collection types:
- List
- Pair (forward / reverse)
- Allow operation on a collection:
- List can be used as tool input
- Tools that need paired reads can take single pair param
- Support in the code (and via API) is currently further along than support in the UI
- Much work needed
- Filtering on collections, collections as tool outputs, workflow re-run / resume
- Parameter sets (like Sweepster, but more general)
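A conceptual sketch of the two collection types and of mapping a tool over a list. This is plain Python rather than Galaxy's internal model: the `Pair` class, `map_over` helper, `align` stand-in tool and file names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """A paired collection: forward and reverse reads for one sample."""
    forward: str
    reverse: str


def map_over(collection, tool):
    """Run a tool once per element of a list collection, returning a list of outputs."""
    return [tool(element) for element in collection]


def align(pair):
    """Stand-in for a tool that takes a single pair parameter."""
    return f"aligned({pair.forward},{pair.reverse})"


# A list of pairs: one element per sample.
samples = [Pair("s1_R1.fq", "s1_R2.fq"), Pair("s2_R1.fq", "s2_R2.fq")]
outputs = map_over(samples, align)
print(outputs)
```

The output of mapping over a list is itself a list with one element per input, which is why collections as tool outputs and filtering on collections follow naturally as next steps.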
CONCLUSION
- Large-scale Galaxy is clearly on the roadmap for the future
- New workflow scheduler and execution engine
- Continue and extend work on collection types
- Develop UI elements for working at scale
- Galaxy can do many quite different things
- Sometimes by forking Galaxy code (but then you lose community)
- Not all code paths are equal: frequently used code paths get bugs noticed and fixed earlier
- Non-Galaxy approaches continue to proliferate: YABI, bcbio-nextgen, Arvados, ADAM
- Lots to learn from
SANBI GALAXY PLANS
- SANBI committed to working on making Galaxy more capable of expressing the workflows we already use
- Combination of bug fixing and internals enhancement
- One post-doc, one developer (hopefully two soon)
- Key goals align with those of parts of the Galaxy community
- Workflows as first-class objects
- Data collections everywhere
- Workflows scheduled as workflows
- Fine grained execution tracking
- Data handling improvements (is piping possible?)
- New on-disk storage types (Apache Avro?)
- Continue collaboration with Pipelines group