Galaxy Community Conference 2014
Johns Hopkins, Baltimore, USA
conference report by
Peter van Heusden <pvh@sanbi.ac.za>
GCC2014: Social Media? YES!
#GCC2014
website
links to blogs and Storify of Twitter stream
Online spaces key to Galaxy 'management':
IRC: #galaxyproject on irc.freenode.net
Trello boards for Galaxy development and hackathon
Galaxy user support: Galaxy Biostar
Last remaining email holdout:
Galaxy-Dev mailing list (for how long?)
GCC2014: Hackathon
40 people
Large pile of snacks, coffee and drinks
Two (+) days
Outcomes:
Most active 'idea' cards on Trello board:
Docker: wrapping Galaxy tools or Galaxy itself in a Docker container
API: BioBlend, a Python wrapper around the API
Great interaction with fellow developers (core/non-core)
Architecture hacking:
Sub-workflows
Collections as Galaxy tool inputs (and outputs)
Thoughts on federated Galaxy (GalaxyFarm)
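A minimal sketch of what using the BioBlend wrapper mentioned above can look like, assuming the `bioblend` package is installed and a Galaxy server is reachable. The URL and API key in the usage note are placeholders, and `connect` and `list_history_names` are hypothetical helpers written for this illustration.

```python
def connect(url, api_key):
    """Return a BioBlend client for the Galaxy server at `url`."""
    # Imported lazily so the helper below works without bioblend installed.
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def list_history_names(gi):
    """Return the names of the histories visible to the API key's user."""
    # get_histories() returns a list of dicts, each with at least 'id' and 'name'.
    return [history["name"] for history in gi.histories.get_histories()]
```

Usage would be along the lines of `gi = connect("https://galaxy.example.org", "YOUR_API_KEY")` followed by `list_history_names(gi)`.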
GCC2014: GALAXY ARCHITECTURE IDEAS
Current Galaxy architecture still draws on the model of a user clicking 'submit' on a job
At scale, this isn't a good model:
Workflow runner: workflows are run as jobs, therefore the fastest turnaround time is the time to execute a job (typically on a cluster, at least 30 seconds)
Galaxy front-end / back-end splitting
Front-end increasingly written in JavaScript with Backbone, Underscore and Handlebars: more responsive UI
Pulsar (used to be LWR) can run on compute resource, no shared filesystem, communicate with front-end using RabbitMQ (used for Galaxy Main)
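A back-of-the-envelope sketch of why per-job dispatch limits workflow turnaround. It takes the 30 second figure above as an assumed minimum per job and assumes every step of a workflow is dispatched as its own job, one after the other; the function name and step counts are illustrative only.

```python
def minimum_turnaround_seconds(sequential_steps, per_job_overhead_s=30):
    """Lower bound on wall-clock time when each dependent workflow step
    is dispatched as a separate cluster job."""
    return sequential_steps * per_job_overhead_s


for steps in (1, 5, 20):
    print(steps, "steps ->", minimum_turnaround_seconds(steps), "s minimum")
```

Under these assumptions even a workflow of trivial steps pays the dispatch cost once per step, which is the motivation for scheduling workflows as workflows rather than as chains of jobs.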
BCBIO-NEXTGEN ARCHITECTURE
NGS analysis automated (but not by Galaxy)
Disk usage scaling:
Pipelining analysis steps together saves disk I/O
Avoid storing intermediate data: 6x size of final data
bcbio-nextgen scaling driven by need to process 1500 WGS
Parallelisation:
Split variant calling by callable region
"Smart" parallelism driven by knowledge of underlying data
Configurability is the enemy of scaling
Supporting lots of scenarios makes optimisation hard
bcbio-nextgen now available as Docker image and runnable as Galaxy tool
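A minimal sketch of the "split by callable region" idea, not bcbio-nextgen's actual implementation. It assumes regions are half-open `(chrom, start, end)` tuples; `call_variants` is a placeholder standing in for a real variant caller, and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_callable_regions(regions, max_len):
    """Cut each (chrom, start, end) region into chunks of at most max_len bases."""
    chunks = []
    for chrom, start, end in regions:
        pos = start
        while pos < end:
            chunk_end = min(pos + max_len, end)
            chunks.append((chrom, pos, chunk_end))
            pos = chunk_end
    return chunks


def call_variants(region):
    """Placeholder for a real variant caller: reports the region and its size."""
    chrom, start, end = region
    return {"region": region, "bases": end - start}


def call_in_parallel(regions, max_len=1_000_000, workers=4):
    """Process every chunk independently and return results in genome order."""
    chunks = split_callable_regions(regions, max_len)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results can be merged directly.
        return list(pool.map(call_variants, chunks))
```

The parallelism here is driven by the data (the list of callable regions) rather than by a fixed number of slices, which is the point of the "smart" parallelism bullet above.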
STATE OF THE GALAXY
Developments in the toolshed
Investigations in new ways of job running
Galaxy/biostar: Galaxy Q&A
New work on visualisations
Data managers: tools for handling reference data
Data collections (and parameter sweeps)
GALAXY TOOLSHED, 2014 EDITION
Toolshed is replacing hand-editing of tool_conf.xml
Recommended to run local toolshed (SANBI does)
Tool development happens in Mercurial repos
Galaxy server tool updates don't require restarts
Tools can have dependencies:
On other tool repositories (other tools or data types)
On packages (package manager for Galaxy ??)
Huge role for Docker here in future
Galaxy tools can be moved between repositories as .tar.gz
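An illustration of what repository and package dependencies imply for installation order. This is a generic topological sort using Python's standard `graphlib` module (Python 3.9+), not the toolshed's own resolution code, and the repository names are made up.

```python
from graphlib import TopologicalSorter


def install_order(dependencies):
    """Return names ordered so that each one follows everything it depends on.

    `dependencies` maps a repository name to the set of names it requires.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical repositories: a tool wrapper that needs two packages,
# one of which needs a further package.
example = {
    "bwa_wrapper": {"package_bwa", "package_samtools"},
    "package_samtools": {"package_zlib"},
    "package_bwa": set(),
    "package_zlib": set(),
}
print(install_order(example))
```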
NEW WAYS OF JOB RUNNING
Pulsar (ex-LWR) (John Chilton): job-execution environment for when you don't have a shared filesystem
Appears to Galaxy as just another job runner
Future direction: workflow engine needed!
Galaxy on a resource-limited cluster (Nikolay Vazov, UIO):
Integration with Gold Allocation Manager
Essentially a fork of Galaxy; required substantial modification to the code base
Exposes resource limits to Galaxy users, integrates with resource accounting
Galaxy & the users
Galaxy-User mailing list discontinued
Replaced by Galaxy Biostar site
Allows easier searching for commonly answered questions
Support still appears to be driven by Galaxy Team
Galaxy usability:
Trello cards open calling for extensions to track user interaction with Galaxy
Difficult to work seriously on this topic without detailed Human-Computer Interaction studies
Galaxy User BoF:
What is a User? Given the large number of technical users, technical questions tend to dominate "how do I?" space
VISUALISATION
Galaxy Charts
allows in-browser visualisation of tabular data
supports area, bar, box, line, pie charts and scatter plots
Work ongoing to support visualisations installable as tools
Not only Trackster, now also:
PhyloViz: phylogeny tree viewer
Circster: Circos visualisations
Sweepster: integrates parameter sweep exploration of genomic regions with Trackster
Documentation is still patchy
DATA MANAGERS
Galaxy install has tool-data/ directory that configures shared tool data
e.g. available genome builds
Data managers allow configuring this from the web interface
In the Admin panel (for site-admins only)
Run in a similar way to normal tools, but the output of a run isn't in the History; it is visible via the Admin panel instead
Can be installed from a toolshed
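The tables under tool-data/ are typically tab-separated `.loc` files with `#` comment lines. A minimal sketch of reading one, assuming a four-column layout (value, dbkey, name, path); the column names, paths and sample entries are illustrative, since the real columns depend on how each data table is defined.

```python
def parse_loc(text, columns=("value", "dbkey", "name", "path")):
    """Parse the text of a tab-separated .loc file into a list of dicts."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            continue  # ignore lines that do not match the assumed layout
        rows.append(dict(zip(columns, fields)))
    return rows


# Made-up sample content for illustration.
SAMPLE = (
    "# value\tdbkey\tname\tpath\n"
    "hg19\thg19\tHuman (hg19)\t/data/genomes/hg19/hg19.fa\n"
    "\n"
    "mm10\tmm10\tMouse (mm10)\t/data/genomes/mm10/mm10.fa\n"
)
for row in parse_loc(SAMPLE):
    print(row["dbkey"], "->", row["path"])
```

A data manager automates maintaining entries like these, so that site admins no longer edit the files by hand.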
DATASET COLLECTIONS
Galaxy core now supports dataset collection types:
List
Pair (forward / reverse)
Allow operation on a collection:
List can be used as tool input
Tools that need paired reads can take single pair param
Support in the code (and via API) is currently further along than support in the UI
Much work needed
Filtering on collections, collections as tool outputs, workflow re-run / resume
Parameter sets (like Sweepster, but more general)
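A purely conceptual model of the two collection types and of "a list as tool input". Galaxy's own collection classes and API are not shown; `Pair`, `map_over` and `align_pair` are hypothetical names, and the file names are made up.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """A paired collection: forward and reverse reads of one sample."""
    forward: str
    reverse: str


def map_over(tool, collection):
    """Run `tool` once per element of a list collection, keeping the order."""
    return [tool(element) for element in collection]


def align_pair(pair):
    """Placeholder for a tool that takes a single pair parameter."""
    return f"aligned({pair.forward}, {pair.reverse})"


# A list of pairs: one element per sample.
samples = [
    Pair("s1_R1.fastq", "s1_R2.fastq"),
    Pair("s2_R1.fastq", "s2_R2.fastq"),
]
outputs = map_over(align_pair, samples)
print(outputs)
```

The output of mapping over a list is itself a list, which is why collections as tool outputs appear among the outstanding work above.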
CONCLUSION
Large-scale Galaxy is clearly on the roadmap for the future
New workflow scheduler and execution engine
Continue and extend work on collection types
Develop UI elements for working at scale
Galaxy can do many quite different things
Sometimes by forking Galaxy code (but then you lose community)
Not all code paths are equal: frequently used code paths get bugs noticed and fixed earlier
Non-Galaxy approaches continue to proliferate: YABI, bcbio-nextgen, Arvados, ADAM
Lots to learn from
SANBI GALAXY PLANS
SANBI committed to working on making Galaxy more capable of expressing the workflows we already use
Combination of bug fixing and internals enhancement
One post-doc, one developer (hopefully two soon)
Key goals align with some of the Galaxy community's:
Workflows as first-class objects
Data collections everywhere
Workflows scheduled as workflows
Fine grained execution tracking
Data handling improvements (is piping possible?)
New on-disk storage types (Apache Avro?)
Continue collaboration with Pipelines group