1998–2004

Trondheim: A lot of coffee, cables, coding, ...

LAN party, TG99

2000: ..a lot more Web!

Working door-to-door with groundbreaking Web start-ups

Linpro made open source Web solutions

Håkon Wium Lie, later CTO of Opera, invented CSS

Trolltech made Qt; the Qt-based KHTML engine became WebKit, ancestor of today's Safari, Chrome and Edge engines

Waldemar Thranes gate 98, Oslo

Web standards

2004–2006

Travelling and reproducibility

Trondheim → Birmingham → Manchester

As a computer science MSc student
I need to reproduce a neural network model of a brain structure

so that I can use it in a robot that can learn sequences to navigate a maze

AI before it became a buzzword (again)

Computational reproducibility (~2004)

Paper: No code, only equations.

Implementation details missing (e.g. noise)

Can you really trust a C++ reimplementation?

 

def calc_GP(self, i):
  # Activation of unit i: weighted input from STN units,
  # inhibition from the striatum (STR), plus uniform noise.
  total = 0.0  # avoid shadowing the built-in sum()
  for j in range(self.inputs * 2):
    total += self.w[i, j] * self.STN[j]
  noise = random.uniform(-0.25, 0.25)  # the detail the paper left out
  result = total - self.effect * self.STR[i] + noise
  return self.sigmoid(result)
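The unseeded noise call above is exactly the kind of omitted detail that breaks reproducibility: two runs of the same code give different numbers. A minimal sketch (with a hypothetical `noisy_activation` helper, not from the original model) of how seeding the random source restores bit-for-bit repeatability:

```python
import random

def noisy_activation(weighted_sum, seed=None):
    """Toy illustration: the same computation differs between runs
    unless the noise source is explicitly seeded."""
    rng = random.Random(seed)
    noise = rng.uniform(-0.25, 0.25)
    return weighted_sum + noise

# Seeded: two runs agree exactly, so results can be reproduced.
a = noisy_activation(1.0, seed=42)
b = noisy_activation(1.0, seed=42)
assert a == b
```

Recording the seed alongside the code is a cheap first step; a C++ reimplementation would still need the same noise distribution and range to match.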

First time travel

It is the view of this author that the computer science community has a tendency to prefer mathematics in papers even when a source code extract could provide a better explanation: more understandable, and known to be correct.

The lack of included source code leads to incomplete publications that can be hard or impossible to reproduce. This problem is, to a great extent, more present in computer science than in other sciences.

Computer Science is still behind other disciplines in thinking about reproducibility and data sharing

2006–

Time travels in Manchester

2008

Needed domain name, got a new surname

http://soiland-reyes.com/

Carole Goble, Rishi, Donal

Stuart

Paolo

Mannie

Sean

Alex

Rob

Antoon

Jits

Alan

Katy

Pinar

Matt

Alasdair

Robert

Finn

Khalid

..before it became the "mega lab"

Shoaib

Workflow systems

Can we capture and share the computational method?

Taverna (2003–2020)

Workflows combine computational tools almost like LEGO blocks
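The LEGO analogy can be sketched in plain Python (all step names here are hypothetical, for illustration only): each step is a self-contained brick, and the workflow engine's job is to wire outputs to inputs, then add scheduling, caching and provenance on top.

```python
# Hypothetical three-step pipeline: each function is one "brick".
def fetch(accession):    # step 1: obtain raw data for an identifier
    return f"raw({accession})"

def clean(raw):          # step 2: normalise the raw data
    return f"clean({raw})"

def analyse(data):       # step 3: produce a result
    return f"result({data})"

# The workflow is just the composition of the steps; real systems
# like Taverna or Galaxy record this graph so it can be shared,
# re-run and inspected.
value = "P12345"
for step in (fetch, clean, analyse):
    value = step(value)
print(value)  # result(clean(raw(P12345)))
```

Because each brick only sees its declared inputs, steps can be swapped or recombined without touching the others.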

Permalink: https://s.apache.org/existing-workflow-systems

Computational Data Analysis Workflow Systems

An incomplete list

Please add new entries at the bottom. Try to include links to website and/or source code and ideally DOI for publications.

In addition to this list, actively developed free/open-source systems should be registered at https://workflows.community/systems

See also: https://github.com/pditommaso/awesome-pipeline

  1. Arvados - CWL-based distributed computing platform for data analysis on massive data sets. https://arvados.org/ https://github.com/arvados/arvados
  2. Apache Taverna http://www.taverna.org.uk/ https://taverna.incubator.apache.org/
  3. Galaxy http://galaxyproject.org/
  4. SHIWA https://www.shiwa-workflow.eu/
  5. Apache Oozie https://oozie.apache.org/
  6. DNANexus https://wiki.dnanexus.com/API-Specification-v1.0.0/IO-and-Run-Specifications https://documentation.dnanexus.com/developer/api/running-analyses/io-and-run-specifications https://wiki.dnanexus.com/API-Specification-v1.0.0/Workflows-and-Analyses https://documentation.dnanexus.com/developer/workflows
  7. BioDT http://www.biodatomics.com/ archived at https://web.archive.org/web/20180609011656/http://www.biodatomics.com/
  8. Agave http://agaveapi.co/live-docs/ https://docs.agaveapi.com/
  9. DiscoveryEnvironment http://www.iplantcollaborative.org/ci/discovery-environment
  10. Wings http://www.wings-workflows.org/
  11. Knime https://www.knime.org/
  12. make, rake, drake, ant, scons & many others. Software development relies heavily on tools to manage workflows related to compiling and packaging applications. For the most part these are file based and usually run on a single node, usually supporting parallel steps (make -j) and in some cases able to dispatch build steps to other machines https://code.google.com/p/distcc/ https://github.com/Factual/drake
  13. Snakemake https://snakemake.github.io
  14. BPipe http://bpipe.org http://docs.bpipe.org/ https://doi.org/10.1093/bioinformatics/bts167
  15. Ruffus https://github.com/cgat-developers/ruffus
  16. NextFlow http://nextflow.io
  17. Luigi. Python package that helps you build complex pipelines of batch jobs http://github.com/spotify/luigi https://luigi.readthedocs.io
  18. SciLuigi. Helper library built on top of Luigi to ease development of Scientific workflows in Luigi: http://github.com/pharmbio/sciluigi
  19. Luigi Analysis Workflow (LAW) https://github.com/riga/law
  20. GATK Queue https://www.broadinstitute.org/gatk/guide/topic?name=queue
  21. Yabi https://ccg.murdoch.edu.au/yabi
  22. seqware Workflows are written in Java and executed using the Oozie Workflow Engine on Hadoop or SGE clusters. Uses Zip64 files to group the workflow definition file, workflow itself, sample settings, and data dependencies in a single file that can be exchanged between SeqWare users or archived. https://seqware.github.io/ https://seqware.github.io/docs/6-pipeline/
  23. Ketrew https://github.com/hammerlab/ketrew
  24. Pegasus http://pegasus.isi.edu/
  25. Apache Airflow https://github.com/apache/airflow
  26. Couler https://github.com/couler-proj/couler - Unified interface for constructing and managing workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
  27. Cosmos https://cosmos.hms.harvard.edu/documentation/index.html https://doi.org/10.1093/bioinformatics/btu385 Cosmos2: https://github.com/LPM-HMS/COSMOS2 http://cosmos.hms.harvard.edu/COSMOS2/
  28. Pinball https://github.com/pinterest/pinball
  29. bcbio https://bcbio-nextgen.readthedocs.org/en/latest/
  30. Chronos https://github.com/mesos/chronos
  31. Azkaban https://azkaban.github.io/
  32. Apache NiFi https://nifi.apache.org/docs/nifi-docs/html/overview.html
  33. flowr (R-based) http://docs.flowr.space/ https://github.com/sahilseth/flowr
  34. Mistral https://github.com/arteria-project https://wiki.openstack.org/wiki/Mistral#What_is_Mistral.3F https://docs.openstack.org/mistral/latest/user/wf_lang_v2.html
  35. nipype http://nipy.org/nipype/
  36. End of Day https://github.com/joestubbs/endofday
  37. BioDSL https://github.com/maasha/BioDSL
  38. bds (BigDataScript) https://pcingola.github.io/bds/ https://doi.org/10.1093/bioinformatics/btu595
  39. Omics Pipe: uses Ruffus http://sulab.scripps.edu/omicspipe/
  40. Ensembl Hive https://github.com/Ensembl/ensembl-hive
  41. QuickNGS http://bifacility.uni-koeln.de/quickngs/web
  42. GenePattern http://www.broadinstitute.org/cancer/software/genepattern/
  43. Chipster http://chipster.csc.fi/
  44. The Genome Modeling System https://github.com/genome/gms
  45. Cuneiform, A Functional Workflow Language https://github.com/joergen7/cuneiform http://www.cuneiform-lang.org/ https://doi.org/10.1017/S0956796817000119
  46. Anvaya https://doi.org/10.1142/s0219720012500060 http://webapp.cabgrid.res.in/biocomp/Anvaya/ANVAYA_Main.html#HOWTO_INSTALL_ANVAYA
  47. Makeflow http://ccl.cse.nd.edu/software/makeflow/
  48. Apache Airavata http://airavata.apache.org/
  49. Pyflow https://github.com/Illumina/pyflow
  50. Cluster Flow http://clusterflow.io
  51. Unipro UGENE http://ugene.net/ https://doi.org/10.7717/peerj.644
  52. CloudSlang http://www.cloudslang.io/
  53. Stacks http://catchenlab.life.illinois.edu/stacks/
  54. Leaf http://www.francesconapolitano.it/leaf/index.html
  55. omictools http://omictools.com/
  56. Job Description Language. The Job Description Language, JDL, is a high-level, user-oriented language based on Condor classified advertisements for describing jobs and aggregates of jobs such as Directed Acyclic Graphs and Collections. https://edms.cern.ch/ui/file/590869/1/WMS-JDL.pdf
  57. YAWL yet another workflow language https://doi.org/10.1016/j.is.2004.02.002 http://www.yawlfoundation.org/
  58. Triquetrum https://projects.eclipse.org/projects/technology.triquetrum https://github.com/eclipse/triquetrum/
  59. Kronos https://github.com/jtaghiyar/kronos
  60. qsubsec https://github.com/alastair-droop/qsubsec https://doi.org/10.1093/bioinformatics/btv698
  61. YesWorkflow http://yesworkflow.org
  62. gwf - Grid WorkFlow https://github.com/gwforg/gwf http://gwf.readthedocs.io/
  63. Fireworks https://github.com/materialsproject/fireworks https://doi.org/10.1002/cpe.3505
  64. NGLess: NGS with less work http://ngless.rtfd.io
  65. pypipegraph https://github.com/TyberiusPrime/pypipegraph
  66. Cromwell https://github.com/broadinstitute/cromwell
  67. Dagobah - Simple DAG-based job scheduler in Python. https://github.com/thieman/dagobah
  68. sushi https://github.com/uzh/sushi
  69. Clinical Trial Processor - A program for processing clinical trials data. http://mircwiki.rsna.org/index.php?title=MIRC_CTP
  70. Noodles http://nlesc.github.io/noodles/
  71. Swift & Swift/T http://swift-lang.org/main/ http://swift-lang.org/Swift-T/
  72. Consonance (runs SeqWare & CWL) https://github.com/Consonance/consonance/wiki
  73. Dog https://github.com/dogtools/dog
  74. Produce https://github.com/texttheater/produce
  75. LONI Pipeline http://pipeline.loni.usc.edu/
  76. Cpipe https://github.com/MelbourneGenomics/cpipe
  77. AWE https://github.com/MG-RAST/AWE
  78. (Py)COMPSs https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/
  79. KLIKO https://github.com/gijzelaerr/kliko
  80. SoS Workflow https://github.com/vatlab/SoS https://vatlab.github.io/sos-docs/ https://doi.org/10.1093/bioinformatics/bty405 https://doi.org/10.1371/journal.pcbi.1006843
  81. XNAT Pipeline Engine https://wiki.xnat.org/display/XNAT/Pipeline+Engine https://wiki.xnat.org/display/XNAT/XNAT+Pipeline+Development+Schema
  82. Metapipe https://github.com/TorkamaniLab/metapipe
  83. OCCAM (Open Curation for Computer Architecture Modeling) https://occam.cs.pitt.edu/
  84. Copernicus http://www.copernicus-computing.org
  85. iRODS Rule Language https://github.com/samuell/irods-cheatsheets/blob/master/irods-rule-lang-full-guide.md
  86. VisTrails https://www.vistrails.org
  87. Bionode Watermill https://github.com/bionode/bionode-watermill
  88. BIOVIA Pipeline Pilot Overview http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/
  89. Dagman A meta-scheduler for HTCondor https://research.cs.wisc.edu/htcondor/dagman/dagman.html
  90. UNICORE https://www.unicore.eu/docstore/workflow-7.6.0/workflow-manual.html#wf_dialect
  91. Toil (A scalable, efficient, cross-platform and easy-to-use workflow engine in pure Python) https://github.com/BD2KGenomics/toil
  92. Cylc (a workflow engine for cycling systems) https://cylc.github.io/
  93. Autodesk Cloud Compute Canon https://github.com/Autodesk/cloud-compute-cannon
  94. Civet https://github.com/TheJacksonLaboratory/civet
  95. Cumulus https://github.com/Kitware/cumulus
  96. High-performance integrated virtual environment (HIVE) https://hive.biochemistry.gwu.edu
  97. Cloudgene http://cloudgene.uibk.ac.at/cloudgene-yaml
  98. FASTR https://bitbucket.org/bigr_erasmusmc/fastr/ http://fastr.readthedocs.io/en/stable/
  99. BioMake https://github.com/evoldoers/biomake https://doi.org/10.1101/093245
  100. remake https://github.com/richfitz/remake
  101. SciFloware http://www-sop.inria.fr/members/Didier.Parigot/pmwiki/Scifloware/
  102. OpenAlea https://openalea.rtfd.io https://hal.archives-ouvertes.fr/hal-01166298/file/openalea-PradalCohen-Boulakia.pdf
  103. COMBUSTI/O https://github.com/jarlebass/combustio http://hdl.handle.net/10037/9361
  104. BioCloud https://github.com/ccwang002/biocloud-server-kai https://doi.org/10.6342/NTU201601295
  105. Triana http://www.trianacode.org/ (website seems to have been taken over by SEO)
  106. Kepler https://kepler-project.org/
  107. Anduril http://anduril.org/site/
  108. dgsh http://www.dmst.aueb.gr/dds/sw/dgsh/
  109. EDGE bioinformatics: Empowering the Development of Genomics Expertise https://bioedge.lanl.gov/edge_ui/ http://edge.readthedocs.io/ https://lanl-bioinformatics.github.io/EDGE/
  110. Pachyderm http://pachyderm.io/ http://pachyderm.readthedocs.io/en/stable/advanced/advanced.html
  111. Digdag https://www.digdag.io/
  112. Agua / Automated Genomics Utilities Agent http://aguadev.org
  113. BioDepot Workflow Builder (BwB) https://github.com/BioDepot/BioDepot-workflow-builder https://doi.org/10.1101/099010
  114. IMP: a pipeline for reproducible reference-independent integrated metagenomic and metatranscriptomic analyses http://r3lab.uni.lu/web/imp/ https://doi.org/10.1186/s13059-016-1116-8
  115. Butler https://github.com/llevar/butler
  116. adage / yadage https://github.com/diana-hep/adage https://github.com/diana-hep/yadage
  117. HI-WAY: Execution of Scientific Workflows on Hadoop YARN https://github.com/marcbux/Hi-WAY https://openproceedings.org/2017/conf/edbt/paper-248.pdf
  118. OpenMOLE https://github.com/openmole/openmole https://www.openmole.org/ https://doi.org/10.3389/fninf.2017.00021
  119. Biopet https://github.com/biopet/biopet
  120. Nephele https://nephele.niaid.nih.gov/
  121. TOPPAS https://doi.org/10.1021/pr300187f
  122. SBpipe https://pdp10.github.io/sbpipe/ https://github.com/pdp10/sbpipe https://doi.org/10.1186/s12918-017-0423-3
  123. Dray http://dray.it/ (website seems to have been taken over by SEO)
  124. GenomeVIP https://github.com/ding-lab/GenomeVIP https://doi.org/10.1101/gr.211656.116
  125. GridSAM https://sourceforge.net/projects/gridsam/
  126. Roddy https://github.com/eilslabs/Roddy
  127. SciFlo (historical; doesn't seem to be maintained anymore) https://web.archive.org/web/20161118011409/https://sciflo.jpl.nasa.gov/SciFloWiki/FrontPage
  128. GNU Guix Workflow Language https://git.roelj.com/guix/gwl.git#gnu-guix-workflow-language-extension https://github.com/UMCUGenetics/guix-workflows/blob/master/umcu/workflows/rnaseq.scm
  129. Porcupine https://timvanmourik.github.io/Porcupine/
  130. Parsl (a Parallel Scripting Library for Python) http://parsl-project.org https://doi.org/10.1145/3307681.3325400
  131. ECFLOW (Workflow primarily for Meteorological Applications) https://software.ecmwf.int/wiki/display/ECFLOW/ecflow+home
  132. Ophidia http://ophidia.cmcc.it/
  133. WebLicht https://weblicht.sfs.uni-tuebingen.de/
  134. GATE Cloud https://cloud.gate.ac.uk/
  135. SCIPION http://scipion.cnb.csic.es/m/home/ https://github.com/I2PC/scipion/wiki/Creating-a-Protocol
  136. Ergatis http://ergatis.sourceforge.net/
  137. TIGR "Workflow" https://sourceforge.net/projects/tigr-workflow/ http://tigr-workflow.sourceforge.net/
  138. Archivematica https://wiki.archivematica.org/Main_Page (A preservation workflow system that implements the ISO-OAIS standard using gearman/MCP)
  139. Martian http://martian-lang.org/about/
  140. BioMAJ http://genouest.github.io/biomaj/
  141. Conveyor http://conveyor.cebitec.uni-bielefeld.de (retired). https://doi.org/10.1093/bioinformatics/btr040
  142. Biopipe http://www.biopipe.org (appears to be defunct) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403782/
  143. Wildfire http://wildfire.bii.a-star.edu.sg/ https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-69
  144. BioWBI https://bioinformatics.hsanmartino.it/bits_library/library/00079.pdf
  145. BioWMS https://bioinformatics.hsanmartino.it/bits_library/library/00568.pdf
  146. BioMoby http://biomoby.open-bio.org/ https://doi.org/10.1186/1471-2105-7-523
  147. SIBIOS http://ieeexplore.ieee.org/document/1309094/
  148. NGSANE https://github.com/BauerLab/ngsane https://doi.org/10.1093/bioinformatics/btu036
  149. Pwrake https://github.com/misshie/Workflows https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3180464/
  150. Nesoni https://github.com/Victorian-Bioinformatics-Consortium/nesoni
  151. Skam http://skam.sourceforge.net/skam-intro.html
  152. TREVA http://bioinformatics.petermac.org/treva/ https://doi.org/10.1371/journal.pone.0095217
  153. EGene https://www.semanticscholar.org/paper/EGene-a-configurable-pipeline-generation-system-fo-Durham-Kashiwabara/4c0656195b5efcdd3aa7bdcb55fc95a957c150aa https://doi.org/10.1093/bioinformatics/btu366
  154. WEP https://bioinformatics.cineca.it/wep/ https://doi.org/10.1186/1471-2105-14-S7-S11
  155. Microbase http://www.microbasecloud.com/
  156. e-Science Central http://www.esciencecentral.co.uk/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3538293/
  157. Cyrille2 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-96
  158. PaPy https://code.google.com/archive/p/papy/ https://doi.org/10.1186/1471-2105-12-61
  159. JobCenter https://github.com/yeastrc/jobcenter https://doi.org/10.1186/1751-0473-7-8
  160. CoreFlow https://doi.org/10.1016/j.jprot.2014.01.023
  161. dynamic-pipeline https://code.google.com/archive/p/dynamic-pipeline/
  162. XiP http://xip.hgc.jp/wiki/en/Main_Page https://doi.org/10.1093/bioinformatics/bts630
  163. Eoulsan http://www.outils.genomique.biologie.ens.fr/eoulsan/ https://doi.org/10.1093/bioinformatics/bts165
  164. CloudDOE http://clouddoe.iis.sinica.edu.tw/
  165. BioPig https://github.com/JGI-Bioinformatics/biopig https://doi.org/10.1093/bioinformatics/btt528
  166. SeqPig https://github.com/HadoopGenomics/SeqPig https://doi.org/10.1093/bioinformatics/btt601
  167. zymake http://www-personal.umich.edu/~ebreck/code/zymake/
  168. JMS https://github.com/RUBi-ZA/JMS https://doi.org/10.1371/journal.pone.0134273
  169. CLC Genomics Workbench https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/
  170. NG6 http://ng6.toulouse.inra.fr/ https://doi.org/10.1186/1471-2164-13-462
  171. VIBE http://www.incogen.com/vibe/
  172. WDL (Workflow Description Language) https://github.com/broadinstitute/wdl
  173. SciFlow https://github.com/kaizhang/SciFlow (not to be confused with SciFloware and SciFlo).
  174. Bioshake https://github.com/PapenfussLab/bioshake https://doi.org/10.7717/peerj.7223
  175. SciPipe http://scipipe.org
  176. Kapacitor / TICKscripts https://docs.influxdata.com/kapacitor/v1.3/tick/
  177. AiiDA: Automated Interactive Infrastructure and Database for Computational Science http://www.aiida.net/ https://doi.org/10.1016/j.commatsci.2015.09.013
  178. Reflow: a language and runtime for distributed, integrated data processing in the cloud https://github.com/grailbio/reflow
  179. Resolwe: an open source dataflow package for Django framework https://github.com/genialis/resolwe
  180. Yahoo! Pipes (historical) https://en.wikipedia.org/wiki/Yahoo!_Pipes
  181. Walrus https://github.com/fjukstad/walrus
  182. Apache Beam https://beam.apache.org/
  183. CLOSHA https://closha.kobic.re.kr/ https://www.bioexpress.re.kr/go_tutorial http://docplayer.net/19700397-Closha-manual-ver1-1-kobic-korean-bioinformation-center-kogun82-kribb-re-kr-2016-05-08-bioinformatics-workflow-management-system-in-bio-express.html https://doi.org/10.1186/s12859-018-2019-3
  184. WopMars https://github.com/aitgon/wopmars http://wopmars.readthedocs.io/
  185. flowing-clj https://github.com/stain/flowing-clj
  186. Plumbing and Graph https://github.com/plumatic/plumbing
  187. LabView http://www.ni.com/en-us/shop/labview.html
  188. MyOpenLab http://myopenlab.org/
  189. Max/MSP https://cycling74.com/products/max/
  190. NoFlo https://noflojs.org/
  191. Flowstone http://www.dsprobotics.com/flowstone.html
  192. HyperLoom https://code.it4i.cz/ADAS/loom
  193. Dask http://dask.pydata.org/en/latest/ https://github.com/dask/dask
  194. Stimela https://github.com/SpheMakh/Stimela https://github.com/SpheMakh/Stimela/wiki https://www.acru.ukzn.ac.za/~cosmosafari2017/wp-content/uploads/2017/02/makhathini.pdf
  195. JTracker https://jtracker.io/ https://github.com/jtracker-io
  196. PipelineDog http://pipeline.dog/ https://github.com/zhouanbo/pipelinedog https://doi.org/10.1093/bioinformatics/btx759
  197. DALiuGE https://arxiv.org/abs/1702.07617 https://github.com/ICRAR/daliuge https://daliuge.readthedocs.io/
  198. Overseer https://github.com/framed-data/overseer
  199. Squonk https://squonk.it/
  200. GC3Pie https://github.com/uzh/gc3pie
  201. Fractalide https://github.com/fractalide/fractalide
  202. TOGGLe http://toggle.southgreen.fr/ https://doi.org/10.1101/245480
  203. Askalon http://www.askalon.org
  204. Eclipse ICE (The Integrated Computational Environment) https://www.eclipse.org/ice
  205. Sandia Analysis Workbench (SAW) http://www.sandia.gov/saw/
  206. dispel4py https://github.com/dispel4py/dispel4py
  207. Jobber https://pypi.python.org/pypi/Jobber/0.1.4
  208. NeatSeq-Flow http://neatseq-flow.readthedocs.io/
  209. S4M https://bitbucket.org/uqokorn/s4m_base/wiki/Home
  210. Loom http://med.stanford.edu/gbsc/loom.html https://github.com/StanfordBioinformatics/loom http://loom.readthedocs.io/en/latest/templates.html
  211. Watchdog - XML Schema-based bioinformatics workflow system https://www.bio.ifi.lmu.de/software/watchdog/ https://doi.org/10.1186/s12859-018-2107-4 https://github.com/klugem/watchdog
  212. phpflo https://github.com/phpflo/phpflo
  213. BASTet: Berkeley Analysis and Storage Toolkit https://openmsi.nersc.gov/openmsi/client/bastet.html https://biorack.github.io/BASTet/ https://doi.org/10.1109/TVCG.2017.2744479
  214. Tavaxy: Pattern based workflow system for the bioinformatics domain http://www.tavaxy.org/
  215. Ginflow: Decentralised adaptive workflow engine https://ginflow.inria.fr/
  216. SciApps: A cloud-based platform for reproducible bioinformatics workflows https://doi.org/10.1093/bioinformatics/bty439 https://www.sciapps.org/
  217. Stoa: Script Tracking for Observational Astronomy https://github.com/petehague/Stoa
  218. Collective Knowledge (CK) framework http://cknowledge.org/
  219. QosCosGrid (QCG) http://www.qoscosgrid.org/ http://www.qoscosgrid.org/trac/qcg-broker/wiki/qcg-advanced-client%20
  220. High-Throughput Binding Affinity Calculator (HTBAC) https://htbac.readthedocs.io/en/latest/ https://github.com/radical-cybertools/htbac https://arxiv.org/abs/1801.01174
  221. BioWorkbench (Swift-based) https://arxiv.org/abs/1801.03915 https://github.com/mmondelli/bioworkbench
  222. ENVI Task Engine https://gbdxdocs.digitalglobe.com/docs/envi-task-engine https://www.harrisgeospatial.com/Learn/Whitepapers/TabId/2359/ArtMID/10212/ArticleID/17299/Workflow-Tools-in-ENVI.aspx https://envi-py-engine.readthedocs.io/en/latest/index.html
  223. Pypeline https://github.com/cgarciae/pypeln
  224. mpipe http://vmlaker.github.io/mpipe/
  225. idseq-dag https://github.com/chanzuckerberg/idseq-dag
  226. Piper (based upon GATK Queue) https://github.com/NationalGenomicsInfrastructure/piper
  227. Apache Object Oriented Data Technology (OODT) http://oodt.apache.org/
  228. JX Workflow (DSL for Makeflow) https://ccl.cse.nd.edu/software/manuals/jx-quick.html http://ccl.cse.nd.edu/research/papers/jx-escience-2018.pdf
  229. The Adaptable IO System (ADIOS); applications that use ADIOS can be orchestrated into a workflow http://csmd.ornl.gov/adios
  230. GenPipes https://bitbucket.org/mugqic/genpipes https://doi.org/10.1101/459552
  231. Argo https://argoproj.github.io/ https://github.com/argoproj/argo https://github.com/argoproj/argo/blob/master/examples/README.md
  232. Reana https://reana.readthedocs.io/en/latest/ https://github.com/reanahub/reana
  233. Cuisine Framework https://www.astron.nl/~renting/cuisine.html
  234. Niassa https://github.com/oicr-gsi/niassa https://oicr-gsi.github.io/niassa-docs/
  235. pypeFLOW https://github.com/PacificBiosciences/pypeFLOW
  236. Tiny Cloud Engine http://ka.cb.k.u-tokyo.ac.jp/tce/
  237. Xbowflow https://github.com/ChrisSuess/Project-Xbow/tree/master/xbowflow
  238. AdaptiveMd https://github.com/markovmodel/adaptivemd
  239. Meshroom https://github.com/alicevision/meshroom
  240. LSST Data Management https://github.com/lsst/pipe_base
  241. CGAT-core https://github.com/cgat-developers/cgat-core
  242. Prefect https://docs.prefect.io/
  243. Apache SCXML engine https://commons.apache.org/proper/commons-scxml/guide/core-engine.html https://commons.apache.org/proper/commons-scxml/guide/scxml-documents.html
  244. IceProd https://github.com/WIPACrepo/iceprod
  245. AnADAMA2 http://huttenhower.sph.harvard.edu/anadama2
  246. Luna https://luna-lang.org/
  247. Passerelle https://code.google.com/archive/a/eclipselabs.org/p/passerelle
  248. Kurator-Akka https://github.com/kurator-org/kurator-akka
  249. Jug https://doi.org/10.5334/jors.161
  250. Node-RED https://nodered.org/
  251. Databolt Flow https://github.com/d6t/d6tflow
  252. Frictionless Data Package Pipelines https://github.com/frictionlessdata/datapackage-pipelines
  253. DataFlows https://github.com/datahq/dataflows
  254. Volcano https://github.com/volcano-sh/volcano
  255. DataJoint https://datajoint.io/
  256. DIRAC3 (LHCb grid software) https://doi.org/10.1088/1742-6596/219/6/062029
  257. Orange https://orange.biolab.si/ http://jmlr.org/papers/v14/demsar13a.html
  258. Ensemble Toolkit (EnTK) https://radicalentk.readthedocs.io/en/latest/entk.html https://arxiv.org/abs/1602.00678v2
  259. BioQueue http://www.bioqueue.org/ https://github.com/liyao001/BioQueue https://doi.org/10.1093/bioinformatics/btx403
  260. mlr3pipelines https://mlr3pipelines.mlr-org.com/
  261. Kedro - Production-Ready Data & ML Pipelines https://github.com/quantumblacklabs/kedro
  262. DATAVIEW - DATAVIEW is a big data workflow management system. https://github.com/shiyonglu/DATAVIEW
  263. SecDATAVIEW - SecDATAVIEW is a security oriented big data workflow management system compatible with the heterogeneous computing environments. https://github.com/shiyonglu/SecDATAVIEW
  264. Giraffe Tools https://giraffe.tools/
  265. CERAMICCA - Cloud Engine Resource for Accelerated Medical Image Computing for Clinical Applications: https://ceramicca.ensc.sfu.ca
  266. uap - Universal Analysis Pipeline. A workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. https://github.com/yigbt/uap
  267. signac https://signac.io/ https://github.com/glotzerlab/signac-flow
  268. cwltool - reference implementation of Common Workflow Language (CWL) https://github.com/common-workflow-language/cwltool/
  269. CWLEXEC - CWL executor for IBM Spectrum LSF clusters https://github.com/IBMSpectrumComputing/cwlexec
  270. drmr https://github.com/ParkerLab/drmr/ https://drmr.readthedocs.io/
  271. Autosubmit - a versatile tool to manage Weather and Climate Experiments in diverse Supercomputing Environments. https://www.bsc.es/research-and-development/software-and-apps/software-list/autosubmit https://autosubmit.readthedocs.io/en/latest/
  272. JUDI https://doi.org/10.1093/bioinformatics/btz956 https://github.com/ncbi/JUDI
  273. Sumatra - a tool for managing and tracking projects based on numerical simulation or analysis, with the aim of supporting reproducible research https://neuralensemble.org/sumatra/
  274. Netflix Conductor: https://netflix.github.io/conductor/
  275. Pipengine https://doi.org/10.21105/joss.00341 https://github.com/fstrozzi/bioruby-pipengine
  276. MyQueue https://doi.org/10.21105/joss.01844 https://gitlab.com/myqueue/myqueue
  277. Drake R package https://doi.org/10.21105/joss.00550 https://github.com/ropensci/drake (superseded by targets)
  278. MaDaTS https://doi.org/10.21105/joss.00830 https://github.com/dghoshal-lbl/madats
  279. Cadence https://cadenceworkflow.io/
  280. Merlin https://merlin.readthedocs.io/
  281. Janis https://janis.readthedocs.io/en/latest/index.html
  282. AlphaSQL https://github.com/Matts966/alphasql
  283. zeebe https://github.com/zeebe-io/zeebe
  284. durabletask https://github.com/Azure/durabletask
  285. Illumina State Language https://iap-docs.readme.io/docs/wes_illumina-states-language
  286. Moteur http://www.i3s.unice.fr/~johan/publis/MOTEUR-poster-A4.pdf https://indico.cern.ch/event/286/contributions/654974/attachments/530521/731563/userForumEGEE_MOTEUR.pdf
  287. SimStack http://www.simstack.de/
  288. Maestro Workflow Conductor https://github.com/LLNL/maestrowf
  289. HyWare https://doi.org/10.1007/s41060-020-00237-x
  290. HyperFlow https://github.com/hyperflow-wms/
  291. BRANE Framework https://onnovalkering.github.io/brane/
  292. ApolloWF https://apollowf.github.io/
  293. IS-EPOS Platform https://ieeexplore.ieee.org/document/9308147 https://tcs.ah-epos.eu/
  294. pyinvoke: Pythonic task management & command execution, uses python as workflow language. http://www.pyinvoke.org/
  295. targets R package https://cran.r-project.org/package=targets https://doi.org/10.21105/joss.02959
  296. Compi https://doi.org/10.7717/peerj-cs.593 https://github.com/sing-group/compi https://www.sing-group.org/compi
  297. TriggerFlow: Event-based Orchestration of Serverless Workflows https://github.com/triggerflow/triggerflow
  298. Google Cloud Workflows: Orchestrate and automate Google Cloud and HTTP-based API services with serverless workflows. https://cloud.google.com/workflows/docs
  299. PanDA Workflow Management System: https://doi.org/10.1051/epjconf/201921403050 https://doi.org/10.1051/epjconf/201610801003 https://github.com/PanDAWMS
  300. Harvester: https://github.com/HSF/harvester
  301. BD-Processor https://github.com/big-data-processor/bd-processor/
  302. redun (yet another redundant workflow engine) https://github.com/insitro/redun
  303. pyiron (The materials science IDE) https://pyiron.org/
  304. looper (pipeline submitting engine) https://github.com/pepkit/looper
  305. dagster (Python based data orchestration platform) https://dagster.io/
  306. StackStorm (Devops automation engine) https://stackstorm.com/
  307. Geoweaver (compose and execute full-stack deep learning workflows) https://esipfed.github.io/Geoweaver/
  308. Popper: Container-native task automation engine https://github.com/getpopper/popper
  309. Cloud Build: Build, test, and deploy on our serverless CI/CD platform https://cloud.google.com/build
  310. Task/Taskfile: A task runner / simpler Make alternative written in Go https://github.com/go-task/task https://taskfile.dev/
  311. pypyr: task runner for automation pipelines; script sequential workflow steps in YAML, with conditional execution, loops, error handling & retries https://pypyr.io/
  312. SimTool/Sim2Ls: Jupyter notebook-based pipelines of Simulation Tools for the HUBzero platform led by nanoHUB https://github.com/hubzero/simtool https://simtool.readthedocs.io/ https://doi.org/10.1371/journal.pone.0264492
  313. SideIO: A Side I/O system framework for hybrid scientific workflow (no project/source code available) https://doi.org/10.1016/j.jpdc.2016.07.001
  314. Flyte https://flyte.org/
  315. StreamFlow https://streamflow.di.unito.it/
  316. Jupyter Workflow https://jupyter-workflow.di.unito.it/
  317. Nnodes: a simple workflow manager for Python functions and command line tools https://github.com/icui/nnodes
  318. Orchest: A GUI for developing, running and managing container workflows https://github.com/orchest/orchest
  319. Wasmflow: platform for building applications out of WebAssembly code containers https://github.com/wasmflow/wasmflow https://wasmflow.com/
  320. HyperShell: cross-platform, high-performance computing utility for processing shell commands over a distributed, asynchronous queue. https://doi.org/10.1145/3491418.3535138 https://github.com/glentner/hyper-shell
  321. Covalent: Pythonic distributed workflow tool used to prototype and run high performance classical and quantum software https://github.com/AgnostiqHQ/covalent
  322. Icolos: workflow manager for structure-based workflows in computational chemistry https://github.com/MolecularAI/Icolos https://doi.org/10.26434/chemrxiv-2022-vqbxg
  323. dwork: Task graph scheduler with a minimalistic API. https://github.com/frobnitzem/dwork
  324. pmake: parallel make developed for use within batch jobs https://docs.olcf.ornl.gov/software/workflows/pmake.html#workflows-pmake
  325. Texera: Collaborative Data Analytics Using Workflows https://github.com/Texera/texera
  326. swif2: The Scientific Workflow Indefatigable Factotum https://scicomp.jlab.org/docs/swif2
  327. Jobflow: a library for writing computational workflows https://github.com/materialsproject/jobflow
  328. Balsam: High throughput workflows and automation for HPC https://github.com/argonne-lcf/balsam
  329. TaskVine: Dynamic workflows that run on HPC clusters, GPU clusters, and commercial clouds https://ccl.cse.nd.edu/software/taskvine/
  330. Globus Compute: High Performance Function Serving for Science https://github.com/funcx-faas/funcX
  331. mkite: Distributed computing for high-throughput materials simulation https://github.com/mkite-group
  332. hpcflow: Automated simulation, processing and archiving workflows on HPC systems https://github.com/hpcflow/hpcflow-new https://doi.org/10.12688/materialsopenres.17516.1
  333. Data Version Control: DVC data pipelines as "Makefile" system for machine learning projects https://dvc.org/doc/start/data-management/data-pipelines
  334. ZnFlow: General purpose framework for computational graphs https://github.com/zincware/ZnFlow
  335. Globus Flows: Globus automation services provide secure and reliable task orchestration across a set of heterogeneous resources at scale. https://docs.globus.org/api/flows/
  336. Kestra: an infinitely scalable orchestration and scheduling platform for creating, running, scheduling, and monitoring millions of complex pipelines. https://github.com/kestra-io/kestra
  337. Celery: Distributed Task Queue. https://github.com/celery/celery
  338. dflow: a Python framework for constructing scientific computing workflows (e.g. concurrent learning workflows), employing Argo Workflows as the workflow engine. https://github.com/deepmodeling/dflow
  339. pydra: A simple dataflow engine with scalable semantics. https://github.com/nipype/pydra
  340. funsies: A python library and execution engine to build reproducible, fault-tolerant, distributed and composable computational workflows. https://github.com/aspuru-guzik-group/funsies
  341. Maize: A graph-based workflow manager for computational chemistry pipelines. https://github.com/MolecularAI/maize
  342. doit: task management & automation tool. https://github.com/pydoit/doit
  343. drama: asynchronous workflow executor engine based on Dramatiq and Apache Kafka https://github.com/KhaosResearch/drama
  344. Titan: Semantic scientific workflow platform based on Drama, Apache Kafka, Avro and Spark https://titan.khaos.uma.es/
  345. disBatch: Tool to distribute a list of computational tasks over a pool of compute resources. The pool can grow or shrink. https://github.com/flatironinstitute/disBatch
  346. dawgz: Directed Acyclic Workflow Graph Scheduling. https://github.com/francois-rozet/dawgz
  347. hummingqueue: Single tenant distributed scheduling. https://github.com/NablaChem/hummingqueue
  348. Ewoks: Python-based meta workflow system for large-scale facilities, with multiple execution bindings https://ewoks.esrf.fr/ https://ewoks.readthedocs.io/
  349. DAGonStar: simple Python-based workflow engine that runs jobs on everything from the local machine to distributed virtual HPC clusters hosted in private and public clouds. https://github.com/DagOnStar/dagonstar
  350. MaRDIFlow: automation of abstracting meta-data embedded in an ontology of mathematical objects https://doi.org/10.48550/arXiv.2405.00028 https://doi.org/10.5281/zenodo.7863519
  351. Kerblam: Rust-based pipeline system for scientific data analysis https://www.kerblam.dev/
  352. PlantCV: Python-based imaging processing pipeline system for plant phenotyping https://plantcv.readthedocs.io/ https://doi.org/10.1016/j.molp.2015.06.005
  353. Maestro: Netflix’s Workflow Orchestrator https://github.com/Netflix/maestro
  354. BEE: Build and Execution Environment https://github.com/lanl/BEE
  355. pipefunc: Lightweight function pipeline (DAG) creation in pure Python for scientific workflows with local and SLURM parallelization and QoL tools https://github.com/pipefunc/pipefunc
  356. LEXIS: orchestration of complex scientific and industrial workflows which can be easily run on distributed HPC and Cloud resources https://lexis-project.eu/web/lexis-platform/
  357. Perqueue: persistent queue. https://gitlab.com/asm-dtu/perqueue
  358. Playbook Workflow Builder: Interactive Construction of Bioinformatics Workflows from a Network of Microservices https://playbook-workflow-builder.cloud/ https://github.com/MaayanLab/Playbook-Workflow-Builder https://doi.org/10.1101/2024.06.08.598037

2014: CWL idea at Bioinformatics Open Source Conference

 

2025: ~14 implementations

cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string

outputs:
  classout:
    type: File
    outputSource: compile/classfile

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]

  compile:
    run: arguments.cwl
    in:
      src: untar/example_out
    out: [classfile]

CWL is a YAML-based standard for workflows

 

Reusable tool definitions

 

Executed in Docker containers

 

Explicit inputs and outputs - can be rewired
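A reusable tool definition such as the `tar-param.cwl` referenced by the `untar` step could look roughly like this (a sketch following the CWL user guide's tar example; exact bindings assumed):

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [tar, --extract]
inputs:
  tarfile:
    type: File
    inputBinding:
      prefix: --file
  extractfile:
    type: string
    inputBinding:
      position: 1
outputs:
  example_out:
    type: File
    outputBinding:
      glob: $(inputs.extractfile)
```

Because the tool declares its inputs and outputs explicitly, the same definition can be rewired into other workflows unchanged.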

Workflow provenance

Recording computational processes

PROV: Standard for provenance

Workflow description

Execution provenance

Time travel:
Common Workflow Language modelled on wfdesc, while CWL provenance came later

Identified the need for multiple implementation levels

Which PROV format?

<prov:wasGeneratedBy>
  <prov:entity prov:ref="ex:ent1"/>
  <prov:activity prov:ref="ex:act1"/>
  <prov:time>2017-10-26T21:32:52Z</prov:time>
  <ex:port>p1</ex:port>
</prov:wasGeneratedBy>
wasGeneratedBy(ent1, act1,
  2017-10-26T21:32:52Z, [ex:port="p1"])
:ent1
  a prov:Entity;
  prov:wasGeneratedBy :act1;
  prov:generatedAtTime "2017-10-26T21:32:52Z"^^xsd:dateTime ;
  ex:port "p1" .
    "wasGeneratedBy": {
        "ex:gen1": {
            "prov:entity": "ent1",
            "prov:activity": "act1",
            "prov:time": "2017-10-26T21:32:52Z",
            "ex:port": "p1"
        }
    }
{ "@context": { .. }, 
  "@id": "ent1",
  "@type": "prov:Entity",
  "ex:port": "p1",
  "prov:generatedAtTime":  "2017-10-26T21:32:52Z",
  "prov:wasGeneratedBy": {
    "@id": "act1",
    "@type": "prov:Activity"
  } 
}

PROV-N

PROV-XML

PROV-JSON

PROV-O Turtle

PROV-O JSON-LD
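Whichever serialization is chosen, the statement stays machine-processable. The PROV-JSON variant, for instance, is just a nested data structure; a minimal sketch with Python's standard json module, mirroring the example above:

```python
import json

# The same wasGeneratedBy statement as a plain Python structure,
# following the PROV-JSON shape shown above.
doc = {
    "wasGeneratedBy": {
        "ex:gen1": {
            "prov:entity": "ent1",
            "prov:activity": "act1",
            "prov:time": "2017-10-26T21:32:52Z",
            "ex:port": "p1",
        }
    }
}

# Serialize and parse back: the structure round-trips losslessly.
serialized = json.dumps(doc, indent=2)
roundtrip = json.loads(serialized)
```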

Linked Data

<Alice> <knows> <Bob> .

A triple expresses a relationship between two named resources.

<Alice>   <knows> <Bob> .
<Alice>   <knows> <Charlie> .
<Alice>   <plays> <Tennis> .
<Bob>     <knows> <Charlie> .
<Bob>     <plays> <Football> .
<Charlie> <plays> <Tennis> .

Multiple triples with shared resources form a graph

<http://example.com/Alice>   <knows> <http://example.com/Bob> .
<http://example.com/Alice>   <knows> <http://example.com/Charlie> .
<http://example.com/Alice>   <plays> <http://dbpedia.org/resource/Tennis> .
<http://example.com/Bob>     <knows> <http://example.com/Charlie> .
<http://example.com/Bob>     <plays> <http://dbpedia.org/resource/Football> .
<http://example.com/Charlie> <plays> <http://dbpedia.org/resource/Tennis> .

URLs on the Web can serve as global identifiers
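An RDF-style graph needs nothing more than a set of subject–predicate–object tuples; shared resources connect the statements into a graph that can be queried from either end. A minimal sketch in plain Python, using the triples above:

```python
# A graph is just a set of (subject, predicate, object) triples.
triples = {
    ("http://example.com/Alice", "knows", "http://example.com/Bob"),
    ("http://example.com/Alice", "knows", "http://example.com/Charlie"),
    ("http://example.com/Alice", "plays", "http://dbpedia.org/resource/Tennis"),
    ("http://example.com/Bob", "knows", "http://example.com/Charlie"),
    ("http://example.com/Bob", "plays", "http://dbpedia.org/resource/Football"),
    ("http://example.com/Charlie", "plays", "http://dbpedia.org/resource/Tennis"),
}

def objects(subject, predicate):
    """All objects o where (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def subjects(predicate, obj):
    """All subjects s where (s, predicate, obj) is asserted."""
    return {s for s, p, o in triples if p == predicate and o == obj}

tennis_players = subjects("plays", "http://dbpedia.org/resource/Tennis")
```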

<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Leadóg"@ga .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Tennis"@sv .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Tenis"@es .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "テニス"@ja .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Tenis"@cs .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "网球"@zh .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Теннис"@ru .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Tenis"@pl .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Tennis"@de .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Teniso"@eo .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Tennis"@en .
<http://dbpedia.org/resource/Tennis> <http://www.w3.org/2000/01/rdf-schema#label> "Tenis"@eu .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "Tennis is a racket sport that is played either individually against a single opponent (singles) or between two teams of two players each (doubles). Each player uses a tennis racket that is strung with cord to strike a hollow rubber ball covered with felt over or around a net and into the opponent's court. The object of the game is to manoeuvre the ball in such a way that the opponent is not able to play a valid return. The player who is unable to return the ball validly will not gain a point, while the opposite player will. Tennis is an Olympic sport and is played at all levels of society and at all ages. The sport can be played by anyone who can hold a racket, including wheelchair users. The modern game of tennis originated in Birmingham, England, in the late 19th century as lawn tennis. It had close connections both to various field (lawn) games such as croquet and bowls as well as to the older racket sport today called real tennis. The rules of modern tennis have changed little since the 1890s. Two exceptions are that until 1961 the server had to keep one foot on the ground at all times, and the adoption of the tiebreak in the 1970s. A recent addition to professional tennis has been the adoption of electronic review technology coupled with a point-challenge system, which allows a player to contest the line call of a point, a system known as Hawk-Eye. Tennis is played by millions of recreational players and is a popular worldwide spectator sport. The four Grand Slam tournaments (also referred to as the majors) are especially popular: the Australian Open, played on hardcourts; the French Open, played on red clay courts; Wimbledon, played on grass courts; and the US Open, also played on hardcourts."@en .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "Ténis (português europeu) ou tênis (português brasileiro) é um esporte de origem inglesa, disputado em quadras geralmente abertas e de superfícies sintéticas, cimento, saibro ou relva. Participam no jogo dois oponentes ou duas duplas de oponentes, podendo ser mistas (homens e mulheres) ou não. A quadra é dividida em duas meia-quadras por uma rede, e o objetivo do jogo é rebater uma pequena bola para além da rede (para a meia-quadra adversária) com ajuda de uma raquete. Para marcar um ponto é preciso que a bola toque no solo em qualquer parte dentro da quadra adversária incluindo as linhas que demarcam o campo do oponente, fazendo com que o adversário não consiga devolver a bola antes do segundo toque, ou que a devolva para fora dos limites da outra meia-quadra. O desporto assim possui aspectos de ataque (rebater bem a bola, dificultando a devolução do adversário) e defesa (bom posicionamento em quadra, antecipação do lance adversário etc). O tênis possui um intricado sistema de pontuação, que subdivide o jogo em games/jogospt e sets/partidas pt. Grosso modo, um game é um conjunto de pontos (15-30-40-game) e um set é um conjunto de games (1-2-3-4-5-set). Cada game tem um jogador responsável por recolocar a bola em jogo: fazer o serviço ou sacar. No tênis de competição, é comum que o jogador que serve fature o game, já que tem a vantagem do ataque e dita o ritmo do jogo. Desta forma, uma das estratégias de jogo é tentar inverter esta vantagem durante a troca de bola ou durante a defesa fazer com que o adversário, através de erros, perca os games em que está sacando. Ganha o jogobr/encontropt aquele que atingir um número de sets pré-definido — geralmente 2 sets, sendo de 3 sets para os grandes torneios masculinos."@pt .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "Il tennis, il cui nome può essere tradotto in italiano con un termine arcaico non più in uso di “schiaffo palla” e storicamente noto col nome di \"pallacorda\", è uno sport della racchetta che vede opporsi due o quattro giocatori (due contro due, incontro di doppio) in un campo da gioco diviso in due metà da una rete alta 0,914 m al centro e 1,07 m ai lati. È considerato uno sport completo e armonico (sebbene fisicamente in parte asimmetrico perché un braccio viene escluso dal movimento del corpo) in quanto sono richieste al contempo ottime capacità fisiche (coordinazione, velocità e resistenza sulla distanza) e mentali (tecnica nei colpi, tattica, visione di gioco, riflessi e intuito)."@it .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "Tenis (angl. tennis < angl. tenes, tenetz < fr. tenez! = berte, držte! (imperativ pl. slova držet)), označovaný také jako bílý sport, je míčová hra pro 2 nebo 4 hráče. Varianta se 2 hráči se nazývá , varianta se 4 hráči pak čtyřhra. Hraje se také smíšená čtyřhra, při které v každé dvojici hraje jedna žena a jeden muž. Tenisový míč Soupeři stojí proti sobě, na obdélníkovém hřišti (tenisovém dvorci) a pokoušejí se odrazit tenisový míček tenisovou raketou do pole tak, aby jej soupeř nemohl vrátit nebo aby se s míčkem trefil vedle tenisového dvorce (do autu)."@cs .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "كرة المضرب (أو التنس الأرضي في الترجمات الحرفية) نوع من رياضات الراح والتي يتنافس فيها لاعبان في مباريات فردية، أو فريقان مكونان من لاعبين في مباريات زوجية. كلٌ منهم يحمل مضربا ليستخدمه في ضرب الكرة فوق الشبكة نحو منطقة الخصم. وعدد الضربات ليس محددا، إنما النتيجة تحدد الرابح. كرة المضرب أو التنس هو رياضة أولمبية تُلعب على جميع مستويات المجتمع وفئاته ولجميع الأعمار، يمكن لأي شخص أن يمسك بمضرب ويضرب الكرة بما في ذلك ذوو الاحتياجات الخاصة ومستخدمو الكراسي المتحركة. نشأت لعبة التنس الحديثة في برمنغهام بإنجلترا في أواخر القرن التاسع عشر كرياضة تلعب على العشب، لذا فهي وثيقة الصلة بكل من الألعاب الميدانية كالكريكت والبولينغ، إضافة إلى رياضة المضرب القديمة. تغيرت قواعد كرة المضرب الحديثة قليلاً منذ تسعينيات القرن التاسع عشر، باستثناء قاعدتين: من 1908 إلى 1961 كان على اللاعب أن يضع قدمًا واحدة على الأرض في جميع أوقات المباراة، وفي السبعينيات اعتمد الشوط الفاصل (تاي بريك). ومن الإضافات الحديثة إلى اللعبة على المستوى الاحترافي اعتماد تقنية المراجعة الإلكترونية (عين الصقر) إلى جانب نظام تحدي النقاط . يلعب كرة المضرب الملايين من لاعبين، وهي رياضة مشهورة ومنتشرة في جميع أنحاء العالم، وتحظى البطولات الكبرى لكرة المضرب بمتابعة جماهيرية وإعلامية كبيرة، تضعها في مرتبة متقدمة بين الأحداث الرياضية الأكثر أهمية ومتابعة في العالم سنويا. وأهم هذه البطولات: رولان غاروس أو بطولة فرنسا المفتوحة على الملاعب الرملية، بطولة أستراليا المفتوحة على الملاعب الصلبة، بطولة ويمبلدون على الملاعب العشبية، وبطولة الولايات المتحدة المفتوحة على الملاعب الصلبة."@ar .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "Teniso estas rakedsporto, kutime ludata de du homoj, sed ankaŭ foje de kvar homoj. Specifa para ludo estas la ambaŭseksa para ludo, kiam ĉiu teamo konsistas el unu viro kaj unu virino. Estas principe tri kategorioj de tenisejo, nome la gazona, malmola kaj batita terenoj. Dumlude, la ludantoj ĉiam devas revenigi la pilkon per siaj rakedoj al kontraŭa kampo, provante samtempe malebligi tion al oponanto(j)."@eo .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "Tenisa Ingalaterran eta XIX. mendean sortutako kirola da, pilota batekin eta bi edo lau jokalarien artean jokatua. Tenis jokalariei tenislari ere deritze. Mende haren amaieran arautu zuten: Sare batek banatzen duen zelai batean jokatzen da, jokalariak erraketez baliatzen dira pilota sarearen gainetik aurkariaren zelaira botatzeko, partidek ez dute iraupen jakinik (beharrezko puntuazioa lortzean bukatzen da), prestakuntza fisikoa eta teknika behar dira tenisaren ezaugarriak eta arautegia betetzeko."@eu .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/ontology/abstract> "Tenis atau bola tampel adalah olahraga yang biasanya dimainkan antara dua pemain atau antara dua pasangan masing-masing dua pemain. Setiap pemain menggunakan raket untuk memukul bola karet. Tujuan permainan adalah memainkan bola dengan cara tertentu sehingga pemain lawan tidak dapat mengembalikan bola tersebut. Tenis adalah salah satu cabang olahraga Olimpiade dan dimainkan pada semua tingkat masyarakat di segala usia. Olahraga ini dapat dimainkan oleh siapa saja, termasuk orang-orang yang menggunakan kursi roda. Permainan tenis modern berasal dari Birmingham, Inggris pada akhir abad ke-19 sebagai \"tenis lapangan rumput\". Peraturan tenis berubah sedikit sejak 1890-an. Dua perubahan kecil adalah sejak 1908 hingga 1961 pemain yang melakukan service (pukulan pertama) harus menjaga salah satu kakinya tetap di tanah hingga service berpindah dan adopsi sistem tie-break pada 1970-an. Tambahan terakhir yang diterapkan pada tenis profesional adalah teknologi tinjauan ulang elektronik. Tenis dimainkan oleh jutaan orang sebagai olahraga rekreasi dan juga merupakan olahraga tontontan populer di seluruh dunia.Empat kejuaraan tenis terkemuka adalah Australia Terbuka yang dimainkan di lapangan keras, Prancis Terbuka yang dimainkan di lapangan tanah liat, Wimbledon yang dimainkan di lapangan rumput, dan AS Terbuka yang dimainkan juga di lapangan keras."@in .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/property/venue> <http://dbpedia.org/resource/Tennis_court> .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/property/equipment> <http://dbpedia.org/resource/Net_sport> .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/property/equipment> <http://dbpedia.org/resource/Tennis_ball> .
<http://dbpedia.org/resource/Tennis> <http://dbpedia.org/property/equipment> <http://dbpedia.org/resource/Racket_(sports_equipment)> .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "Bola tenis adalah bola yang dirancang untuk olahraga tenis. Bola tenis berwarna kuning fluoresens pada acara olahraga besar, tetapi dalam permainan rekreasi bisa berwarna apa saja. Bola tenis ditutupi dengan serat berserat yang memodifikasi sifat aerodinamisnya, dan masing-masing memiliki lekukan lonjong melengkung putih yang menutupinya. Bola tenis diisi dengan udara dan permukaannya ditumbuhi oleh kompon karet berserat seragam. Keterlambatan terasa pemisahan aliran di lapisan batas yang mengurangi hambatan aerodinamik dan memberi bola sifat terbang yang lebih baik."@in .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "Tenisový míč či tenisák je vyrobený z pevné gumové duše, která je obalena meltonem a nahuštěná vzduchem. Jen míče značky Tretorn jsou zcela vyplněny pórovitou hmotou. Jedná se o míč určený pro tenis. Povrch je jednobarevný, dříve byl bílý či oranžový. Existují však i míčky vícebarevné, například zeleno-růžové. Barva dnešního míčku závisí na osvětlení a člověku, který se na míček dívá. Někteří ho vidí neonově žlutý, jiní světle zelený a další zeleno-žlutý."@cs .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "كرة التنس هي كرة مصممة لممارسة رياضة التنس. كرات التنس لونها صفراء مُتفتحة في الأحداث الرياضية الكبرى، ولكن في اللعب الترفيهي يمكن أن يكون أي لون تقريبًا. وتغطي كرات التنس بواسطة لباد ليفي الذي يعدل من مرور الكرات في الهواء. تمتلئ كرات التنس بالهواء وتظهر على السطح بواسطة مركب مطاطي مغطى باللباد. يؤخر فصل التدفق في الطبقة الحدودية مما يقلل من السحب الديناميكي الهوائي ويمنح الكرة خصائص طيران أفضل. غالبًا ما تحتوي الكرات على رقم بالإضافة إلى اسم العلامة التجارية. وهذا يساعد على تمييز مجموعة من الكرات عن مجموعة أخرى من نفس العلامة التجارية في ملعب مجاور."@ar .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "テニスボール(英: tennis ball)とは、狭義には、テニス(狭義のテニス、すなわち、硬式テニスともいうローンテニス)用に作られたボール。厚いに毛羽のあるフェルトを被せた球の形で、フェルトによる特有の反発性をもつ。art-antsを呼称する言葉でもある。 また、広義の「テニス」にはソフトテニス(en. 軟式テニス、軟式庭球)も含まれることから、ソフトテニスのボールも含む。"@ja .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "Der Tennisball ist das Spielgerät, mit dem Tennis gespielt wird. Er wird mit Hilfe der Tennisschläger gespielt."@de .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "Een tennisbal is een bal die wordt gebruikt bij tennis. Er wordt tegenaan geslagen met een tennisracket en met de bal worden punten gescoord. Er zijn verschillende soorten tennisballen voor verschillende baansoorten en weertypen. De meest voorkomende kleur is fluorescerend geel met een lichtgekleurde kromme streep."@nl .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "Una pelota de tenis es un objeto esférico de pequeño diámetro (especialmente en comparación con la medida de otras pelotas) que se utiliza en la práctica del deporte homónimo. Para impulsarla se utiliza una raqueta. Un profesional puede hacer que llegue a coger una velocidad de 263 km/h. Se recubre con fieltro (en general de color amarillo fluorescente con rayas de separación blancas), puesto que es el «color más visible por el ojo humano» y permite una trayectoria más correcta, en comparación con una totalmente lisa.​ De media, una pelota de tenis pesa de 56 a 58 gramos.​​"@es .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "Tennisboll är en typ av boll som används inom sporten tennis. Officiella bollar skall vara vita eller gula (träningsbollar kan ha andra färger) och ha en diameter mellan 65,4–73,0 mm (2,58–2,87 tum). De skall väga mellan 56,0 och 59,4 gram. Bollarna ska studsa 135–147 cm efter att ha släppts från en höjd av 254 cm på ett betonggolv. Den moderna tennisbollen tillverkas av två likformade delar av format gummi som sammanfogats till ett ihåligt klot. Två hantelformade filtade delar av flanell eller nylon limmas utanpå bollen och ger den dess karakteristiska utseende."@sv .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "A tennis ball is a ball designed for the sport of tennis. Tennis balls are fluorescent yellow in organised competitions, but in recreational play can be virtually any color. Tennis balls are covered in a fibrous felt which modifies their aerodynamic properties, and each has a white curvilinear oval covering it."@en .
<http://dbpedia.org/resource/Tennis_ball> <http://www.w3.org/2000/01/rdf-schema#comment> "Теннисный мяч — предназначен для тенниса. Он примерно 6,7 см (2,63 дюйма) в диаметре. Теннисные мячи на крупных спортивных соревнованиях желтые, но могут быть практически любого цвета. Теннисные мячи покрыты войлоком, который модифицирует их аэродинамические свойства, и имеют белую волнистую линию, которая охватывает снаряд."@ru .

Linked Data: following the URLs to find more triples
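Following a Linked Data URL is ordinary HTTP plus content negotiation; a minimal sketch with Python's standard library (the Accept values are real RDF media types, but which serialization a given server honours varies):

```python
from urllib.request import Request

def linked_data_request(uri):
    # Ask the server for an RDF serialization (Turtle, falling back
    # to JSON-LD) instead of the human-readable HTML page.
    return Request(uri, headers={
        "Accept": "text/turtle, application/ld+json;q=0.9"})

req = linked_data_request("http://dbpedia.org/resource/Tennis")
# urlopen(req) would then return more triples about Tennis;
# the request is not sent here, to keep the sketch offline.
```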

Linked Data standards

..used by >50% of all websites?

2010–: Research Objects

Model for Research Objects

Reused multiple ontologies

Implemented by workflow systems and supporting RO services

Data (and metadata) should be:

  • Findable
  • Accessible
  • Interoperable
  • Reusable

Aim of FAIR principles

Machine readable data & metadata

Community agreement on standards

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

 


To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available


To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

 


To be Reusable:

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards


The FAIR Principles have propelled the global debate about better data stewardship in data-driven and open science, and they have triggered funding bodies (e.g., UKRI, Wellcome Trust, USA NIH, EC Horizon 2020) to mandate FAIR data in their data policies.

2019: My PhD journey starts

Supervisors: Paul Groth, Carole Goble

 Peter Sefton at Open Repositories 2019

Cultural heritage records

Research Object Crate

Aims of FAIR Research Objects

Describe and package data collections, datasets, software etc. with their metadata

Platform-independent object exchange between repositories and services

Support reproducibility and analysis: link data with codes and workflows

Transfer of sensitive/large distributed datasets with persistent identifiers

 

Propagate provenance, citations and existing metadata

Reuse existing standards, but hide their complexity

Vocabulary: A set of types with properties

 

schema.org is used by almost half of all popular websites -- mainly to improve search engine ranking?

 

 

Explaining by example for developers: Linked Data by stealth

Capturing workflows as FAIR Research Objects

Executions of workflows can be tracked using a profile of RO-Crate
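A consumer needs nothing beyond plain JSON tooling to navigate a crate; a sketch over a hypothetical, minimal metadata document (structure per RO-Crate conventions, all values invented):

```python
import json

# A hypothetical ro-crate-metadata.json: the descriptor entity points
# "about" at the root dataset, which lists its parts.
crate_json = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json",
     "@type": "CreativeWork",
     "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "Example crate",
     "hasPart": [{"@id": "results.csv"}]},
    {"@id": "results.csv", "@type": "File", "name": "Results"}
  ]
}
"""

crate = json.loads(crate_json)
entities = {e["@id"]: e for e in crate["@graph"]}

# The root dataset is whatever the metadata descriptor is "about".
root_id = entities["ro-crate-metadata.json"]["about"]["@id"]
root = entities[root_id]
```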

 

Interoperable reproducibility of computational methods

 

Developed as a community task force, 7+ implementations

Multiple implementation levels

Pushing the envelope of FAIR

Five Safes principles

Green & Ritchie (2023)

https://doi.org/10.29012/jpc.831

Trusted Research Environments (TRE) ensure safe access to sensitive data

 

TREs provide a locked-down computational space (e.g. a VM) and enforce disclosure control before outputs leave the premises.

Federated Analytics with Five Safes RO-Crate

RO-Crate used a common vocabulary, so it was easy to extend for human actions

FAIR Research Objects can be created in a pragmatic way

 

By using Web technologies and explaining by example,

Linked Data technology can be used by most developers

 

Profiles as a lightweight approach to ontologies, a "just enough" formalisation of usage patterns and domain vocabularies

 

Linked Data as basis:
formalisations, extension mechanisms, graph queries, validation, integration/mapping, ++

(only use these if you need to!)

The RO-Crate team is:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@type": "CreativeWork",
  "@id": "#talk",
  "name": "Challenging Linked Data Practices for FAIR Research Objects and Computational Workflows",
  "description": "A presentation discussing the role of Linked Data and RO-Crate in the era of LLMs, emphasizing their necessity for FAIR computational workflows.",
  "author": {
    "@type": "Person",
    "name": "[Your Name]",
    "affiliation": "[Your Institution]",
    "orcid": "[Your ORCID]"
  },
  "datePublished": "2025-02-19",
  "keywords": ["RO-Crate", "Linked Data", "FAIR", "Computational Workflows"],
  "subjectOf": {
    "@type": "Event",
    "name": "[Conference or Workshop Name]",
    "startDate": "[Event Date]",
    "location": {
      "@type": "Place",
      "name": "[Event Location]"
    }
  },
  "relatedMaterial": [
    {
      "@type": "CreativeWork",
      "name": "RO-Crate Specification",
      "url": "https://www.researchobject.org/ro-crate/"
    }
  ]
}
{
  "@context": [
    "https://schema.org/",
    "https://w3id.org/ro/crate/1.1/context"
  ],
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
      "about": {"@id": "./"},
      "description": "RO-Crate Metadata for the presentation: Challenging Linked Data Practices for FAIR Research Objects and Computational Workflows"
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Presentation RO-Crate",
      "hasPart": [
        {"@id": "presentation.pdf"},
        {"@id": "#presentation"}
      ],
      "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
      "publisher": "FAIR Symposium 2023",
      "datePublished": "2023-10-15"
    },
    {
      "@id": "#presentation",
      "@type": ["PresentationDigitalDocument", "File"],
      "name": "Challenging Linked Data Practices for FAIR Research Objects and Computational Workflows",
      "description": "A critical analysis of Linked Data practices in FAIR workflows and Research Objects.",
      "author": {"@id": "#alice"},
      "dateCreated": "2023-10-10",
      "keywords": ["FAIR", "Linked Data", "RO-Crate", "Workflows"],
      "about": [
        {"@id": "https://doi.org/10.0000/example"},
        {"@id": "#event"}
      ]
    },
    {
      "@id": "#alice",
      "@type": "Person",
      "name": "Alice Researcher",
      "affiliation": "FAIR Institute",
      "identifier": "https://orcid.org/0000-0000-0000-0000"
    },
    {
      "@id": "#event",
      "@type": "Event",
      "name": "FAIR Symposium 2023",
      "startDate": "2023-10-15",
      "location": "Virtual"
    },
    {
      "@id": "presentation.pdf",
      "@type": "File",
      "name": "Slide deck (PDF)",
      "encodingFormat": "application/pdf",
      "contentSize": "2.5 MB"
    }
  ]
}
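The machine-actionability argument can be made concrete with a short consistency check over a crate like the one above: does the metadata descriptor exist, and does every `hasPart` reference resolve to an entity in the @graph? These checks are a sketch, not the official RO-Crate validation rules.

```python
# Illustrative consistency checks on an RO-Crate @graph.  These are NOT
# the normative RO-Crate validation rules, just a sketch of what
# structured metadata makes possible without human intervention.
def check_crate(crate):
    by_id = {e["@id"]: e for e in crate["@graph"]}
    errors = []
    root = by_id.get("./")
    if root is None:
        errors.append("missing root dataset './'")
    else:
        for part in root.get("hasPart", []):
            if part["@id"] not in by_id:
                errors.append(f"dangling hasPart: {part['@id']}")
    if "ro-crate-metadata.json" not in by_id:
        errors.append("missing metadata descriptor")
    return errors

crate = {"@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "presentation.pdf"}]},
    {"@id": "presentation.pdf", "@type": "File"},
]}
print(check_crate(crate))  # -> []
```

A validator, repository, or workflow engine can run checks like these automatically; probabilistic text generation alone cannot guarantee them.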

Anticipated Question:

"With the rise of LLMs, why do we still need structured data formats like RO-Crate and Linked Data?"

Answer Framework

  1. Reliability & Consistency:
    LLMs generate probabilistic outputs, which can be inconsistent or hallucinate metadata. Structured data (e.g., JSON-LD in RO-Crate) ensures machine-actionable, predictable, and domain-specific metadata, critical for reproducibility and FAIR compliance.

  2. Interoperability:
    LLMs lack built-in semantics for cross-system data exchange. Formats like RO-Crate use standards (e.g., schema.org, PROV-O) to enable automated integration across repositories, workflow engines, and research infrastructures.

  3. Context Preservation:
    LLMs may strip contextual relationships (e.g., authorship, funding, dependencies). RO-Crate explicitly captures provenance, licensing, and dependencies as structured graphs, preserving research context.

  4. Computational Actionability:
    Structured metadata allows tools (e.g., workflow runners, validators) to programmatically resolve dependencies, verify licenses, or trace data lineage without human intervention—LLMs alone cannot do this.

  5. Synergy with LLMs:
    LLMs can enhance RO-Crate creation (e.g., auto-annotating files) or querying (natural language to SPARQL). However, they rely on structured data as a ground-truth source for training and reasoning.


What about the below JSON-LD of an RO-Crate, can you summarise it for me?
(..)

1. Overview of the Dataset

  • Title: "BY-COVID - WP5 - Baseline Use Case: SARS-CoV-2 vaccine effectiveness assessment - Local outputs and comparative analysis"

  • Description:
    The dataset contains local outputs and comparative analysis from a federated study across multiple European regions (Aragon, Spain; Brussels and Wallonia, Belgium; Finland). It evaluates the effectiveness of SARS-CoV-2 vaccines in preventing infections using observational, retrospective longitudinal data.

  • Research Question:
    "How effective have the SARS-CoV-2 vaccination programmes been in preventing SARS-CoV-2 infections?"

  • Study Design:
    Emulates a hypothetical Target Trial (TT) using real-world data, with a federated analysis approach to ensure data privacy and security.

2. Key Components

Funding

3. Workflow and Computational Details

  • Main Workflow:

    • Title: "BY-COVID WP5 T5.2 Baseline Use Case"

    • DOI: 10.48546/workflowhub.workflow.502.4

    • Programming Language: Quarto Markdown

    • Inputs: Vaccine effectiveness data (CSV files)

    • Outputs: Interactive HTML reports and Excel files

    • Containerization: Docker and Singularity containers were used for execution in secure environments.

Workflow Runs

  • Regional Workflows:

    • Aragon, Spain: Executed manually in a Docker container due to secure environment constraints.

    • Brussels and Wallonia, Belgium: Executed in a manually configured R environment.

    • Finland: Executed in a Singularity container.

  • Comparative Analysis:
    Aggregated results from regional workflows into an interactive HTML report using a Quarto script.

4. Files and Outputs

Local Outputs

  • Interactive HTML Reports:
    Each region produced 6 interactive reports covering:

    1. Data Quality Assessment (DQA)

    2. Validation

    3. Imputation

    4. Matching

    5. Descriptive Analysis

    6. Survival Analysis

  • Excel Files:
    Summarized results for each region (e.g., results-survival-analysis-ESP.xlsx).

Comparative Analysis

  • Interactive HTML Report:
    Aggregated results from all regions into a single interactive report.

  • Quarto Script:
    Used to generate the comparative analysis report.

Reproducible Environment

  • R Environment:

    • renv/: R environment with activation scripts.

    • renv.lock: Lockfile for reproducible package versions.

    • .Rprofile: R profile for environment configuration.

5. Provenance and Execution

  • Provenance:
    Detailed execution logs for each workflow run, including:

    • Container Images: Docker and Singularity images used for execution.

    • Execution Times: Start and end times for each run.

    • Agents: Researchers who executed the workflows (e.g., Marjan Meurisse, Francisco Estupiñán-Romero).

  • Action Status: All workflows were completed successfully.

6. Linked Data and Interoperability

  • Linked Data:
    The RO-Crate uses JSON-LD to link entities (e.g., workflows, datasets, authors, organizations) using standardized vocabularies like schema.org, EDAM, and PROV-O.

  • Interoperability:
    The crate is designed to be interoperable with tools like WorkflowHub and Zenodo, ensuring FAIR (Findable, Accessible, Interoperable, Reusable) compliance.

7. Significance

  • FAIR Compliance:
    The RO-Crate ensures that the research outputs are FAIR by providing structured metadata, provenance, and reusable workflows.

  • Reproducibility:
    The use of containers (Docker, Singularity) and reproducible environments (R, renv) ensures that the analysis can be rerun in the future.

  • Privacy and Security:
    The federated approach and secure processing environments (SPEs) ensure that sensitive health data is not exposed.

Conclusion

This RO-Crate exemplifies how structured data formats like JSON-LD and standards like RO-Crate enhance reproducibility, interoperability, and FAIRness in research. While LLMs can assist in generating or querying metadata, structured data remains essential for ensuring machine-actionability, provenance tracking, and long-term preservation of research outputs.

 


Challenging Linked Data practices for FAIR Research Objects and Computational Workflows

By Stian Soiland-Reyes

Presented 2025-02-19 at Inaugural seminar at Department of Computer Science, The University of Manchester
