PFHub Reimplementation for FAIR Data Collection
Daniel Wheeler
2022-05-11
Overview
- Using GitHub workflow (extreme FAIRness)
- Examples from other communities
- Nixpkgs
- Conda-Forge
- NIST Code Portal
- Schema Tools
- The CodeMeta Project
- ASDF - Advanced Scientific Data Format
- Boutiques
- Python-PFHub (FAIRer than JS)
- PFHub example submission using issue templates
Nixpkgs
- Nixpkgs is a large community
- 100k packages
- 5k issues
- 3.1k PRs
- 4k contributors
- Completely GitHub based
- Single repository
- 100s of CI workflows
- Takes submissions from 100s of users every day that require human interaction
Nixpkgs
Example submission on Nixpkgs
- Checks to ensure compliance
- Automated assignment
- Human interaction
- Human CIs
- Automated CIs
Aside: Using Nix for Data Workflows
- Nix used to orchestrate data workflows
- Not just build workflows
- Functional storage
- Completely reproducible
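The "functional storage" idea can be illustrated with a small sketch: the location of a result is computed purely from the recipe and the inputs that produce it, so identical inputs always map to the same path and any change upstream yields a new one. This is only an illustration of the principle in Python, not Nix's actual hashing scheme or store layout; `store_path`, the `/store/` prefix, and the example recipes are all hypothetical.

```python
import hashlib
import json


def store_path(name, recipe, inputs):
    """Return a store-style path derived only from the recipe and its inputs.

    Hypothetical helper: it mimics the idea of a functional store,
    not the real Nix implementation.
    """
    # Serialize deterministically so that equal inputs always hash equally.
    payload = json.dumps(
        {"name": name, "recipe": recipe, "inputs": sorted(inputs)},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]
    return f"/store/{digest}-{name}"


# A two-step data workflow: the summary depends on the raw data path,
# so changing how the raw data is produced also changes the summary path.
raw = store_path("raw-data", "fetch benchmark results", [])
table = store_path("summary-table", "aggregate results", [raw])
```

Because the path of `table` depends on the path of `raw`, a stale result can never be mistaken for a fresh one, which is one way to read "completely reproducible" above.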
Using Nix in PFHub
- PFHub uses Nix for all builds
- Python-pfhub will also have Pip and Conda builds
- Uses Cachix to cache builds
- All CI builds with Nix
Conda-Forge
- Different model to Nixpkgs
- 16k repositories!!!
- Users merge (not admins)
- But all GitHub based
NIST Code Portal
- Akin to PFHub
- Jekyll frontend
- CMS-free
- Builds NIST's code.json daily
- All GitHub Actions
Schema: CodeMeta Project
- Simple standard for code metadata (from science community)
- Includes: 6 basic categories of data (software, discoverability, development, run-time, versions, other)
- Plan to use this with PFHub
- Metadata builder tools include web, CLI, and Python
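As a rough idea of what a CodeMeta record for a submitted code might look like, the sketch below builds a minimal `codemeta.json` document with the Python standard library. The property names (`name`, `codeRepository`, `version`, `programmingLanguage`, `license`) are CodeMeta terms, but the helper `build_codemeta`, the choice of fields, and every example value are assumptions for illustration, not the schema PFHub has adopted.

```python
import json


def build_codemeta(name, repository, version, language, license_url):
    """Assemble a minimal CodeMeta record as a plain dictionary.

    Hypothetical helper for illustration; real records usually carry
    many more properties (authors, description, dependencies, ...).
    """
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "codeRepository": repository,
        "version": version,
        "programmingLanguage": language,
        "license": license_url,
    }


# Placeholder values only; none of these refer to a real submission.
record = build_codemeta(
    name="example-phase-field-code",
    repository="https://example.org/example-phase-field-code",
    version="0.1.0",
    language="Python",
    license_url="https://spdx.org/licenses/MIT",
)
codemeta_json = json.dumps(record, indent=2)
```

Writing `codemeta_json` to a file named `codemeta.json` at the root of a repository is the usual convention for publishing such a record.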
Schema: ASDF
- Settled on YAML in 2015 for astronomical data (many other choices)
- ASCII and binary data in same file
- Includes simple editable data files
- Supports compressed NumPy arrays
- A number of readers available
- No standard data model
- See "ASDF: A new data format for astronomy", Greenfield et al.
Schema: Boutiques
- Not a workflow language
- Formal command line description
- Specify inputs and outputs
Boutiques output for "echo" command
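A descriptor of this kind might look roughly like the following sketch, written here as a Python dictionary. The keys (`command-line`, `inputs`, `value-key`, and so on) follow the Boutiques descriptor format, but the specific values are made up for illustration, and `render_command` is a hypothetical helper showing how a value-key template turns into a concrete command line. It is not the Boutiques tooling itself.

```python
import shlex

# Illustrative descriptor for "echo"; values are assumptions, not tool output.
descriptor = {
    "name": "echo",
    "description": "Print a string to standard output",
    "tool-version": "1.0",
    "schema-version": "0.5",
    "command-line": "echo [TEXT]",
    "inputs": [
        {
            "id": "text",
            "name": "Text to print",
            "type": "String",
            "value-key": "[TEXT]",
        }
    ],
}


def render_command(descriptor, values):
    """Substitute input values into the descriptor's command-line template.

    Hypothetical helper: missing inputs are replaced by an empty string,
    and supplied values are shell-quoted.
    """
    command = descriptor["command-line"]
    for spec in descriptor["inputs"]:
        value = values.get(spec["id"])
        replacement = "" if value is None else shlex.quote(str(value))
        command = command.replace(spec["value-key"], replacement)
    return command.strip()


command = render_command(descriptor, {"text": "hello world"})
```

The point of the formal description is that the command line, its inputs, and its outputs can all be validated and generated by tools rather than by hand.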
Python-PFHub
- Python package that deals with all data transformations and aggregations
- All data exported as Pandas dataframes
- Easy for others to augment, develop, change
- Everything in Python
- Plotly has improved Python support
- Everything working outside of website setting
PFHub Submissions
- Started with simple YAML file
- Fill out YAML file by hand
- Submit pull-request
- CIs + human checks in pull-request
- Next iteration included an upload form
- Fill out sophisticated form
- On submit, the Staticman app opens a pull-request
- Currently working on using GitHub issue template
- Issue template is a simple form (not sophisticated)
- On submission, launches a GitHub Action (parses the form and submits a pull-request)
- CLI tool?
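The parsing step in the issue-template approach can be sketched as follows. GitHub issue forms arrive in the issue body as Markdown, with each field label rendered as a `### ` heading followed by the submitted value, and `_No response_` for fields left empty. The function below turns such a body into a dictionary that a later step could write into a pull-request. The name `parse_issue_form` and the field labels in the example are hypothetical, not taken from PFHub's actual template.

```python
import re


def parse_issue_form(body):
    """Convert the Markdown body of a GitHub issue form into a dictionary.

    Hypothetical helper: keys are the form's field labels, values are the
    submitted text (None when the field was left empty).
    """
    fields = {}
    # Split on level-3 headings; the chunk before the first heading is dropped.
    for chunk in re.split(r"^### ", body, flags=re.MULTILINE)[1:]:
        label, _, value = chunk.partition("\n")
        value = value.strip()
        # GitHub writes "_No response_" for optional fields left blank.
        fields[label.strip()] = None if value == "_No response_" else value
    return fields


# Example body with made-up field labels and values.
example_body = """### Benchmark

1a.1

### Code name

example-code

### Results URL

_No response_
"""

submission = parse_issue_form(example_body)
```

In a GitHub Action, the body would typically be read from the event payload and the resulting dictionary serialized to the submission file before the pull-request is opened.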
DEMO
Discussion
- Upload mechanism (CLI?, GitHub issue templates?)
- Schema
- File type (ASDF?)