PFHub Reimplementation for FAIR Data Collection

Daniel Wheeler

2022-05-11

Overview

  • Using GitHub workflow (extreme FAIRness)
    • Examples from other communities
      • Nixpkgs
      • Conda-Forge
      • NIST Code Portal
  • Schema Tools
    • The CodeMeta Project
    • ASDF - Advanced Scientific Data Format
    • Boutiques
  • Python-PFHub (FAIRer than JS)
  • PFHub example submission using issue templates

Nixpkgs

  • Nixpkgs is a large community
    • 100k packages
    • 5k issues
    • 3.1k PRs
    • 4k contributors
  • Completely GitHub based
  • Single repository
  • 100s of CI workflows
  • Takes submissions requiring human interaction from 100s of users every day

Nixpkgs

Example submission on Nixpkgs

 

  • Checks to ensure compliance
  • Automated assignment
  • Human interaction
  • Human CIs
  • Automated CIs

Aside: Using Nix for Data Workflows

  • Nix used to orchestrate data workflows
  • Not just build workflows
  • Functional storage
  • Completely reproducible

Using Nix in PFHub

  • PFHub uses Nix for all builds
  • Python-PFHub will also have pip and Conda builds
  • Uses Cachix to cache builds
  • All CI builds with Nix

Conda-Forge

  • Different model to Nixpkgs
    • 16k repositories!!!
    • Users merge (not admins)
  • But still entirely GitHub-based

NIST Code Portal

  • Akin to PFHub
    • Jekyll frontend
    • CMS-free
  • Builds NIST's code.json daily
  • All GitHub Actions

Schema: CodeMeta Project

  • Simple standard for code metadata (from science community)
  • Covers 6 basic categories of metadata (software, discoverability, development, run-time, versions, other)
  • Plan to use this with PFHub
  • Metadata builder tools include web, CLI, and Python
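
A minimal sketch of what a CodeMeta record could look like when built from Python. The property names (`name`, `codeRepository`, `programmingLanguage`, `author`) are CodeMeta 2.0 terms; the helper `build_codemeta` and all values are hypothetical placeholders, not PFHub's actual metadata.

```python
import json


def build_codemeta(name, description, version, repository, languages, authors):
    """Return a CodeMeta 2.0 record (JSON-LD) as a plain dict."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "description": description,
        "version": version,
        "codeRepository": repository,
        "programmingLanguage": languages,
        # Each author is given as a (given name, family name) pair.
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
    }


# All values below are made-up placeholders for illustration.
record = build_codemeta(
    name="example-phase-field-code",
    description="Placeholder description of a phase field code",
    version="0.1.0",
    repository="https://example.org/example-phase-field-code",
    languages=["Python"],
    authors=[("Jane", "Doe")],
)

# A codemeta.json file is simply this dict serialized as JSON.
codemeta_json = json.dumps(record, indent=2)
print(codemeta_json)
```

Because the record is plain JSON-LD, it can be kept next to the code as `codemeta.json` and checked in CI like any other file.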

Schema: ASDF

  • Settled on YAML in 2015 for astronomical data (many other choices)
  • ASCII and binary data in same file
    • include simple editable data files
    • supports compressed NumPy arrays
  • Number of readers available
  • No standard data model
  • See "ASDF: A new data format for astronomy", Greenfield et al.

Schema: Boutiques

  • Not a workflow language
  • Formal command line description
  • Specify inputs and outputs

Boutiques output for "echo" command
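
The original slide showed this as an image. As a stand-in, here is a rough sketch of a Boutiques-style descriptor for `echo`, written as a Python dict. The field names (`command-line`, `inputs`, `value-key`, `output-files`, `path-template`) follow the Boutiques descriptor schema as best understood here; the concrete values and the `render_command` helper are assumptions for illustration, not the Boutiques implementation.

```python
import json
import shlex

# Hypothetical descriptor: describes the command line, its inputs, and its outputs.
descriptor = {
    "name": "echo",
    "description": "Write a message to a file",
    "tool-version": "1.0",
    "schema-version": "0.5",
    "command-line": "echo [MESSAGE] > [OUTPUT]",
    "inputs": [
        {"id": "message", "name": "Message", "type": "String",
         "value-key": "[MESSAGE]"},
        {"id": "output", "name": "Output file name", "type": "String",
         "value-key": "[OUTPUT]"},
    ],
    "output-files": [
        {"id": "result", "name": "Result file", "path-template": "[OUTPUT]"},
    ],
}


def render_command(descriptor, values):
    """Substitute input values for their value-keys in the command-line template.

    Simplified stand-in for what the Boutiques tooling does at execution time.
    """
    command = descriptor["command-line"]
    for spec in descriptor["inputs"]:
        if spec["id"] not in values:
            raise KeyError(f"missing input: {spec['id']}")
        quoted = shlex.quote(str(values[spec["id"]]))
        command = command.replace(spec["value-key"], quoted)
    return command


command = render_command(descriptor, {"message": "hello", "output": "out.txt"})
print(command)
print(json.dumps(descriptor, indent=2))
```

The point of the sketch is that the descriptor is data, not a workflow: it only states how to call one tool and what files to expect back.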

Python-PFHub

  • Python package that deals with all data transformations and aggregations
  • All data exported as Pandas dataframes
  • Easy for others to augment, develop, change
  • Everything in Python
    • Plotly has improved Python support
  • Everything working outside of website setting

PFHub Submissions

  1. Started with simple YAML file
    • Fill out YAML file by hand
    • Submit pull-request
    • CIs + human checks in pull-request
  2. Next iteration included an upload form
    • Fill out sophisticated form
    • On submit, the Staticman app opens a pull request
  3. Currently working on using a GitHub issue template
    • Issue template is a simple form (not sophisticated)
    • Submission launches a GitHub Action that parses the form and submits a pull request
  4. CLI tool?
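
The form-parsing step in the issue template approach can be sketched as follows. GitHub renders an issue form submission as a Markdown body with one `### Label` heading per field and `_No response_` for empty optional fields. The field labels and the `parse_issue_form` helper below are hypothetical, not PFHub's actual template or Action.

```python
import re


def _clean(lines):
    """Join a field's lines; map GitHub's empty-field marker to None."""
    value = "\n".join(lines).strip()
    return None if value == "_No response_" else value


def parse_issue_form(body):
    """Split an issue form body into a {label: value} dict."""
    fields = {}
    label = None
    lines = []
    for line in body.splitlines():
        match = re.match(r"^###\s+(.*\S)\s*$", line)
        if match:
            # A new heading closes the previous field.
            if label is not None:
                fields[label] = _clean(lines)
            label = match.group(1)
            lines = []
        elif label is not None:
            lines.append(line)
    if label is not None:
        fields[label] = _clean(lines)
    return fields


# Made-up example of a submitted form body.
body = """### Benchmark ID

1a.1

### Software name

example-code

### Notes

_No response_
"""

fields = parse_issue_form(body)
print(fields)
```

In a GitHub Action, the resulting dict would then be written out as the submission's metadata file and committed on a branch for the pull request, where the CI and human checks run as before.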

Demo

Discussion

  • Upload mechanism (CLI?, GitHub issue templates?)
  • Schema
  • File type (ASDF?)