PFHub Updates and Ideas

Daniel Wheeler

Phase Field Workshop, 2023-08-16

Long-Term Vision for PFHub

Central registry of curated phase field results with a CLI/Web/API tool to view and query results in many ways.
The website is the registry, plus examples of using the tool with views and phase field guide materials.

Please add ideas and open discussions on usnistgov/pfhub

Updates

Update environments
- Nix environments updated
- Native Python (Pip, Conda, Mamba) environment implemented / tested
Implementing CLI tool for PFHub
Zenodo submission process
Papermill / Jupyter website build
- BM1, BM2, BM3, BM4, BM7, BM8
Generated new schema using LinkML (Trevor)

FAIR Improvements

New schema in human-readable form using LinkML
- Seamless conversion between schema.org, json-schema, jsonld, yaml
- MaRDA working group for more general phase field schema (tomorrow)
Require the implementation to be in a publicly accessible archive
Encourage use of FAIR4RS principles (metadata.json in repo)
Require curation of result data on Zenodo (or similar)
Improve data accessibility using Jupyter Notebooks and Python utility (in place of JS stack and custom apps)

PFHub CLI Tool

Makes the submission process more coherent
- CLI tool can be used by a user on the local filesystem for submissions
  - View / compare results on local FS as they appear on website
- Use same CLI tool for automated submissions and continuous integration
- Implement Zenodo / PFHub submission to be a seamless process
Not quite finished for this meeting
First version on PyPI soon
Eventually the CLI will be subsumed by an upload notebook hosted locally or via a cloud service

What Next?

What next?
- Split repository into python-pfhub and web
- Finish new upload process with upload notebook built using CLI
- Use Jupyter Book to build website (or equivalent)
- Update BM5, BM6 and include BM9
- Small things
  - DOIs for benchmark notebooks with appropriate authors #1515
Aspirational goals
- Cloud-hosted submission notebook
- Increase data capabilities, metrics and display
  - Field data
- Expand beyond Zenodo

[Figure: submission workflow diagram. Labels: local FS (user, pfhub CLI, notebooks, pfhub.yaml, csv, VTK, ...); submission process; GitHub review (reviewer, pfhub CLI, Surge, Actions, website); hosted submission notebook.]

PFHub CLI

$ pfhub --help
Usage: pfhub [OPTIONS] COMMAND [ARGS]...

  Submit results to PFHub and manipulate PFHub data

Options:
  --help  Show this message and exit.

Commands:
  convert             Convert between formats (old PFHub schema to new...
  convert-to-old      Convert between formats (new PFHub schema to old...
  download            Download a PFHub record
  download-zenodo     Download a Zenodo record
  generate-notebook   Generate the comparison notebook for the...
  generate-yaml       Infer a PFHub YAML file from GitHub ID, ORCID,...
  submit              Submit to Zenodo and open PFHub PR
  submit-from-zenodo  Submit an existing Zenodo record to PFHub
  test                Run the PFHub tests
  upload              Upload PFHub data to Zenodo
  validate            Validate a YAML file with the new PFHub schema
  validate-old        Validate a YAML file with the old PFHub schema

  See the documentation at
  https://github.com/usnistgov/pfhub/blob/master/CLI.md (under construction)

What data do we currently collect?
- Provenance
- Benchmark ID
- Implementation repository
- Post-processed outputs
- Limited metadata
  - run time
  - memory usage
  - simulation time
- Limited hardware data
- Limited software data
- Dataframe style data / time series
  - time vs free energy

Data collection

Old schema

---
_id: 93113e00-0c5e-11e8-b653-4f1ed6519c85
benchmark:
  id: 3a
  version: '1'
data:
- name: run_time
  values:
  - sim_time: '1500'
    wall_time: '266576'
- name: memory_usage
  values:
  - unit: KB
    value: '2000000'
- name: efficiency
  transform:
  - as: x
    expr: "1. / datum.time_ratio"
    type: formula
  - as: y
    expr: datum.memory
    type: formula
  values:
  - memory: 2000000.0
    time_ratio: 0.005626
- description: Free energy versus time
  format:
    parse:
      free_energy: number
      time: number
    type: csv
  name: free_energy
  transform:
  - as: x
    expr: datum.time
    type: formula
  - as: y
    expr: datum.free_energy
    type: formula
  type: line
  url: https://gist.githubusercontent.com/wd15/7e06a3141a6fbf317b1daf39ef1b0fbb/raw/2b802a25593501b30cb0d8648a3b588dc54b36f7/time.csv
- description: Solid fraction versus time
  format:
    parse:
      solid_fraction: number
      time: number
    type: csv
  name: solid_fraction
  transform:
  - as: x
    expr: datum.time
    type: formula
  - as: y
    expr: datum.solid_fraction
    type: formula
  type: line
  url: https://gist.githubusercontent.com/wd15/7e06a3141a6fbf317b1daf39ef1b0fbb/raw/2b802a25593501b30cb0d8648a3b588dc54b36f7/time.csv
- description: Tip position versus time
  format:
    parse:
      time: number
      tip_position: number
    type: csv
  name: tip_position
  transform:
  - as: x
    expr: datum.time
    type: formula
  - as: y
    expr: datum.tip_position
    type: formula
  type: line
  url: https://gist.githubusercontent.com/wd15/7e06a3141a6fbf317b1daf39ef1b0fbb/raw/2b802a25593501b30cb0d8648a3b588dc54b36f7/time.csv
- description: Zero contour at t=1500s
  format:
    parse:
      x: number
      y: number
    type: csv
  name: phase_field_1500
  type: line
  url: https://gist.githubusercontent.com/wd15/7e06a3141a6fbf317b1daf39ef1b0fbb/raw/d0dcd61541604127a16c017891dcda1577c92997/contour.csv
date: 1518046097
layout: post
message: ' '
metadata:
  author:
    email: daniel.wheeler2@gmail.com
    first: Daniel
    github_id: wd15
    last: Wheeler
  hardware:
    acc_architecture: none
    clock_rate: '3.2'
    cores: '1'
    cpu_architecture: x86_64
    nodes: '1'
    parallel_model: serial
  implementation:
    container_url: ''
    name: fipy
    repo:
      url: https://gist.github.com/wd15/7e06a3141a6fbf317b1daf39ef1b0fbb
      version: fc9134b08a9c
  summary: FiPy implementation of benchmark 3a on a 960x960 grid. The shape of the
    dendrite doesn't look exactly like the version in the notebook.
  timestamp: 2 February, 2018
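
The `efficiency` entry in the old-schema record stores a `time_ratio` and two formula transforms. As a rough sketch of what those transforms compute, the snippet below reproduces the plotted point from the `run_time` and `memory_usage` values in the same record. It assumes that `time_ratio` is the simulated time divided by the wall-clock time, which is consistent with the stored value of 0.005626 but is not stated in the record itself. The function name `efficiency_point` is hypothetical.

```python
# Values copied from the old-schema record above.
sim_time = 1500.0       # run_time.sim_time
wall_time = 266576.0    # run_time.wall_time
memory_kb = 2000000.0   # memory_usage.value (unit: KB)


def efficiency_point(sim_time, wall_time, memory):
    """Return (time_ratio, x, y) as the `efficiency` transforms would plot them."""
    # Assumption: time_ratio is simulated time over wall-clock time.
    time_ratio = sim_time / wall_time
    x = 1.0 / time_ratio   # transform expr: "1. / datum.time_ratio"
    y = memory             # transform expr: datum.memory
    return time_ratio, x, y


time_ratio, x, y = efficiency_point(sim_time, wall_time, memory_kb)
print(f"time_ratio={time_ratio:.6f}, x={x:.1f}, y={y:.0f}")
```

Under that assumption, the x axis of the efficiency plot is wall-clock seconds per unit of simulated time and the y axis is memory in KB.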

New schema

id: fipy_1a_tkphd_pysparse
benchmark_problem: 1a.0
contributors:
- id: https://orcid.org/0000-0002-2920-8302
  name: Trevor Keller
  affiliation:
  - NIST
  email: trevor.keller@nist.gov
- id: https://orcid.org/0000-0002-2653-7418
  name: Daniel Wheeler
  affiliation:
  - NIST
  email: daniel.wheeler@nist.gov
date_created: '2017-01-10'
implementation:
  url: https://github.com/usnistgov/FiPy-spinodal-decomposition-benchmark/tree/master/periodic
results:
  fictive_time: 53333.3
  hardware:
    architecture: cpu
    cores: 1
    nodes: 1
  memory_in_kb: 28600
  time_in_s: 157187
  dataset_temporal:
  - name: free_energy.csv
    columns:
    - time
    - free_energy
schema:
  url: https://github.com/usnistgov/pfhub-schema/tree/e0010d9/project
summary: Serial Travis CI benchmark with FiPy, periodic domain
framework:
- url: https://www.ctcms.nist.gov/fipy/
  name: FiPy
  download: https://github.com/usnistgov/fipy
  version: 3.1.2
- url: https://github.com/usnistgov/steppyngstounes
  name: steppyngstounes
  download: https://github.com/usnistgov/steppyngstounes
  version: '0.0'
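
To make the difference between the two layouts concrete, here is a minimal sketch that maps a few old-schema fields onto new-schema names. It is not the logic behind `pfhub convert`. The function `old_to_new_results` is hypothetical, and the field correspondences are assumptions read off the two example records: `sim_time` to `fictive_time`, `wall_time` to `time_in_s`, and benchmark `id` plus `version` joined with a dot for `benchmark_problem`.

```python
def old_to_new_results(old):
    """Map a subset of an old-schema record onto new-schema field names.

    Illustrative only: the field correspondences are assumptions.
    """
    data = {item["name"]: item for item in old["data"]}
    run_time = data["run_time"]["values"][0]
    memory = data["memory_usage"]["values"][0]
    if memory["unit"] != "KB":
        raise ValueError("this sketch only handles memory reported in KB")
    hardware = old["metadata"]["hardware"]
    return {
        # Assumption: "<id>.<version>", by analogy with "1a.0" in the new record.
        "benchmark_problem": f'{old["benchmark"]["id"]}.{old["benchmark"]["version"]}',
        "results": {
            "fictive_time": float(run_time["sim_time"]),
            "time_in_s": float(run_time["wall_time"]),
            "memory_in_kb": float(memory["value"]),
            "hardware": {
                "cores": int(hardware["cores"]),
                "nodes": int(hardware["nodes"]),
            },
        },
    }


# Trimmed copy of the old-schema record shown earlier (strings kept as strings).
old_record = {
    "benchmark": {"id": "3a", "version": "1"},
    "data": [
        {"name": "run_time",
         "values": [{"sim_time": "1500", "wall_time": "266576"}]},
        {"name": "memory_usage",
         "values": [{"unit": "KB", "value": "2000000"}]},
    ],
    "metadata": {"hardware": {"cores": "1", "nodes": "1"}},
}

print(old_to_new_results(old_record))
```

The old record stores numbers as quoted strings inside plot-oriented `data` entries, while the new record keeps typed scalars under `results`, which is what makes direct filtering practical.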

Data queries

How can we currently query the data?

Plot the dendrite tip position for all results for a particular code
Show results only from a particular author
Show results that use >N nodes
Show results that use a GPU
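
The queries above can be sketched as plain filters over records that follow the new-schema layout. The three records here are hypothetical and contain only the fields the filters need. The value `gpu` for `architecture` is an assumption (the example record only shows `cpu`), and the helper names are illustrative, not part of the pfhub tool.

```python
# Hypothetical records using new-schema field names.
records = [
    {"id": "example_a",
     "contributors": [{"name": "Daniel Wheeler"}],
     "results": {"hardware": {"architecture": "cpu", "cores": 1, "nodes": 1}}},
    {"id": "example_b",
     "contributors": [{"name": "Trevor Keller"}],
     "results": {"hardware": {"architecture": "gpu", "cores": 8, "nodes": 2}}},
    {"id": "example_c",
     "contributors": [{"name": "Trevor Keller"}, {"name": "Daniel Wheeler"}],
     "results": {"hardware": {"architecture": "cpu", "cores": 32, "nodes": 4}}},
]


def by_author(records, name):
    """Results with a contributor whose name matches exactly."""
    return [r for r in records
            if any(c["name"] == name for c in r["contributors"])]


def more_than_n_nodes(records, n):
    """Results that ran on more than n nodes."""
    return [r for r in records if r["results"]["hardware"]["nodes"] > n]


def uses_gpu(records):
    """Results whose hardware architecture is recorded as 'gpu' (assumed value)."""
    return [r for r in records
            if r["results"]["hardware"]["architecture"] == "gpu"]


print([r["id"] for r in by_author(records, "Daniel Wheeler")])
print([r["id"] for r in more_than_n_nodes(records, 1)])
print([r["id"] for r in uses_gpu(records)])
```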

Better ways to query the data

Show dendrite curves for all finite difference methods
Show the transient free energy curve for all results with nominal O(h⁴) accuracy
Show the resource usage per nominal DOF
Characterize Ostwald ripening simulations by a length scale associated with the microstructure
Color data points in an efficiency plot based on numerical method or meshing strategy
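
Two of these queries, resource usage per nominal DOF and grouping by numerical method, could look like the sketch below once the metadata exists. The fields `nominal_dof` and `discretization` are hypothetical and not in the current schema, and all numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records; `nominal_dof` and `discretization` are not
# part of the current schema, and the values are made up.
records = [
    {"id": "example_fd", "discretization": "FD", "nominal_dof": 10000,
     "memory_in_kb": 20000, "time_in_s": 500},
    {"id": "example_fe", "discretization": "FE", "nominal_dof": 40000,
     "memory_in_kb": 120000, "time_in_s": 1000},
    {"id": "example_fd_fine", "discretization": "FD", "nominal_dof": 160000,
     "memory_in_kb": 320000, "time_in_s": 8000},
]


def usage_per_dof(record):
    """Memory and wall time normalized by the nominal degrees of freedom."""
    dof = record["nominal_dof"]
    return {
        "kb_per_dof": record["memory_in_kb"] / dof,
        "s_per_dof": record["time_in_s"] / dof,
    }


def group_by_method(records):
    """Group record ids by discretization, e.g. to choose plot colors."""
    groups = defaultdict(list)
    for record in records:
        groups[record["discretization"]].append(record["id"])
    return dict(groups)


for record in records:
    print(record["id"], usage_per_dof(record))
print(group_by_method(records))
```

Neither query is possible with the metadata collected today.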

Improve schema

What else should we collect?

Descriptions of discretization methods (FD, FV, FE, Spectral, ...)
Nominal order of accuracy, nominal DOF, meshing strategy
Description of linear solvers, preconditioners, non-linear strategy
Time stepping strategy (implicit vs. explicit)
Field variables at various times for statistical post-processing
Links to input files (rather than just the implementation repository)
Container (Docker build, Singularity build, Nix build)
What about the actual problem being solved?

Schema discussion

Could we spend some time right now collecting ideas?

Think about these four questions.

How can we improve the PFHub phase field schema?
What data and metadata should PFHub require?
How would you imagine querying the data? What questions would you ask?
What publication could you generate given better data / metadata?

Collect some ideas here: https://github.com/usnistgov/pfhub/discussions/1514

Guyer rant: let's ask the question about how to use the data rather than waste time redesigning schemas

pfhub-workshop-aug-2023

By Daniel Wheeler
