Reproducibility

using open data and open source 

 

Ann Gledson, Research Software Engineer

The University of Manchester
 

https://slides.com/anngledson/data-software-sustainability

Overview

  • Open Science
     
  • Open data, open source and open methodology
     
  • Environment data-set example

Open Science

  • Open Data
  • Open Source
  • Open Methodology
  • Open Peer Review
  • Open Access
  • Open Educational Resources

For an overview, see the NERC 'Constructing a Digital Environment' channel: Open Science webinar by Helen Glaves (British Geological Survey)

The nature of research

  • Research projects have short lifespans
  • Long-term ideas and goals
  • Non-continuous process
    • Funding
    • Loss of key developers / analysts
    • Changes in scope or delays
    • Changes in storage devices
  • Difficult to plan resources
    • Data storage
    • Code re-use

Open Data

  • Reduce repetition of data cleaning and wrangling tasks
  • Focus more on key research questions
  • Reproducibility -> Trust!
  • Visibility:
    • E.g. Nature Scientific Data journal
    • Citations
  • Collaboration, communities and networking 
  • Funding opportunities

Data wrangling

FAIR data principles

Nature Scientific Data: principles

Credit: Scientists who share their data in a FAIR manner deserve appropriate credit. 

Re-use: Standardized and detailed descriptions make data easier to find and reuse. 

Quality: Critical evaluation is needed to verify experimental rigour.

Discovery: Scientists should be able to easily find datasets that are relevant. 

Open: Scientists work best when they can easily connect and collaborate.

Service: Committed to providing excellent service to both authors and readers.

(Full version: https://www.nature.com/sdata/about/principles)

Open Source

  • Reduce repetition of coding effort
  • Community of developers
    • Testing
    • Issue tracking
    • Pull requests
  • Reproducibility 
    • GitHub releases: snapshots of code
  • Visibility:
    • Software citation in GitHub
    • Findability

Gitflow branch types

  • Master branch is public release code. Each version on master is tagged with a semantic version number (https://semver.org/). Code in master works.
     
  • Release branch (if present) is temporary and is used to prepare a version release. (Or use develop.)
     
  • Commits on develop are stable and should work, but some unexpected behaviour may be present.
     
  • Feature branches are temporary*. Commits on these may be broken/unstable.

Gitflow release process

  • Run the tests on the develop branch: before making a release, make sure that all tests are passing (this can be run as a CI job).
     
  • (optional) Create an issue that describes what changes have been made since the last release version. Make sure everyone agrees that behaviour is as expected. Link to passing tests.
     
  • Merge develop into master and tag with the version number.
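
Choosing the version number for the tag can be sketched in a few lines of Python. This is a minimal illustration of the MAJOR.MINOR.PATCH rule from https://semver.org/ only: the names `Version`, `parse_version` and `next_version` are hypothetical and not part of any tool mentioned in these slides, and pre-release or build-metadata suffixes are deliberately not handled.

```python
import re
from typing import NamedTuple

# Only the core MAJOR.MINOR.PATCH form is accepted (an optional leading "v"
# is tolerated because release tags are often written as "v1.2.3").
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse_version(tag: str) -> Version:
    """Parse a tag such as '1.4.2' or 'v1.4.2' into a comparable Version."""
    match = _VERSION_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {tag!r}")
    return Version(*(int(part) for part in match.groups()))


def next_version(current: str, change: str) -> str:
    """Return the next version string for a 'major', 'minor' or 'patch' change.

    major: incompatible changes (minor and patch reset to 0)
    minor: backwards-compatible new functionality (patch resets to 0)
    patch: backwards-compatible bug fixes
    """
    v = parse_version(current)
    if change == "major":
        return str(Version(v.major + 1, 0, 0))
    if change == "minor":
        return str(Version(v.major, v.minor + 1, 0))
    if change == "patch":
        return str(Version(v.major, v.minor, v.patch + 1))
    raise ValueError(f"change must be 'major', 'minor' or 'patch', got {change!r}")
```

For example, `next_version("1.4.2", "minor")` gives `"1.5.0"`. Because `Version` is a tuple of integers, versions compare numerically, so `parse_version("1.10.0") > parse_version("1.9.3")` holds where a plain string comparison would not.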

Open Methodology

  • Data analysis apps
    • R Shiny apps
    • Jupyter Notebook
    • Streamlit
    • Plotly Dash
    • Bokeh
       
  • Computational Workflows
    • FAIR workflows
    • Work by Carole Goble, UoM (see Links)

Example

Environment data

  • Automatic Urban and Rural Network (AURN)
    • NOx, SO2, O3, NO2, PM10, PM2.5
  • Medical and Environmental Data (Mash-up) Infrastructure (MEDMI)
    • Meteorological: temp, pressure, dewpoint temp, relative humidity
    • Pollens: alnus, ambrosia, artemisia, ..., urtica
  • European Monitoring and Evaluation Programme (EMEP)
    • Model forecast data
    • NOx, NO2, SO2, O3, PM10, PM2.5
  • Complex extraction process:
    • Multiple data sources
    • Missing data (e.g. sensor down-time)
    • Variable UK area coverage

Cleaning and Imputation 

  • Remove duplicate/unphysical values
  • Select sites by minimum temporal data coverage
  • Missing data imputed from hourly time series using scikit-learn (Python)
  • Imputation method
    • Bayesian Ridge
    • Quantile Transformer preprocessing
  • Final data: daily mean / maximum values (or simple daily count)
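
The imputation step can be sketched as below. This is a minimal illustration, not the project's actual tool set: the slides name only scikit-learn, Bayesian Ridge and Quantile Transformer preprocessing, so the use of `IterativeImputer` to drive the Bayesian Ridge model is an assumption, as are the function names `impute_hourly` and `make_demo_data`, all parameter values, and the synthetic data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import QuantileTransformer


def impute_hourly(values: np.ndarray, random_state: int = 0) -> np.ndarray:
    """Fill NaN gaps in an (hours x series) array of hourly readings.

    Assumed pipeline: quantile-transform each series towards a normal
    distribution, impute with Bayesian Ridge regression on the other
    series, then map the result back to the original units.
    """
    n_samples = values.shape[0]
    transformer = QuantileTransformer(
        n_quantiles=min(100, n_samples),  # cannot usefully exceed the sample count
        output_distribution="normal",
        random_state=random_state,
    )
    # QuantileTransformer ignores NaNs when fitting and keeps them in place.
    transformed = transformer.fit_transform(values)

    imputer = IterativeImputer(
        estimator=BayesianRidge(),
        max_iter=10,
        random_state=random_state,
    )
    filled = imputer.fit_transform(transformed)

    return transformer.inverse_transform(filled)


def make_demo_data(n_hours: int = 240, seed: int = 0) -> np.ndarray:
    """Made-up hourly readings: three correlated series with about 10% gaps."""
    rng = np.random.default_rng(seed)
    hours = np.arange(n_hours)
    daily_cycle = 10.0 + 5.0 * np.sin(2.0 * np.pi * hours / 24.0)
    data = np.column_stack([
        daily_cycle + rng.normal(0.0, 0.5, n_hours),
        2.0 * daily_cycle + rng.normal(0.0, 0.5, n_hours),
        0.5 * daily_cycle + rng.normal(0.0, 0.5, n_hours),
    ])
    data[rng.random(data.shape) < 0.1] = np.nan  # simulate sensor down-time
    return data
```

Calling `impute_hourly(make_demo_data())` returns an array of the same shape with the gaps filled. A series that is missing entirely would be dropped by `IterativeImputer`, which is one reason to select sites by minimum temporal coverage before imputing.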

Regional estimations 

Diffusion method illustrated on fictional postcode regions

  • Regions where sensors exist: take mean
  • Regions with no sensors: take mean of surrounding regions
    • Working outwards until sensors found
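
The diffusion rule above can be sketched in plain Python. This is an illustration under stated assumptions, not the region_estimators tool: the function name `estimate_regions`, the adjacency-dictionary input and the region labels are hypothetical. It also assumes that "mean of surrounding regions" means the mean of the per-region means in the nearest ring of neighbours that contains at least one sensor, and that every region appears as a key of `neighbours`.

```python
from statistics import mean
from typing import Dict, List, Optional


def estimate_regions(
    sensor_values: Dict[str, List[float]],
    neighbours: Dict[str, List[str]],
) -> Dict[str, Optional[float]]:
    """Estimate one value per region from sparse sensor readings.

    sensor_values: readings for each region that has sensors.
    neighbours: adjacency list; every region must appear as a key.
    Returns None for a region from which no sensor can be reached.
    """
    # Regions where sensors exist: take the mean of their readings.
    region_means = {r: mean(v) for r, v in sensor_values.items() if v}

    estimates: Dict[str, Optional[float]] = {}
    for region in neighbours:
        if region in region_means:
            estimates[region] = region_means[region]
            continue

        # Regions with no sensors: look at the surrounding ring of regions,
        # then the ring around that, until some region with sensors is found.
        visited = {region}
        ring = set(neighbours[region]) - visited
        estimate: Optional[float] = None
        while ring:
            found = [region_means[r] for r in ring if r in region_means]
            if found:
                estimate = mean(found)
                break
            visited |= ring
            ring = {n for r in ring for n in neighbours.get(r, ())} - visited
        estimates[region] = estimate
    return estimates


# Made-up example: a chain of regions A-B-C-D-E plus an isolated region F.
demo_neighbours = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D"], "F": [],
}
demo_sensors = {"A": [10.0, 20.0], "E": [40.0]}
demo_estimates = estimate_regions(demo_sensors, demo_neighbours)
```

In the made-up example, region C has no sensors and neither do its direct neighbours B and D, so the search moves out one more ring to A and E and averages their means. Region F has no neighbours at all, so its estimate is `None`.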

Linking GitHub version to DOI

Shared Data

Links

  • FAIR Data: https://www.go-fair.org/fair-principles/

  • Git/GitHub Tutorial: http://gcapes.github.io/git-course/

  • Gitflow Workflow: https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow

  • Semantic version numbers: https://semver.org/

  • Zenodo (DOIs): https://zenodo.org/

  • Linking GitHub to DOI: https://guides.github.com/activities/citable-code/

  • Open data webinar - Helen Glaves (BGS):

    • https://www.youtube.com/channel/UCv8vRIuTxCP-DgNMCq9KxqA/videos

  • Computational Workflows

    • www.slideshare.net/carolegoble/fair-computational-workflows-249721518

Environment Dataset Links

  • 2016-2019 environment datasets:
    • measurements (original and imputed)
      • https://zenodo.org/record/4416028
      • includes link to extraction and imputation tool set
    • regional estimations (from original and imputed)
      • https://zenodo.org/record/4475652
      • includes link to region_estimators tool
    • Scientific Data paper:
      • https://www.nature.com/articles/s41597-022-01135-6
  • Visualisation Tool:
    • http://minethegaps.manchester.ac.uk/
    • https://github.com/UoMResearchIT/mine-the-gaps

Data and Software Sustainability

By Ann Gledson
