Reproducibility

using open data and open source

Ann Gledson, Research Software Engineer

The University of Manchester

https://slides.com/anngledson/data-software-sustainability

Overview

Open Science
Open data, open source and open methodology
Environment data-set example

Open Science

Open Data
Open Source
Open Methodology
Open Peer Review
Open Access
Open Educational Resources

For an overview, see the Open Science webinar by Helen Glaves (British Geological Survey) on the NERC 'Constructing a Digital Environment' channel.

The nature of research

Research projects have short lifespans
Long-term ideas and goals
Non-continuous process
- Funding
- Loss of key developers / analysts
- Changes in scope or delays
- Changes in storage devices
Difficult to plan resources
- Data storage
- Code re-use

Open Data

Reduce repetition of data cleaning and wrangling tasks
Focus more on key research questions
Reproducibility -> Trust!
Visibility:
- E.g. Nature Scientific Data journal
- Citations
Collaboration, communities and networking
Funding opportunities

Data wrangling

FAIR data principles

Nature Scientific Data: principles

Credit: Scientists who share their data in a FAIR manner deserve appropriate credit.

Re-use: Standardized and detailed descriptions make data easier to find and reuse.

Quality: Critical evaluation is needed to verify experimental rigour.

Discovery: Scientists should be able to easily find datasets that are relevant.

Open: Scientists work best when they can easily connect and collaborate.

Service: Committed to providing excellent service to both authors and readers.

(Full version: https://www.nature.com/sdata/about/principles)

Open Source

Reduce repetition of coding effort
Community of developers
- Testing
- Issue tracking
- Pull requests
Reproducibility
- GitHub releases: snapshots of code
Visibility:
- Software citation in GitHub
- Findability

Gitflow branch types

Master branch holds the public release code. Each version on master is tagged with a semantic version number (https://semver.org/). Code in master works.
Release branch (if present) is temporary and is used to prepare a version release. (Alternatively, prepare the release on develop.)
Commits on develop are stable and should work, but some unanticipated behaviour may be present.
Feature branches are temporary*. Commits on these may be broken/unstable.

Gitflow release process

Run the tests on the develop branch: before making a release, make sure that all tests pass (this can be run as a CI job).
(optional) Create an issue that describes what has changed since the last release version. Make sure everyone agrees that the behaviour is as expected. Link to the passing tests.
Merge develop into master and tag with the version number.
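
The tagging step depends on choosing the next semantic version number. As a minimal sketch, assuming plain MAJOR.MINOR.PATCH tags with no pre-release or build suffixes, the helper below shows how the three parts change under the rules at https://semver.org/. The name `bump_version` is hypothetical and is not part of any Gitflow tooling.

```python
def bump_version(version: str, part: str) -> str:
    """Return the next semantic version after bumping the given part.

    Assumes a plain "MAJOR.MINOR.PATCH" string (an assumption of this sketch).
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Incompatible changes: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards-compatible features: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards-compatible bug fixes.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")
```

For example, a release that only adds backwards-compatible features to version 1.4.2 would be tagged with `bump_version("1.4.2", "minor")`.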

Open Methodology

Data analysis apps
- R Shiny apps
- Jupyter Notebook
- Streamlit
- Plotly Dash
- Bokeh
Computational Workflows
- FAIR workflows
- Carole Goble UoM work (see links)

Example

Environment data

Automatic Urban and Rural Network (AURN)
- NOx, SO2, O3, NO2, PM10, PM2.5
Medical and Environmental Data (Mash-up) Infrastructure (MEDMI)
- Meteorological: temp, pressure, dewpoint temp, relative humidity
- Pollens: alnus, ambrosia, artemisia, ..., urtica
European Monitoring and Evaluation Programme (EMEP)
- Model forecast data
- NOx, NO2, SO2, O3, PM10, PM2.5
Complex extraction process:
- Multiple data sources
- Missing data (e.g. sensor down-time)
- Variable UK area coverage

Cleaning and Imputation

Remove duplicate/unphysical values
Select sites by minimum temporal data coverage
scikit-learn (Python) used to impute missing values in the hourly time series
Imputation method
- Bayesian Ridge
- Quantile Transformer preprocessing
Final data: daily mean / maximum values (or simple daily count)
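
The imputation step above could be sketched as follows. The slide names scikit-learn, Bayesian Ridge and Quantile Transformer preprocessing. Combining them through `IterativeImputer`, the function name `impute_hourly` and all parameter values are assumptions made for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import QuantileTransformer


def impute_hourly(values, random_state=0):
    """Fill NaNs in an (hours x variables) array.

    Sketch only: each column is one hourly series (e.g. one pollutant or site).
    """
    values = np.asarray(values, dtype=float)

    # Map each column to an approximately normal distribution.
    # QuantileTransformer ignores NaNs when fitting and keeps them when transforming.
    transformer = QuantileTransformer(
        n_quantiles=min(1000, values.shape[0]),
        output_distribution="normal",
        random_state=random_state,
    )
    transformed = transformer.fit_transform(values)

    # Model each column from the others with Bayesian ridge regression.
    imputer = IterativeImputer(
        estimator=BayesianRidge(), max_iter=10, random_state=random_state
    )
    filled = imputer.fit_transform(transformed)

    # Return to the original measurement scale.
    return transformer.inverse_transform(filled)
```

The returned array has the same shape as the input, with observed values kept and gaps filled. Daily means or maxima can then be computed from the completed hourly series.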

Regional estimations

Diffusion method illustrated on fictional postcode regions

Regions where sensors exist: take mean
Regions with no sensors: take mean of surrounding regions
- Working outwards until sensors found
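
The two rules above can be sketched as a ring-by-ring search over neighbouring regions. This is an illustrative sketch, not the API of the region_estimators tool. The function name `estimate_region`, the dictionary inputs and the choice to average all sensor values found in the first ring that has any are assumptions.

```python
from statistics import mean


def estimate_region(region, neighbours, sensor_values):
    """Estimate a value for one region.

    neighbours: dict mapping region -> iterable of adjacent regions.
    sensor_values: dict mapping region -> list of sensor readings.
    Returns None when no sensor can be reached from the region.
    """
    own = sensor_values.get(region, [])
    if own:
        # Region has sensors: take their mean.
        return mean(own)

    visited = {region}
    ring = {region}
    while ring:
        # Step outwards to the next ring of unvisited neighbours.
        next_ring = set()
        for current in ring:
            for neighbour in neighbours.get(current, ()):
                if neighbour not in visited:
                    next_ring.add(neighbour)
        visited |= next_ring

        # Stop at the first ring that contains any sensor readings.
        found = [v for r in sorted(next_ring) for v in sensor_values.get(r, [])]
        if found:
            return mean(found)
        ring = next_ring
    return None
```

With a chain of regions A-B-C-D and sensors only in A and D, region B would take its value from A and region C from D, because those are the nearest rings with sensors.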

Linking GitHub version to DOI

Shared Data

Environment Dataset Links

2016-2019 environment datasets:
- measurements (original and imputed)
  - https://zenodo.org/record/4416028
  - includes link to extraction and imputation tool set
- regional estimations (from original and imputed)
  - https://zenodo.org/record/4475652
  - includes link to region_estimators tool
- Scientific Data paper:
  - https://www.nature.com/articles/s41597-022-01135-6
Visualisation Tool:
- http://minethegaps.manchester.ac.uk/
- https://github.com/UoMResearchIT/mine-the-gaps

Data and Software Sustainability

By Ann Gledson
