
Reproducibility
using open data and open source
Ann Gledson, Research Software Engineer
The University of Manchester
 
https://slides.com/anngledson/data-software-sustainability


Overview

- Open Science
 - Open data, open source and open methodology
 - Environment data-set example
 

Open Science

- Open Data
 - Open Source
 - Open Methodology
 - Open Peer Review
 - Open Access
 - Open Educational Resources
 
For an overview, see the Open Science webinar by Helen Glaves (British Geological Survey) on the NERC 'Constructing a Digital Environment' channel.

The nature of research





- Research projects have short lifespans
- Long-term ideas and goals
- Non-continuous process
    - Funding
    - Loss of key developers / analysts
    - Changes in scope or delays
    - Changes in storage devices
- Difficult to plan resources
    - Data storage
    - Code re-use
 
 

Open Data

- Reduce repetition of data cleaning and wrangling tasks
- Focus more on key research questions
- Reproducibility -> Trust!
- Visibility:
    - E.g. Nature Scientific Data journal
    - Citations
- Collaboration, communities and networking
- Funding opportunities
 

Data wrangling



FAIR data principles



Nature Scientific Data: principles

Credit: Scientists who share their data in a FAIR manner deserve appropriate credit.
Re-use: Standardized and detailed descriptions make data easier to find and reuse.
Quality: Critical evaluation is needed to verify experimental rigour.
Discovery: Scientists should be able to easily find datasets that are relevant.
Open: Scientists work best when they can easily connect and collaborate.
Service: Committed to providing excellent service to both authors and readers.
(Full version: https://www.nature.com/sdata/about/principles)

Open Source

- Reduce repetition of coding effort
- Community of developers
    - Testing
    - Issue tracking
    - Pull requests
- Reproducibility
    - GitHub releases: snapshots of code
- Visibility:
    - Software citation in GitHub
    - Findability
 
 





Gitflow branch types

- Master branch is the public release code. Each version on master is tagged with a semantic version number (https://semver.org/). Code in master works.
- Release branch (if present) is temporary and is used to prepare a version release (alternatively, prepare the release on develop).
- Commits on develop are stable and should work, but some unanticipated behaviour may be present.
- Feature branches are temporary*. Commits on these may be broken/unstable.


Gitflow release process
- Tests run on develop branch – before making a release, make sure that all tests are passing (can be run as a CI job).
 - (optional) Create an issue that describes what changes have been made since last release version. Make sure everyone agrees that behaviour is as expected. Link to passing tests.
 - Merge develop into master and tag with the version number.
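
The version number used for the release tag follows the MAJOR.MINOR.PATCH scheme from https://semver.org/. The sketch below shows how the next tag could be derived from the previous one. The function name `bump_version` is hypothetical, an optional leading "v" is accepted only because it is a common tag prefix, and pre-release and build-metadata suffixes are deliberately not handled.

```python
import re

# Core semantic version: MAJOR.MINOR.PATCH, no leading zeros.
# The optional "v" prefix is a tagging habit, not part of the semver spec.
SEMVER_CORE = re.compile(r"^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def bump_version(version, part):
    """Return the next version string after bumping 'major', 'minor' or 'patch'."""
    match = SEMVER_CORE.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":   # incompatible API change: reset minor and patch
        return f"{major + 1}.0.0"
    if part == "minor":   # backwards-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if part == "patch":   # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

For example, a release that only fixes bugs after version 1.4.2 would be tagged using `bump_version("1.4.2", "patch")`.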
 




Open Methodology
- Data analysis apps
    - R Shiny apps
    - Jupyter Notebook
    - Streamlit
    - Plotly Dash
    - Bokeh
- Computational workflows
    - FAIR workflows
    - Carole Goble UoM work (see links)
 
 

Example



Environment data

- Automatic Urban and Rural Network (AURN)
    - NOx, SO2, O3, NO2, PM10, PM2.5
- Medical and Environmental Data (Mash-up) Infrastructure (MEDMI)
    - Meteorological: temperature, pressure, dewpoint temperature, relative humidity
    - Pollens: alnus, ambrosia, artemisia, ..., urtica
- European Monitoring and Evaluation Programme (EMEP)
    - Model forecast data
    - NOx, NO2, SO2, O3, PM10, PM2.5
- Complex extraction process:
    - Multiple data sources
    - Missing data (e.g. sensor down-time)
    - Variable UK area coverage
 
 


Cleaning and Imputation
- Remove duplicate/unphysical values
- Select sites by minimum temporal data coverage
- scikit-learn (Python) used to impute missing data using hourly time series
- Imputation method
    - Bayesian Ridge
    - Quantile Transformer preprocessing
- Final data: daily mean / maximum values (or simple daily count)
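
The imputation and daily-summary steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the published tool set. It assumes the hourly data is a 2-D array (rows are hours, columns are sites or variables) with NaN for missing values, and that `IterativeImputer` is the wrapper around the Bayesian Ridge estimator. The function names `impute_hourly` and `daily_summary` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import QuantileTransformer


def impute_hourly(values, random_state=0):
    """Fill NaNs in an (hours x series) array; observed values are left untouched."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)

    # Quantile-transform each column towards a normal distribution.
    # NaNs are ignored when fitting and preserved by the transform.
    transformer = QuantileTransformer(
        n_quantiles=min(1000, values.shape[0]),
        output_distribution="normal",
        random_state=random_state,
    )
    transformed = transformer.fit_transform(values)

    # Model each column from the others with Bayesian Ridge regression.
    imputer = IterativeImputer(
        estimator=BayesianRidge(), max_iter=10, random_state=random_state
    )
    filled = imputer.fit_transform(transformed)

    # Map back to the original units and restore the observed measurements exactly.
    result = transformer.inverse_transform(filled)
    result[~missing] = values[~missing]
    return result


def daily_summary(hourly):
    """Return (daily mean, daily maximum) from complete 24-hour blocks of hourly rows."""
    hourly = np.asarray(hourly, dtype=float)
    n_days = hourly.shape[0] // 24
    days = hourly[: n_days * 24].reshape(n_days, 24, hourly.shape[1])
    return days.mean(axis=1), days.max(axis=1)
```

Restoring the observed values after the inverse transform is a design choice in this sketch: it guarantees that only the gaps are changed by the imputation.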
 






Regional estimations

Diffusion method illustrated on fictional postcode regions
- Regions where sensors exist: take mean
- Regions with no sensors: take mean of surrounding regions
    - Working outwards until sensors found
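
The diffusion idea above can be sketched on a small region graph. This is an illustrative reading of the method, not the API of the region_estimators tool. The names `diffusion_estimate`, `neighbours` and `readings` are hypothetical, and the sketch assumes that "mean of surrounding regions" means the mean of all sensor readings found in the nearest ring of regions that contains any sensor.

```python
from statistics import mean


def diffusion_estimate(region, neighbours, readings):
    """Estimate a value for `region`.

    neighbours: dict mapping a region to the regions it borders.
    readings:   dict mapping a region to its list of sensor values.
    Returns None if no sensor is reachable from the region.
    """
    # Regions where sensors exist: take the mean of their own sensors.
    if readings.get(region):
        return mean(readings[region])

    # Regions with no sensors: work outwards one ring of neighbours at a time.
    visited = {region}
    ring = {region}
    while True:
        next_ring = {
            adjacent
            for current in ring
            for adjacent in neighbours.get(current, [])
            if adjacent not in visited
        }
        if not next_ring:
            return None  # ran out of regions without finding a sensor
        values = [v for r in sorted(next_ring) for v in readings.get(r, [])]
        if values:
            return mean(values)
        visited |= next_ring
        ring = next_ring
```

With four regions in a row (A, B, C, D) and sensors only in A and D, region B borders A and C, so its estimate comes from A's sensors, while C's comes from D's.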
 
 




Linking GitHub version to DOI


Shared Data


Links

- FAIR Data: https://www.go-fair.org/fair-principles/
- Git/GitHub Tutorial: http://gcapes.github.io/git-course/
- Gitflow Workflow: https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow
- Semantic version numbers: https://semver.org/
- Zenodo (DOIs): https://zenodo.org/
- Linking GitHub to DOI: https://guides.github.com/activities/citable-code/
- Open data webinar - Helen Glaves (BGS):
    - https://www.youtube.com/channel/UCv8vRIuTxCP-DgNMCq9KxqA/videos
- Computational Workflows
    - www.slideshare.net/carolegoble/fair-computational-workflows-249721518
 

Environment Dataset Links

- 2016-2019 environment datasets:
    - measurements (original and imputed)
        - https://zenodo.org/record/4416028
        - includes link to extraction and imputation tool set
    - regional estimations (from original and imputed)
        - https://zenodo.org/record/4475652
        - includes link to region_estimators tool
    - Scientific Data paper:
        - https://www.nature.com/articles/s41597-022-01135-6
- Visualisation Tool:
    - http://minethegaps.manchester.ac.uk/
    - https://github.com/UoMResearchIT/mine-the-gaps
 
 
Data and Software Sustainability
By Ann Gledson
 


