MADRaT
May all data be reproducible and transparent
Jan Philipp Dietrich
dietrich@pik-potsdam.de
Data processing via MADRaT
powered by madrat » github.com/pik-piam/madrat
R-based data processing framework
Open Source
BSD-2 license
Facilitate reproducibility through sharing of data processing workflows
Make workflow snippets reusable through standardization
Improve robustness through integrated testing
Increase transparency through automatic metadata generation and handling
Reproducibility
Sharing the data of a study is important
Sharing the code which produced the data is even more important
# What this could look like for
# a collection of data
install.packages("mrMyPaper")
library(mrMyPaper)
retrieveData("MyPaperData")
Reusability
Build on existing code instead of existing data
Use existing pieces in new workflows
Robustness
Testing helps to identify problems at an early stage
Are the values within the expected range?
Is the data in the expected format?
Does the data contain NAs?
Run calcOutput(type = "ValidLandChange", datasource = "SSPResults")
> Run calcOutput(type = "ValidLand", datasource = datasource, aggregate = FALSE)
>> - force cache >> /p/projects/rd3mod/inputdata/cache/rev4.52jpdtest6/calcValidLand-16524e79d5da215a1c1ab88617221e82.rds
>> WARNING: Data returned by mrvalidation:::calcValidLand(...) contains values smaller than the predefined minimum (min = 0 )
> Exit calcOutput(type = "ValidLand", datasource = datasource, aggregate = FALSE) in 0.2 seconds
Exit calcOutput(type = "ValidLandChange", datasource = "SSPResults") in 0.82 seconds
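The three questions above can be turned into automated checks. The following is a minimal sketch in base R, not madrat's actual implementation: the helper name `checkData` and its arguments are hypothetical and only illustrate how range, format and NA checks could be reported as warnings rather than hard errors.

```r
# Hypothetical helper (not part of madrat): run basic sanity checks on a
# result and collect every problem found instead of stopping at the first one.
checkData <- function(x, min = -Inf, max = Inf) {
  problems <- character(0)
  if (!is.numeric(x)) {
    # Format check: the data is expected to be numeric
    problems <- c(problems, "data is not numeric")
  } else {
    # NA check
    if (anyNA(x)) problems <- c(problems, "data contains NAs")
    # Range checks (NAs are ignored here, they are reported separately)
    if (any(x < min, na.rm = TRUE)) {
      problems <- c(problems,
                    paste0("values smaller than the predefined minimum (min = ", min, ")"))
    }
    if (any(x > max, na.rm = TRUE)) {
      problems <- c(problems,
                    paste0("values greater than the predefined maximum (max = ", max, ")"))
    }
  }
  # Report each problem as a warning so that processing can continue
  for (p in problems) warning(p, call. = FALSE)
  invisible(problems)
}

# Toy values (made up for illustration): one negative and one missing entry
landArea <- c(12.5, -0.3, NA, 4.1)
problems <- suppressWarnings(checkData(landArea, min = 0))
print(problems)
```

Reporting problems as warnings keeps the workflow running while still leaving a trace in the log, similar in spirit to the WARNING line in the log excerpt above.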
Transparency
Metadata helps to put data into context
* description: Pasture yields
* unit: ton DM per ha
* origin: calcOutput(type = "PastureYield", file = "f14_pasture_yields_hist.csv", round = 3) (madrat 1.82.0 | mrland 0.3.2)
* creation date: Wed Sep 9 10:13:24 2020
dummy,LAM,OAS,SSA,EUR,NEU,MEA,REF,CAZ,CHA,IND,JPN,USA
y1965,1.182,1.225,0.47,6.59,5.091,0.408,0.859,0.556,1.639,20.865,12.897,1.438
y1970,1.234,1.28,0.49,6.745,5.245,0.424,0.861,0.577,1.564,21.572,13.416,1.54
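Because the metadata travels in the same file as the data, a consumer can separate the two again when reading. The sketch below assumes that metadata lines start with "* " followed by "key: value", as in the excerpt above. The helper `readWithHeader` is hypothetical and not a madrat function, and the sample uses only the first three regions of the excerpt.

```r
# Hypothetical helper (not part of madrat): split lines with "* key: value"
# metadata entries into a metadata list and a data frame.
readWithHeader <- function(lines) {
  isMeta <- grepl("^\\* ", lines)
  meta <- sub("^\\* ", "", lines[isMeta])
  # Everything before the first colon is the key, the rest is the value
  keys <- sub(":.*$", "", meta)
  values <- trimws(sub("^[^:]*:", "", meta))
  data <- read.csv(text = paste(lines[!isMeta], collapse = "\n"),
                   check.names = FALSE)
  list(metadata = as.list(setNames(values, keys)), data = data)
}

# Subset of the example file shown above (first three regions only)
lines <- c(
  "* description: Pasture yields",
  "* unit: ton DM per ha",
  "dummy,LAM,OAS,SSA",
  "y1965,1.182,1.225,0.47",
  "y1970,1.234,1.28,0.49"
)
result <- readWithHeader(lines)
print(result$metadata$unit)
print(result$data)
```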
title: Tau Factor (historic trends)
description: Historic land use intensity (tau) development
author:
- given: Jan Philipp
  family: Dietrich
  role: ~
  email: dietrich@pik-potsdam.de
  comment: https://orcid.org/0000-0002-4309-6431
doi: 10.5281/zenodo.4282548
url: https://zenodo.org/record/4282548/files/tau-historical.zip
accessibility: gold
license: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA
  4.0)
version: '1.0'
release_date: '2012-05-10'
unit: '1'
call:
  origin: downloadSource(type = type, subtype = subtype) -> madrat:::downloadTau(subtype=subtype)
    (madrat 1.93.3 | madrat 1.93.3)
  type: Tau
  subtype: historical
  time: 2021-02-23 11:20:34 CET
reference:
- title: Measuring agricultural land-use intensity - A global analysis using a model-assisted
    approach
  author:
  - given: Jan Philipp
    family: Dietrich
    role: ~
    email: dietrich@pik-potsdam.de
    comment: https://orcid.org/0000-0002-4309-6431
  - given: Christoph
    family: Schmitz
    role: ~
    email: ~
    comment: ~
  - given: Christoph
    family: Mueller
    role: ~
    email: ~
    comment: ~
  - given: Marianela
    family: Fader
    role: ~
    email: ~
    comment: ~
  - given: Hermann
    family: Lotze-Campen
    role: ~
    email: ~
    comment: ~
  - given: Alexander
    family: Popp
    role: ~
    email: ~
    comment: ~
  year: '2012'
  journal: Ecological Modelling
  volume: '232'
  pages: 109-118
  url: https://doi.org/10.1016/j.ecolmodel.2012.03.002
  doi: 10.1016/j.ecolmodel.2012.03.002
Structured data processing
retrieveData: bundle data sets
readSource: download and read source data
calcOutput: perform calculations on the data (filtering, merging,...)
data processing split into distinct steps
wrappers provide a controlled environment for user-written code
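To illustrate the wrapper idea in general terms, here is a minimal base R sketch. It is not madrat's implementation: `runWrapped` and `calcDouble` are hypothetical names, and the sketch only shows how a wrapper can take over warning capture, timing and result checks so that user code can focus on the calculation itself.

```r
# Hypothetical wrapper (illustration only, not madrat's implementation):
# runs a user-written function, records its warnings, measures the runtime
# and checks the result before handing it back.
runWrapped <- function(userFunction, ...) {
  warningsSeen <- character(0)
  start <- Sys.time()
  result <- withCallingHandlers(
    userFunction(...),
    warning = function(w) {
      # Record the warning and continue instead of printing it immediately
      warningsSeen <<- c(warningsSeen, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  runtime <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  # Enforce a simple contract on what user code may return
  if (!is.numeric(result)) stop("user code must return numeric data")
  list(data = result, warnings = warningsSeen, runtime = runtime)
}

# User code only has to do the actual calculation
calcDouble <- function(x) {
  if (any(x < 0)) warning("negative input values")
  2 * x
}

out <- runWrapped(calcDouble, c(1, -2, 3))
print(out$data)
print(out$warnings)
```

Keeping the checks in the wrapper means they apply uniformly to every piece of user code without having to be repeated there.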
[Figure: structured data processing (wrapper around user code) compared with a blackbox script]
Building workflows
use building blocks to create data processing workflows
remix existing workflows
reuse snippets from other workflows
Analyzing MADRaT network
a <- getMadratInfo()
.:: Check network size ::.
[INFO] 176 read functions (called 420 times, 2.39 calls on average)
[INFO] 433 calc functions (called 994 times, 2.3 calls on average)
[INFO] 52 tool functions (called 778 times, 14.96 calls on average)
[INFO] 13 retrieve functions (triggering 277 calls, 21.31 calls on average)
...
findBottlenecks("log.txt")
Total runtime: 2 hours 6 minutes 18 seconds
     level class type               time[min] time[%] net[min] net[%]
562      4 calc  FAOForestRelocate      33.04   26.16    32.49  25.72
181      3 read  LUH2v2                 14.29   11.32    14.29  11.32
123      2 calc  FAOmassbalance_pre     11.22    8.88     8.60   6.81
560      6 read  LPJmL                   7.46    5.91     7.46   5.91
177      3 read  IEA                     7.42    5.88     7.37   5.83
95       1 read  Lutz2014                4.62    3.66     4.61   3.65
110      4 read  FAO                     4.52    3.58     4.50   3.56
1224     1 read  ISIMIP                  3.84    3.04     3.84   3.04
990      3 read  MAgPIE                  3.59    2.84     3.49   2.76
201      0 calc  BodyHeight              3.51    2.78     3.46   2.74
...
getDependencies("calcBodyHeight")
func type package
1 calcDemography calc mrcommons
2 calcPopulation calc mrcommons
3 calcGDPpc calc mrcommons
4 calcPopulationPast calc mrcommons
5 calcPopulationFuture calc mrcommons
6 calcCollectProjectionDrivers calc mrcommons
7 calcGDPppp calc mrcommons
8 calcUrban calc mrcommons
9 calcGDPpppPast calc mrcommons
10 calcGDPpppFuture calc mrcommons
11 calcUrbanPast calc mrcommons
12 calcUrbanFuture calc mrcommons
13 readNCDrisc read mrcommons
14 readLutz2014 read mrcommons
15 readBodirsky2018 read mrcommons
16 readWDI read mrcommons
17 readMissingIslands read mrcommons
18 readUN_PopDiv read mrcommons
19 readSSP read mrcommons
20 readSRES read mrcommons
21 readIIASApop read mrcommons
22 readPopulationTWN read mrcommons
23 readJames read mrcommons
24 readPWT read mrcommons
25 readOECD read mrcommons
26 readJames2019 read mrplayground
27 toolMappingFile tool madrat
28 toolAggregate tool madrat
29 toolCountryFill tool madrat
30 toolCountry2isocode tool madrat
31 toolGetMapping tool madrat
32 toolSubtypeSelect tool madrat
33 toolHoldConstantBeyondEnd tool mrcommons
34 toolCountryCode2isocode tool mrcommons
35 toolHoldConstant tool mstools
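A listing like the one above amounts to walking a call graph from one function to everything it reaches, directly or indirectly. The sketch below shows that idea on a toy graph in base R. The function `collectDependencies` and the names `calcA`, `calcB`, `calcC`, `readX` and `toolY` are hypothetical. This is not how `getDependencies` is implemented, only an illustration of the traversal.

```r
# Toy call graph (hypothetical names): each entry lists the functions that
# the named function calls directly.
calls <- list(
  calcC = c("calcB", "readX"),
  calcB = c("calcA", "toolY"),
  calcA = c("readX", "toolY")
)

# Collect all direct and indirect dependencies of a function (breadth-first)
collectDependencies <- function(func, calls) {
  seen <- character(0)
  queue <- calls[[func]]
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% seen) next          # visit each function only once
    seen <- c(seen, current)
    queue <- c(queue, calls[[current]])  # NULL for functions without calls
  }
  # Derive the type from the lowercase prefix of the function name
  data.frame(func = seen,
             type = sub("^([a-z]+).*$", "\\1", seen, perl = TRUE))
}

deps <- collectDependencies("calcC", calls)
print(deps)
```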
Quirks
madrat was developed for a very specific use case and its scope is now broadening
wrappers do not cover all potential problems
some design decisions might be unexpected (e.g. focus on country data)
adaptation takes time (e.g. lack of download functions)
data exchange format (magclass) not well known
data sources might change their format
user-written code still can introduce problems
structure does not suit every application equally well
tl;dr
MADRaT structures data processing
MADRaT takes over tasks
(e.g. metadata handling,
testing, monitoring)
MADRaT increases accessibility
of data processing workflows
Further Reading
This presentation | slides.com/jandietrich/madrat
MADRaT repository | github.com/pik-piam/madrat
MADRaT tutorial | pik-piam.r-universe.dev/articles/madrat/madrat.html
MADRaT-based packages | github.com/pik-piam?q=mr
Dietrich J, Baumstark L, Wirth S, Giannousakis A, Rodrigues R, Bodirsky B, Kreidenweis U, Klein D, Führlich P (2022). madrat: May All Data be Reproducible and Transparent (MADRaT). doi: 10.5281/zenodo.1115490,
R package version 2.8.0, URL: https://github.com/pik-piam/madrat
contact me | dietrich@pik-potsdam.de