MADRaT

May all data be reproducible and transparent

Jan Philipp Dietrich
dietrich@pik-potsdam.de

Data processing via MADRaT

powered by madrat » github.com/pik-piam/madrat

R-based data processing framework
 

Open Source
BSD-2 license

Facilitate reproducibility through sharing of data processing workflows 

Make workflow snippets reusable through standardization

Improve robustness through integrated testing

Increase transparency through automatic metadata generation and handling

Reproducibility

Sharing the data of a study is important 

Sharing the code which produced the data is even more important 

# What this could look like for
# a collection of data
install.packages("mrMyPaper")
library(mrMyPaper)
retrieveData("MyPaperData")

Reusability

Build on existing code instead of existing data

Use existing pieces in new workflows

Robustness

Testing helps to identify problems at an early stage

Are the values within the expected range?

Is the data in the expected format?

Does the data contain NAs?

Run calcOutput(type = "ValidLandChange", datasource = "SSPResults")
> Run calcOutput(type = "ValidLand", datasource = datasource, aggregate = FALSE)
>>  - force cache >> /p/projects/rd3mod/inputdata/cache/rev4.52jpdtest6/calcValidLand-16524e79d5da215a1c1ab88617221e82.rds
>> WARNING: Data returned by  mrvalidation:::calcValidLand(...) contains values smaller than the predefined minimum (min =  0 )
> Exit calcOutput(type = "ValidLand", datasource = datasource, aggregate = FALSE) in 0.2 seconds
Exit calcOutput(type = "ValidLandChange", datasource = "SSPResults") in 0.82 seconds
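
The three questions above map onto simple checks on a returned data object. As a minimal base R sketch (the helper `checkData` and its arguments `minValue` and `maxValue` are hypothetical and not part of madrat; the checks madrat actually runs may differ), such tests could look like this:

```r
# Hypothetical helper, not part of madrat: run basic sanity checks on a
# numeric result and return the detected problems as text.
checkData <- function(x, minValue = -Inf, maxValue = Inf) {
  # Format check: stop early if the data is not numeric at all.
  if (!is.numeric(x)) {
    return("data is not in the expected (numeric) format")
  }
  problems <- character(0)
  # NA check
  if (anyNA(x)) {
    problems <- c(problems, "data contains NAs")
  }
  # Range checks, ignoring NAs that were already reported above
  if (any(x < minValue, na.rm = TRUE)) {
    problems <- c(problems, paste0("data contains values smaller than the predefined minimum (min = ", minValue, ")"))
  }
  if (any(x > maxValue, na.rm = TRUE)) {
    problems <- c(problems, paste0("data contains values greater than the predefined maximum (max = ", maxValue, ")"))
  }
  problems
}

# Illustrative values only
landChange <- c(0.3, -0.1, NA, 1.2)
issues <- checkData(landChange, minValue = 0)
for (issue in issues) message("WARNING: ", issue)
```

Reporting problems as warnings rather than errors mirrors the log above, where the run continues after the range check fails.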

Transparency

Metadata helps to put data into context

* description: Pasture yields
* unit: ton DM per ha
* origin: calcOutput(type = "PastureYield", file = "f14_pasture_yields_hist.csv", round = 3) (madrat 1.82.0 | mrland 0.3.2)
* creation date: Wed Sep  9 10:13:24 2020
dummy,LAM,OAS,SSA,EUR,NEU,MEA,REF,CAZ,CHA,IND,JPN,USA
y1965,1.182,1.225,0.47,6.59,5.091,0.408,0.859,0.556,1.639,20.865,12.897,1.438
y1970,1.234,1.28,0.49,6.745,5.245,0.424,0.861,0.577,1.564,21.572,13.416,1.54
title: Tau Factor (historic trends)
description: Historic land use intensity (tau) development
author:
- given: Jan Philipp
  family: Dietrich
  role: ~
  email: dietrich@pik-potsdam.de
  comment: https://orcid.org/0000-0002-4309-6431
doi: 10.5281/zenodo.4282548
url: https://zenodo.org/record/4282548/files/tau-historical.zip
accessibility: gold
license: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA
  4.0)
version: '1.0'
release_date: '2012-05-10'
unit: '1'
call:
  origin: downloadSource(type = type, subtype = subtype) -> madrat:::downloadTau(subtype=subtype)
    (madrat 1.93.3 | madrat 1.93.3)
  type: Tau
  subtype: historical
  time: 2021-02-23 11:20:34 CET
reference:
- title: Measuring agricultural land-use intensity - A global analysis using a model-assisted
    approach
  author:
  - given: Jan Philipp
    family: Dietrich
    role: ~
    email: dietrich@pik-potsdam.de
    comment: https://orcid.org/0000-0002-4309-6431
  - given: Christoph
    family: Schmitz
    role: ~
    email: ~
    comment: ~
  - given: Christoph
    family: Mueller
    role: ~
    email: ~
    comment: ~
  - given: Marianela
    family: Fader
    role: ~
    email: ~
    comment: ~
  - given: Hermann
    family: Lotze-Campen
    role: ~
    email: ~
    comment: ~
  - given: Alexander
    family: Popp
    role: ~
    email: ~
    comment: ~
  year: '2012'
  journal: Ecological Modelling
  volume: '232'
  pages: 109-118
  url: https://doi.org/10.1016/j.ecolmodel.2012.03.002
  doi: 10.1016/j.ecolmodel.2012.03.002
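
The comment-style header in the first metadata example above can be produced with a few lines of code. The following is a rough base R sketch, assuming a hypothetical helper `writeWithMetadata`; it is not how madrat writes its files, and the data values are copied from the example only for illustration:

```r
# Hypothetical helper, not part of madrat: write a CSV file that starts with
# "* key: value" metadata lines, followed by the data table itself.
writeWithMetadata <- function(data, file, description, unit) {
  header <- c(
    paste("* description:", description),
    paste("* unit:", unit),
    paste("* creation date:", date())
  )
  writeLines(header, file)
  # Append the table below the header; the warning about appending column
  # names to an existing file is expected here and therefore suppressed.
  suppressWarnings(
    write.table(data, file, append = TRUE, sep = ",", quote = FALSE,
                row.names = FALSE, col.names = TRUE)
  )
  invisible(file)
}

yields <- data.frame(dummy = c("y1965", "y1970"),
                     LAM = c(1.182, 1.234),
                     OAS = c(1.225, 1.28))
path <- writeWithMetadata(yields, tempfile(fileext = ".csv"),
                          description = "Pasture yields",
                          unit = "ton DM per ha")
writeLines(readLines(path))
```

Keeping the metadata in the same file as the data means the context travels with the numbers, at the cost of readers having to skip the header lines when parsing.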

Structured data processing


retrieveData: bundle data sets

readSource: download and read source data 

calcOutput: perform calculations on the data (filtering, merging, ...)

data processing is split into distinct steps

wrappers provide a controlled environment for user-written code

[Figure: structured data processing (user code inside a wrapper) compared with a black-box script]
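
The split between wrapper and user code can be illustrated with a simplified base R sketch. The names `runCalc` and `calcExampleYield` are hypothetical, and the required return fields are an assumption made for this example; the real madrat wrappers do considerably more (caching, aggregation, logging) and are not reproduced here:

```r
# User code: a hypothetical calculation that follows a "calc<Type>" naming
# convention and returns the data together with descriptive metadata.
calcExampleYield <- function() {
  list(
    x = c(LAM = 1.182, OAS = 1.225, SSA = 0.47), # illustrative values
    unit = "ton DM per ha",
    description = "Pasture yields"
  )
}

# Wrapper: looks up the user function by name, runs it, checks that the
# expected fields were returned, and attaches the metadata to the data.
runCalc <- function(type, ...) {
  functionName <- paste0("calc", type)
  userFunction <- get(functionName, mode = "function")
  message("Run ", functionName)
  startTime <- Sys.time()
  result <- userFunction(...)
  missingFields <- setdiff(c("x", "unit", "description"), names(result))
  if (length(missingFields) > 0) {
    stop(functionName, " did not return: ", paste(missingFields, collapse = ", "))
  }
  x <- result$x
  attr(x, "unit") <- result$unit
  attr(x, "description") <- result$description
  elapsed <- round(as.numeric(difftime(Sys.time(), startTime, units = "secs")), 2)
  message("Exit ", functionName, " in ", elapsed, " seconds")
  x
}

yield <- runCalc("ExampleYield")
print(attributes(yield))
```

Because every calculation passes through the same wrapper, checks and bookkeeping are written once instead of being repeated in each user function.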

Building workflows


use building blocks to create data processing workflows

remix existing workflows

reuse snippets from other workflows

Analyzing MADRaT network

a <- getMadratInfo()
.:: Check network size ::.
[INFO] 176 read functions (called 420 times, 2.39 calls on average)
[INFO] 433 calc functions (called 994 times, 2.3 calls on average)
[INFO] 52 tool functions (called 778 times, 14.96 calls on average)
[INFO] 13 retrieve functions (triggering 277 calls, 21.31 calls on average)
...
findBottlenecks("log.txt")  
Total runtime: 2 hours 6 minutes 18 seconds
     level class               type time[min] time[%] net[min] net[%]
562      4  calc  FAOForestRelocate     33.04   26.16    32.49  25.72
181      3  read             LUH2v2     14.29   11.32    14.29  11.32
123      2  calc FAOmassbalance_pre     11.22    8.88     8.60   6.81
560      6  read              LPJmL      7.46    5.91     7.46   5.91
177      3  read                IEA      7.42    5.88     7.37   5.83
95       1  read           Lutz2014      4.62    3.66     4.61   3.65
110      4  read                FAO      4.52    3.58     4.50   3.56
1224     1  read             ISIMIP      3.84    3.04     3.84   3.04
990      3  read             MAgPIE      3.59    2.84     3.49   2.76
201      0  calc         BodyHeight      3.51    2.78     3.46   2.74
...
getDependencies("calcBodyHeight")
                           func type      package
1                calcDemography calc    mrcommons
2                calcPopulation calc    mrcommons
3                     calcGDPpc calc    mrcommons
4            calcPopulationPast calc    mrcommons
5          calcPopulationFuture calc    mrcommons
6  calcCollectProjectionDrivers calc    mrcommons
7                    calcGDPppp calc    mrcommons
8                     calcUrban calc    mrcommons
9                calcGDPpppPast calc    mrcommons
10             calcGDPpppFuture calc    mrcommons
11                calcUrbanPast calc    mrcommons
12              calcUrbanFuture calc    mrcommons
13                  readNCDrisc read    mrcommons
14                 readLutz2014 read    mrcommons
15             readBodirsky2018 read    mrcommons
16                      readWDI read    mrcommons
17           readMissingIslands read    mrcommons
18                readUN_PopDiv read    mrcommons
19                      readSSP read    mrcommons
20                     readSRES read    mrcommons
21                 readIIASApop read    mrcommons
22            readPopulationTWN read    mrcommons
23                    readJames read    mrcommons
24                      readPWT read    mrcommons
25                     readOECD read    mrcommons
26                readJames2019 read mrplayground
27              toolMappingFile tool       madrat
28                toolAggregate tool       madrat
29              toolCountryFill tool       madrat
30          toolCountry2isocode tool       madrat
31               toolGetMapping tool       madrat
32            toolSubtypeSelect tool       madrat
33    toolHoldConstantBeyondEnd tool    mrcommons
34      toolCountryCode2isocode tool    mrcommons
35             toolHoldConstant tool      mstools
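
A dependency listing like the one above is in essence a traversal of a call graph. As a toy base R sketch (the graph `callGraph` and the function `collectDependencies` are invented for illustration and do not reflect how `getDependencies` is implemented or which functions actually call each other):

```r
# Hypothetical call graph: each function name maps to the functions it
# calls directly.
callGraph <- list(
  calcC = c("calcB", "toolT"),
  calcB = c("readA", "toolT"),
  readA = character(0),
  toolT = character(0)
)

# Collect all direct and indirect dependencies of `func` depth-first,
# visiting every function only once.
collectDependencies <- function(func, graph, seen = character(0)) {
  for (dep in graph[[func]]) {
    if (!(dep %in% seen)) {
      seen <- c(seen, dep)
      seen <- collectDependencies(dep, graph, seen)
    }
  }
  seen
}

deps <- collectDependencies("calcC", callGraph)
print(deps)
```

Tracking the already visited functions in `seen` keeps shared helpers such as `toolT` from being listed twice and prevents endless recursion if the graph contains a cycle.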

Quirks

madrat was developed for a very specific use case and its scope is now broadening

wrappers do not cover all potential problems

some design decisions might be unexpected (e.g. focus on country data)

adaptation takes time (e.g. lack of download functions)

data exchange format (magclass) is not well known

data sources might change their format

user-written code can still introduce problems

structure does not suit every application equally well

tl;dr 

 

MADRaT structures data processing

MADRaT takes over tasks (e.g. metadata handling, testing, monitoring)

MADRaT increases accessibility of data processing workflows


Further Reading

This presentation - slides.com/jandietrich/madrat

MADRaT repository - https://github.com/pik-piam/madrat

 

 

MADRaT tutorial - pik-piam.r-universe.dev/articles/madrat/madrat.html

MADRaT-based packages - github.com/pik-piam?q=mr

Dietrich J, Baumstark L, Wirth S, Giannousakis A, Rodrigues R, Bodirsky B, Kreidenweis U, Klein D, Führlich P (2022). madrat: May All Data be Reproducible and Transparent (MADRaT). R package version 2.8.0, doi: 10.5281/zenodo.1115490, URL: https://github.com/pik-piam/madrat

contact me - dietrich@pik-potsdam.de