Scientific work and reproducibility

Sebastian Hörl

27 September 2021

Université Gustave Eiffel

Course plan

  • 29 September (CM)
    Scientific process and reproducibility



     
  • 13 October (CM, afternoon)
    Presentation and visualisation
  • 13 October (TD, morning)
    Preparation of mini project



     
  • 9 November (CM)
    Presentation and visualisation

Mini project

You will ...

  1. select an open data set of your choice,
  2. perform some processing and analysis on the data,
  3. document the process and make it reproducible,
  4. present your analysis results and approach, and
  5. write a short report.

Scientific process

Question / Hypothesis

Data

Processing

Analysis

Exploration

Documentation / Presentation

  • Internship report
  • Master thesis
  • Scientific paper

Results

Reproducibility

[Figure: scientific process diagram with the elements Question / Hypothesis, Data, Processing, Analysis, Exploration, Documentation / Presentation, and Results]

  • How to make results and process reproducible?
    • Process automation
    • Open software
    • Open data

Process automation

[Figure: scientific process diagram with the elements Question / Hypothesis, Data, Processing, Analysis, Exploration, Documentation / Presentation, and Results]

  • Process automation makes processing steps repeatable.

Process automation

[Figure: pipeline diagram with data sets DS 1 to DS 5 and processing steps Process 1, Process 2, and Process 3]

  • What are the most stable data inputs? (Raw, official data)
  • How to structure the analysis in small, self-contained, repeatable steps?
     
  • Pipeline should run automatically from source data to final output
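
A minimal sketch of such a pipeline in Python, assuming a small made-up CSV as the raw input. The data, the function names (`load`, `compute_growth`, `format_report`, `run_pipeline`), and the analysis itself are hypothetical and only illustrate how small, self-contained steps can be chained from source data to final output.

```python
import csv
import io

# Hypothetical raw input; in practice this would be a downloaded open data file.
RAW_CSV = """region,year,population
North,2020,100
North,2021,110
South,2020,200
South,2021,190
"""


def load(raw_text):
    """Step 1: parse the raw source data into a list of typed records."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [
        {
            "region": row["region"],
            "year": int(row["year"]),
            "population": int(row["population"]),
        }
        for row in reader
    ]


def compute_growth(records):
    """Step 2: population change per region between the first and last year."""
    by_region = {}
    for record in sorted(records, key=lambda r: r["year"]):
        by_region.setdefault(record["region"], []).append(record["population"])
    return {region: values[-1] - values[0] for region, values in by_region.items()}


def format_report(growth):
    """Step 3: turn the analysis result into a readable final output."""
    lines = [f"{region}: {change:+d}" for region, change in sorted(growth.items())]
    return "\n".join(lines)


def run_pipeline(raw_text):
    """Run all steps automatically, from source data to final output."""
    return format_report(compute_growth(load(raw_text)))


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Because each step only depends on the output of the previous one, a single step can be replaced or a new step appended without touching the others.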

Process automation

[Figure: pipeline diagram with data sets DS 1, DS 2, DS 3, DS 4.1, DS 5.1, and DS 6 and processing steps Process 1, Process 3, Process 5, and Process 6]

  • What are the most stable data inputs? (Raw, official data)
  • How to structure the analysis in small, self-contained, repeatable steps?
     
  • Pipeline should run automatically from source data to final output
  • Develop incrementally by updating independent steps
  • Extend processing pipeline step by step

Open software

[Figure: scientific process diagram with the elements Question / Hypothesis, Data, Processing, Analysis, Exploration, Documentation / Presentation, and Results]

  • Open software makes processing steps repeatable, leading to reproducible results.

Open software

  • Source code of the software is open and publicly available
     
  • "Free" software
    • Can be used freely (as in freedom) by anybody
       
  • Historically, made available manually ...
  • ... today on platforms such as GitHub or Bitbucket
     
  • Often managed by individual developers, foundations, or companies

Why develop open software?

  • Transparency: Let everybody know how the software works
     
  • Security / Validity: "Two hundred eyes see more than two"
     
  • Extensibility: Many developers can contribute to improve the software
    • Research: Open many research pathways with novel ideas
       
  • Reproducibility: Research results can be reproduced and reused by others
     
  • Funding: Several agencies have decided to only fund open development with public funding

Why develop open software ... as a company?

  • Developing a product vs. offering services
     
  • Give users possibility to customize the product
  • Offer customized versions and LTS (long-term support) versions
     
  • Steer and manage a community / ecosystem

Challenges

  • Long-term stability: What happens after a research project ends?
      Companies, foundations, ...
     
  • Documentation and testing: Is it the highest priority?
      Similar in privately funded software
     
  • Mixing and interconnections of software
      Licensing!
     
  • Legal aspects
      Intellectual property and licensing

Copyright vs licensing

  • Copyright
    • Who created the software?
    • Whose intellectual property is it?
    • This is the person / organization that decides!
       
    • Can you give up copyright? Depends on the country.
      Public domain in the US
       
  • Licensing
    • The copyright holder grants others certain rights to use, modify, etc.
    • However, copyright stays with the original author

Software licensing

  • Some well-defined and accepted licenses
    • GNU General Public License (GPL)
    • Apache Software License
    • MIT License
    • BSD License
       
  • Major types of open licenses
    • Copyleft
    • Permissive

Some examples ...

  • GPL (copyleft)
    • You can freely use the code, make changes, and republish the code or the derived software
       
    • If you republish anything, it must be published under the same license or compatible terms
       
    • Effectively, software derived from GPL code is GPL-licensed again, so its code must stay open and reusable

Some examples ...

  • MIT
Copyright (c) <year> <copyright holders>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Some examples ...

  • WTFPL
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004
 
Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
 
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.

Open data

[Figure: scientific process diagram with the elements Question / Hypothesis, Data, Processing, Analysis, Exploration, Documentation / Presentation, and Results]

  • Open data allows researchers to set up consistent pipelines from data sources to analysis results

Open data

  • Data is publicly published and made available for reuse
    • by researchers
    • by governments
    • by companies
       
  • Recently became viable due to
    • Larger IT infrastructure to save the data
    • Expertise to set up data provider platforms
    • Standardization of the platforms for public agencies
       
  • Heavily supported by policies
    • Open science policies
    • Open government policies

Open data platforms

  • UN Open Data
    • https://data.un.org/

  • World Bank
    • https://data.worldbank.org/

  • Europe Open Data
    • https://data.europa.eu
  • INSEE in France
    • https://www.insee.fr

 

  • Paris Open Data
    • https://opendata.paris.fr/

Open data sources

  • Yelp Open Data
    • https://www.yelp.com/dataset
       
  • Uber Movement
    • https://movement.uber.com
       
  • OpenStreetMap
    • https://www.openstreetmap.org
       
  • Covid-19 Open Data
    • https://github.com/GoogleCloudPlatform/covid-19-open-data

 

Data licensing

  • Far fewer commonly used licenses than for software
  • Many with different names but only slight differences
     
  • Problem of compatibility of licenses:
    What can I put on OpenStreetMap?
     
  • Examples
    • Creative Commons (BY / SA / NC / ND)
    • ODbL (used by OpenStreetMap)
    • Etalab (used by public agencies in France)

Privacy and anonymization

  • Which data can be put openly online?
     
  • Which approaches ensure anonymity?
      More research needed (k-anonymity)
     
  • Reflected in different data policies (example: mobility data)
    • France: Systematic and informed publication of open data
    • Switzerland: More data available, but only under very restricted terms
    • Germany: Virtually no data available
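
As an illustration of the k-anonymity idea mentioned above: a data set is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The following is a minimal sketch, assuming made-up trip records; the column names and the function `k_anonymity` are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for all quasi-identifier columns."""
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Hypothetical, made-up trip records (not real data).
trips = [
    {"age_class": "18-29", "home_zone": "A", "mode": "bike"},
    {"age_class": "18-29", "home_zone": "A", "mode": "car"},
    {"age_class": "30-44", "home_zone": "B", "mode": "car"},
    {"age_class": "30-44", "home_zone": "B", "mode": "walk"},
    {"age_class": "30-44", "home_zone": "B", "mode": "bike"},
]

# Smallest group size when age class and home zone are published together.
k = k_anonymity(trips, ["age_class", "home_zone"])
```

The more columns are treated as quasi-identifiers, the smaller the groups become, which is why fine-grained data sets are harder to publish openly.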

Open source != open access

  • Some researchers argue that "open access" is about

    • making work reproducible

    • making work public

    • making work accessible to a wide audience

Tools: Version control

  • Idea: Track changes in a code base one by one

    • Consistent history of code changes (what, when, who)

    • Possibility to merge developments and revert changes

    • Collaborative development of program code
       

  • Solutions: CVS, Subversion, Mercurial, Git

  • Platforms: Bitbucket, GitHub, GitLab

Tools: Version control

[Figure: commit history over time for two developers, Alice and Bob. Commits shown: Initial, Bugfix, Bugfix, Bugfix, Start feature, Little fix, Finish feature, Merge]

Example: Git

Tools: Unit testing

  • Idea: Write test code that makes sure that production code works as expected

    • Most of the code base (classes, functions) should be covered by tests

    • Tests can be run automatically for the whole project

    • Unintended side effects of code changes will be detected quickly
       

  • Solutions:

    • pytest (Python)

    • JUnit (Java)

    • ...

Example: Python / Java
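
A minimal sketch of what such tests can look like in Python. The function `mean` stands in for production code and is hypothetical; the test functions use plain `assert` statements and the `test_` name prefix, which is the convention pytest uses to discover tests.

```python
def mean(values):
    """Production code: arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def test_mean_of_known_values():
    # A known input with a known expected result.
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_rejects_empty_input():
    # The error case is part of the expected behaviour and is tested as well.
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

If the code lives in a file whose name starts with `test_`, running `pytest` in the project directory should pick up both tests automatically.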

Tools: Continuous integration

  • Idea: Make merging in version control dependent on tests

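[Figure: commit history over time for two developers, Alice and Bob. Commits shown: Initial, Bugfix, Bugfix, Bugfix, Start feature, Little fix, Finish feature, Merge. Annotation on the merge: only if the merged version passes all tests]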

Tools: Packaging

  • Idea 1: Package up code in libraries

    • Packages are published and available on a server

    • Package manager resolves dependencies and downloads packages
       

  • Solutions:
    • pip / Anaconda (Python)
    • npm (Node.js)
    • gem (Ruby)
    • Maven (Java)
    • ...

Example: conda / mamba

Tools: Packaging

  • Idea 2: Define (and package) whole execution environments

    • Execution environment is set up consistently (versions of libraries, paths, ...)

    • Either packages of environments or definition files are transferred between users
       

  • Solutions:
    • Virtual environments, e.g. Anaconda / virtualenv (Python)
    • Virtual machines ("emulate" a whole computer), e.g. managed with Vagrant
    • Containers (package an isolated environment that shares the host's kernel), e.g. Docker, Singularity

Example: conda / mamba
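
Independently of the tool used to create an environment, it can help to record which versions were actually used for a result. The following is a minimal sketch using only the Python standard library; it does not replace conda or mamba, and the function name `describe_environment` and the package names are only examples.

```python
import sys
from importlib import metadata


def describe_environment(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {"python": sys.version.split()[0], "packages": versions}


if __name__ == "__main__":
    # Example package names; replace them with the ones the analysis depends on.
    print(describe_environment(["pip", "numpy", "pandas"]))
```

Storing this information next to the results makes it easier for others to rebuild a matching environment later.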

Tools: Exploration and visualisation

  • Data analysis requires quick testing and visualization of processing steps

 

  • Solutions:
    • Notebooks (R and Python / Jupyter)
    • Visualization Grammars (e.g. Vega-Lite)

Example: Jupyter (Data)
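
To give an idea of what a visualization grammar looks like, the following sketch builds a small Vega-Lite bar chart specification as a Python dictionary and serializes it to JSON. The values are made up for illustration; a real analysis would load them from an open data set.

```python
import json

# Hypothetical, made-up values (not real data).
values = [
    {"region": "North", "population": 110},
    {"region": "South", "population": 190},
]

# Declarative description of the chart: which data, which mark,
# and how data fields are mapped to the x and y axes.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": values},
    "mark": "bar",
    "encoding": {
        "x": {"field": "region", "type": "nominal"},
        "y": {"field": "population", "type": "quantitative"},
    },
}

spec_json = json.dumps(spec, indent=2)
```

The resulting JSON can be pasted into a Vega-Lite editor or rendered by a library that understands the format. The chart is described by the specification rather than drawn step by step, which makes it easy to regenerate when the data changes.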

Tools: Automation

  • Idea: As shown before, automation makes results repeatable and adaptable.
     

  • Solutions:
    • Command line scripts
    • Pipeline automation tools (e.g. Snakemake)
    • Custom automation and parameterization tools (e.g. papermill)
    • Virtual machines ("emulate" a whole computer), e.g. managed with Vagrant
    • Containers (package an isolated environment that shares the host's kernel), e.g. Docker, Singularity

Example: Papermill
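
The simplest option in the list above, a command line script, can be sketched with the standard library alone. This is not a papermill example; the parameters `--year` and `--region` and the function `run_analysis` are hypothetical placeholders for a real analysis.

```python
import argparse


def build_parser():
    """Define the parameters that can be changed without editing the code."""
    parser = argparse.ArgumentParser(description="Run a parameterized analysis.")
    parser.add_argument("--year", type=int, default=2021, help="year to analyse")
    parser.add_argument("--region", default="all", help="region to analyse")
    return parser


def run_analysis(year, region):
    """Placeholder for the actual analysis; returns a description of the run."""
    return f"analysis for region={region}, year={year}"


def main(argv=None):
    """Parse the command line (or an explicit argument list) and run the analysis."""
    args = build_parser().parse_args(argv)
    return run_analysis(args.year, args.region)
```

Calling `main(["--year", "2020", "--region", "Paris"])` runs the same code with different parameters, which is the idea that tools such as papermill apply to whole notebooks.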

Tools: Documentation

  • Idea: Word documents are static and require a lot of editing once results change or are added. Hence, create documentation and reports automatically.
     

  • Solutions:
    • Automatic code documentation
    • Mark-up processors (LaTeX, Markdown, ...)

Example: LaTeX
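
A minimal sketch of automatic report generation, assuming a Markdown report and made-up observations. The template text and the function `render_report` are hypothetical; the same approach could be used to fill a LaTeX template instead.

```python
from string import Template

# Report text with placeholders that are filled from computed results.
REPORT_TEMPLATE = Template(
    "# $title\n"
    "\n"
    "Number of observations: $count\n"
    "Mean value: $mean\n"
)


def render_report(title, observations):
    """Fill the Markdown template with values computed from the data."""
    mean = sum(observations) / len(observations)
    return REPORT_TEMPLATE.substitute(
        title=title, count=len(observations), mean=f"{mean:.2f}"
    )


# Made-up observations for illustration.
report = render_report("Example report", [1.0, 2.0, 4.0])
```

When the data changes, rerunning the script updates every number in the report, so no value has to be edited by hand.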

Mini project

  1. Select a data set that you find interesting and for which you want to do a brief analysis. Some examples will be given, but you can also find inspiration on websites and in newspapers where statistics and data are explained.
     

  2. Explore the data with the tools of your choice (which should be automatable, e.g. R or Python).

    13 October: Time to get feedback, show preliminary ideas, and advance the work
     

  3. Automate the analysis process to create visualizations of the results as well as a small report.

    9 November: Give a short presentation of the mini project.

Grading

Report and presentation (max. 15 pts)

  • 50% Report: Implementation of mini project
    • Reproducibility, Spelling / Grammar, Quality of analysis
  • 50% Presentation
    • Consistent style, font sizes, logical structure, storyline

Creative problem and solution (max. 5 pts bonus)

  • Use of alternative tools, novel analyses, ...

Grading criteria

Report

  • Are spelling and grammar correct?
  • Is the input data described sufficiently to follow the analysis?
  • Are the analysis steps described such that they can be repeated?
  • Is the analysis implemented in a replicable way (tools accessible, data accessible)?
  • Does the report follow a logical structure (input, processing, output)?
  • Does the report provide clear instructions to replicate the process? Does the implementation provide an automated workflow?

Presentation

  • Is the design consistent?
  • Have instructions on font sizes and resolution been taken into account?
  • Does the presentation have a logical flow of information?
  • Is the focus too broad / too specific?
  • Has the time been estimated properly?

Some ideas

  • How has the amount of traffic in Paris changed during COVID-19? How does it change between summer and winter, in general? For cars or bicycles?

    Uber Movement: https://movement.uber.com

    Paris Open Data
      Current road counts: Link
      Historical road counts: Link

    Velib Open Data with example
     

Some ideas

  • How have COVID-19 infections advanced over time? How have vaccinations advanced?

    You'll find multiple open data sources for COVID-19 time series when searching on Google.

    See here for France, e.g. to show the number of hospitalizations over time.

Some ideas

  • Visualize election results or predictions, for instance from the French regional elections.

    Data could either be visualized as a bar chart by region, or linked with population data (see next slide). Also, maps would be possible, e.g. using tools such as geopandas or descartes in Python.

Some ideas

  • Visualize population growth, in general, by department, by age class, by socio-professional class, or others.

    INSEE provides comprehensive census data in terms of sociodemographics and migration within and across France: Recensement 2017
     

Contact

Université Gustave Eiffel, 29 September 2021
