Scientific work and reproducibility
Sebastian Hörl
27 September 2021
Université Gustave Eiffel
Course plan
- 29 September (CM): Scientific process and reproducibility
- 13 October (CM, afternoon): Presentation and visualisation
- 13 October (TD, morning): Preparation of mini project
- 9 November (CM): Presentation and visualisation
Mini project
You will ...
- select an open data set of your choice,
- perform some processing and analysis on the data,
- document the process and make it reproducible,
- present your analysis results and approach, and
- write a short report.
Scientific process
[Figure: stages of the scientific process: Question / Hypothesis, Data, Processing, Analysis, Exploration, Results, Documentation / Presentation]
Documentation / Presentation
- Internship report
- Master's thesis
- Scientific paper
Reproducibility
- How to make results and process reproducible?
- Process automation
- Open software
- Open data
Process automation
- Process automation makes processing steps repeatable.
Process automation
[Figure: pipeline diagram with data sets DS 1 to DS 5 and processes Process 1 to Process 3]
- What are the most stable data inputs? (Raw, official data)
- How to structure the analysis in small, self-contained, repeatable steps?
- Pipeline should run automatically from source data to final output
Process automation
[Figure: pipeline diagram with data sets DS 1, DS 2, DS 3, DS 4.1, DS 5.1, DS 6 and processes Process 1, Process 3, Process 5, Process 6]
- What are the most stable data inputs? (Raw, official data)
- How to structure the analysis in small, self-contained, repeatable steps?
- Pipeline should run automatically from source data to final output
- Develop incrementally by updating independent steps
- Extend processing pipeline step by step
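A minimal sketch of such a pipeline in Python, assuming a tiny inline CSV as the raw input. The data and the step names `load`, `aggregate`, and `summarize` are hypothetical; each step is a small function that can be updated or replaced independently.

```python
import csv
import io

# Hypothetical raw input; in practice this would be an official source file.
RAW_CSV = "region,count\nA,10\nB,5\nA,7\n"


def load(raw_text):
    """Step 1: parse the raw source data into a list of records."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def aggregate(records):
    """Step 2: sum the counts per region."""
    totals = {}
    for row in records:
        totals[row["region"]] = totals.get(row["region"], 0) + int(row["count"])
    return totals


def summarize(totals):
    """Step 3: produce the final output, sorted by region name."""
    return [f"{region}: {total}" for region, total in sorted(totals.items())]


def run_pipeline(raw_text):
    """Run every step in order, from source data to final output."""
    return summarize(aggregate(load(raw_text)))


if __name__ == "__main__":
    print("\n".join(run_pipeline(RAW_CSV)))
```

Extending the pipeline then means adding one more function and one more call in `run_pipeline`, without touching the existing steps.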
Open software
- Open software makes processing steps repeatable, leading to reproducible results.
Open software
- Source code of the software is open and publicly available
- "Free" software
- Can be used freely (as in freedom) by anybody
- Historically, made available manually ...
- ... today on platforms such as GitHub or Bitbucket
- Often managed by individual developers, foundations, or companies
Why develop open software?
- Transparency: Let everybody know how the software works
- Security / Validity: "Two hundred eyes see more than two"
- Extensibility: Many developers can contribute to improve the software
- Research: Open many research pathways with novel ideas
- Reproducibility: Research results can be reproduced and reused by others
- Funding: Several agencies have decided to only fund open development with public funding
Why develop open software ... as a company?
- Developing a product vs. offering services
- Give users possibility to customize the product
- Offer customized versions and LTS (long-term support) versions
- Steer and manage a community / eco-system
Challenges
- Long-term stability: What happens after a research project ends?
Companies, foundations, ...
- Documentation and testing: Is it the highest priority?
Similar in privately funded software
- Mixing and interconnections of software
Licensing!
- Legal aspects
Intellectual property and licensing
Copyright vs licensing
- Copyright
- Who created the software?
- Whose intellectual property is it?
- This is the person / organization that decides!
- Can you give up copyright? Depends on the country.
Public domain in the US
- Licensing
- The copyright holder grants others certain rights to use, modify, etc.
- However, copyright stays with the original author
Software licensing
- Some well-defined and accepted licenses
- GNU General Public License (GPL)
- Apache Software License
- MIT License
- BSD License
- Major types of open licenses
- Copyleft
- Permissive
Some examples ...
- GPL (copyleft)
- You can freely use the code, make changes, and republish the code or the derived software
- If you republish anything, it must be published under the same license or compatible terms
- Effectively, software derived from GPL code is itself under the GPL, so the code must stay open and reusable
Some examples ...
- MIT
Copyright (c) <year> <copyright holders>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Some examples ...
- WTFPL
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004

Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified copies of this license document, and changing it is allowed as long as the name is changed.

DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. You just DO WHAT THE FUCK YOU WANT TO.
Open data
- Open data allows researchers to set up consistent pipelines from data sources to analysis results
Open data
- Data is publicly published and made available for reuse
- by researchers
- by governments
- by companies
- Recently became viable due to
- Larger IT infrastructure to save the data
- Expertise to set up data provider platforms
- Standardization of the platforms for public agencies
- Heavily supported by policies
- Open science policies
- Open government policies
Open data platforms
- UN Open Data
- https://data.un.org/
- World Bank
- https://data.worldbank.org/
- Europe Open Data
- https://data.europa.eu
- INSEE in France
- https://www.insee.fr
- Paris Open Data
- https://opendata.paris.fr/
Open data sources
- Yelp Open Data
- https://www.yelp.com/dataset
- Uber Movement
- https://movement.uber.com
- OpenStreetMap
- https://www.openstreetmap.org
- Covid-19 Open Data
- https://github.com/GoogleCloudPlatform/covid-19-open-data
Data licensing
- Far fewer widely accepted licenses than for software
- Many with different names but only slight differences
- Problem of compatibility of licenses:
What can I put on OpenStreetMap?
- Examples
- Creative Commons (BY / SA / NC / ND)
- ODbL (used by OpenStreetMap)
- Licence Ouverte / Etalab (used by public agencies in France)
Privacy and anonymization
- Which data can be put openly online?
- Which approaches ensure anonymity?
More research needed (k-anonymity)
- Reflected in different data policies (example mobility)
- France: Systematic and informed publication of open data
- Switzerland: More data available, but only under very restricted terms
- Germany: Virtually no data available
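As an illustration of k-anonymity mentioned above: a data set is k-anonymous if every combination of quasi-identifying attributes is shared by at least k records. The sketch below checks this property; the records and the attribute names `age_class` and `zone` are hypothetical, and passing the check alone does not guarantee that a data set is safe to publish.

```python
from collections import Counter

# Hypothetical mobility records; "trips" is the sensitive value.
RECORDS = [
    {"age_class": "20-29", "zone": "A", "trips": 3},
    {"age_class": "20-29", "zone": "A", "trips": 1},
    {"age_class": "30-39", "zone": "B", "trips": 2},
    {"age_class": "30-39", "zone": "B", "trips": 4},
]


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())
```

With these records, `is_k_anonymous(RECORDS, ["age_class", "zone"], 2)` holds, while the same check with `k=3` fails because each group contains only two persons.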
Open source != open access
- Some researchers argue that "open access" is about
- making work reproducible
- making work public
- making work accessible to a wide audience
Tools: Version control
- Idea: Track changes in a code base one by one
- Consistent history of code changes (what, when, who)
- Possibility to merge developments and revert changes
- Collaborative development of program code
- Solutions: CVS, Subversion, Mercurial, Git
- Platforms: Bitbucket, GitHub, GitLab
Tools: Version control
[Figure: commit history over time for two developers, Alice and Bob; labels: Initial, Bugfix, Start feature, Little fix, Finish feature, Merge]
Example: Git
Tools: Unit testing
- Idea: Write test code that makes sure that production code works as expected
- Most of the code base (classes, functions) should be covered by tests
- Tests can be run automatically for the whole project
- Unintended side effects of code changes will be detected quickly
- Solutions:
- pytest (Python)
- JUnit (Java)
- ...
Example: Python / Java
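A minimal sketch of a unit test in Python. The function `mean` is a hypothetical piece of production code; the two `test_` functions use plain `assert` statements, which is the style pytest collects when they are placed in a file named like `test_*.py`.

```python
def mean(values):
    """Production code: arithmetic mean of a non-empty sequence."""
    if not values:
        raise ValueError("mean of an empty sequence is undefined")
    return sum(values) / len(values)


def test_mean_of_numbers():
    # The expected behaviour for ordinary input.
    assert mean([1, 2, 3]) == 2


def test_mean_rejects_empty_input():
    # The error case is part of the expected behaviour, too.
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("mean([]) should raise ValueError")
```

If a later change makes `mean` return a wrong value or silently accept empty input, one of the tests fails and the side effect is detected immediately.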
Tools: Continuous integration
- Idea: Make merging in version control dependent on tests
[Figure: commit history over time for Alice and Bob (Initial, Bugfix, Start feature, Little fix, Finish feature, Merge); the merge happens only if the merged version passes all tests]
Tools: Packaging
- Idea 1: Package up code in libraries
- Packages are published and available on a server
- Package manager resolves dependencies and downloads packages
- Solutions:
- pip / Anaconda (Python)
- npm (node.js)
- gem (Ruby)
- Maven (Java)
- ...
Example: conda / mamba
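To illustrate what "resolving dependencies" involves, the sketch below derives an installation order from a dependency graph using `graphlib` from the Python standard library (Python 3.9 or later). The package names are hypothetical, and real package managers additionally have to satisfy version constraints, which this sketch ignores.

```python
from graphlib import TopologicalSorter

# Hypothetical packages: each key depends on the packages in its set.
DEPENDENCIES = {
    "my-analysis": {"plotting", "tables"},
    "plotting": {"arrays"},
    "tables": {"arrays"},
    "arrays": set(),
}


def install_order(dependencies):
    """Return an order in which every package comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(install_order(DEPENDENCIES))
```

In any valid order, `arrays` is installed first and `my-analysis` last; the relative order of `plotting` and `tables` is not fixed.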
Tools: Packaging
- Idea 2: Define (and package) whole execution environments
- Execution environment is set up consistently (versions of libraries, paths, ...)
- Either packages of environments or definition files are transferred between users
- Solutions:
- Virtual environments, e.g. Anaconda / virtualenv (Python)
- Virtual machine ("emulate" a virtual computer), e.g. Vagrant
- Containers (package an isolated execution environment as an image), e.g. Docker, Singularity
Example: conda / mamba
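A definition file can be as simple as a list of pinned library versions. The sketch below, which only assumes the Python standard library, records the installed versions of the packages an analysis relies on in the `name==version` format used by pip requirements files. The function name `pin_versions` and the package names in the usage note are placeholders.

```python
from importlib import metadata


def pin_versions(package_names):
    """Return 'name==version' lines for the packages that are installed."""
    pins = []
    for name in package_names:
        try:
            pins.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment: nothing to pin.
            pass
    return pins
```

For example, `"\n".join(pin_versions(["numpy", "pandas"]))` could be written to a file and shared along with the analysis code so that other users can install the same versions.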
Tools: Exploration and visualisation
- Data analysis requires quick testing and visualization of processing steps
- Solutions:
- Notebooks (R and Python / Jupyter)
- Visualization Grammars (e.g. Vega-Lite)
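As a sketch of the visualization grammar idea: in Vega-Lite, a chart is described declaratively as a JSON document that names the data, the mark type, and the encoding of fields to axes. The helper `bar_chart_spec` and the sample rows are hypothetical; the resulting JSON can be rendered by any Vega-Lite compatible tool.

```python
import json


def bar_chart_spec(rows, category, value):
    """Build a Vega-Lite specification for a simple bar chart."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},
        "mark": "bar",
        "encoding": {
            "x": {"field": category, "type": "nominal"},
            "y": {"field": value, "type": "quantitative"},
        },
    }


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    rows = [{"region": "A", "count": 17}, {"region": "B", "count": 5}]
    print(json.dumps(bar_chart_spec(rows, "region", "count"), indent=2))
```

Because the chart is plain data, it can be generated by a script and regenerated whenever the results change.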
Tools: Automation
- Idea: As shown before, automation makes results repeatable and adaptable.
- Solutions:
- Command line scripts
- Pipeline automation tools (e.g. snakemake)
- Custom automation and parameterization tools (e.g. papermill)
Example: Papermill
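The simplest of the solutions listed above is a parameterized command line script. The following sketch uses `argparse` from the Python standard library; the parameters `--year` and `--threshold` and the function `run_analysis` are hypothetical placeholders for a real analysis.

```python
import argparse


def build_parser():
    """Declare the parameters that control the analysis."""
    parser = argparse.ArgumentParser(description="Run the analysis for one year.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--threshold", type=float, default=0.5)
    return parser


def run_analysis(year, threshold):
    """Placeholder for the actual processing steps."""
    return f"analysis for {year} with threshold {threshold}"


def main(argv=None):
    """Parse the arguments (sys.argv by default) and run the analysis."""
    args = build_parser().parse_args(argv)
    return run_analysis(args.year, args.threshold)
```

Calling `main(["--year", "2021"])` runs the analysis with the default threshold. To use the file as a script, call `print(main())` under the usual `if __name__ == "__main__":` guard, so that the same code can be rerun with other parameters without editing it.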
Tools: Documentation
- Idea: Word documents are static and require a lot of editing once results change or are added. Hence, create documentation and reports automatically.
- Solutions:
- Automatic code documentation
- Mark-up processors (LaTeX, Markdown, ...)
Example: LaTeX
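The same idea can be sketched with Markdown: the report is generated from the results, so it never has to be edited by hand when the numbers change. The function `render_report` and the metric names in the usage note are hypothetical.

```python
def render_report(title, results):
    """Render a dictionary of results as a Markdown report with one table."""
    lines = [f"# {title}", "", "| Metric | Value |", "| --- | --- |"]
    for metric, value in results.items():
        lines.append(f"| {metric} | {value} |")
    return "\n".join(lines) + "\n"
```

For instance, `render_report("Traffic counts", {"mean": 3, "max": 7})` returns a document that a mark-up processor can convert to HTML or PDF as the last step of the pipeline.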
Mini project
- Select a data set that you find interesting and for which you want to do a brief analysis. Some examples will be given, but you can find inspiration on websites and in newspapers where statistics and data are explained.
- Explore the data with the tools of your choice (which should be automatable, e.g. R or Python).
13 October: Time to get feedback, show preliminary idea and advance work
- Automate the analysis process to create visualizations of the results as well as a small report.
9 November: Give a short presentation of the mini project.
Grading
Report and presentation (max. 15 pts)
- 50% Report: Implementation of mini project
- Reproducibility, Spelling / Grammar, Quality of analysis
- 50% Presentation
- Consistent style, font sizes, logical structure, storyline
Creative problem and solution (max. 5 pts bonus)
- Use of alternative tools, novel analyses, ...
Grading criteria
Report
- Is spelling and grammar correct?
- Is the input data described sufficiently to follow the analysis?
- Are the analysis steps described such that they can be repeated?
- Is the analysis implemented in a replicable way (tools accessible, data accessible)?
- Does the report follow a logical structure (input, processing, output)?
- Does the report provide clear instructions to replicate the process? Does the implementation provide an automated workflow?
Presentation
- Is the design consistent?
- Have instructions on font sizes and resolution been taken into account?
- Does the presentation have a logical flow of information?
- Is the focus too broad / too specific?
- Has the time been estimated properly?
Some ideas
- How has the amount of traffic in Paris changed during COVID-19? How does it change between summer and winter, in general? For cars or bicycles?
Uber Movement: https://movement.uber.com
Paris Open Data
Current road counts: Link
Historical road counts: Link
Velib Open Data with example
Some ideas
- How have COVID-19 infections advanced over time? How have vaccinations advanced?
You'll find multiple open data sources for COVID-19 time series when searching on Google.
See here for France, e.g. to show the number of hospitalizations over time.
Some ideas
- Visualize election results or predictions, for instance from the French regional elections.
Data could either be visualized as a bar chart by region, or linked with population data (see next slide). Also, maps would be possible, e.g. using tools such as geopandas or descartes in Python.
Some ideas
- Visualize population growth, in general, by department, by age class, by socio-professional class, or others.
INSEE provides comprehensive census data in terms of sociodemographics and migration within and across France: Recensement 2017
Contact
Scientific work and reproducibility
Université Gustave Eiffel, 29 September 2021