Scientific work and reproducibility
Sebastian Hörl
27 September 2021
Université Gustave Eiffel
Course plan
Mini project
You will ...
Scientific process
Question / Hypothesis
Data
Processing
Analysis
Exploration
Documentation / Presentation
Results
Reproducibility
Process automation
[Figure: pipeline of data sets (DS 1, DS 2, DS 3, DS 4, DS 5) and processes (Process 1, Process 2, Process 3)]
[Figure: the same pipeline with Process 2 replaced by Process 5; DS 4 and DS 5 become DS 4.1 and DS 5.1]
[Figure: the pipeline extended by Process 6 and DS 6]
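The pipeline above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the names `run_pipeline`, `sources` and `stages`, the toy functions standing in for Process 1, 2, 3 and 5, and the wiring between the data sets are all hypothetical and not taken from the slides.

```python
from typing import Callable, Dict, List, Tuple

# A stage is described by (function, names of the data sets it reads).
Stage = Tuple[Callable[..., object], List[str]]


def run_pipeline(sources: Dict[str, object], stages: Dict[str, Stage]) -> Dict[str, object]:
    """Compute every derived data set once, resolving inputs recursively."""
    results = dict(sources)

    def resolve(name: str) -> object:
        if name not in results:
            function, inputs = stages[name]
            results[name] = function(*[resolve(i) for i in inputs])
        return results[name]

    for name in stages:
        resolve(name)
    return results


# Hypothetical raw data sets (placeholder values, not real data).
sources = {"ds1": [1, 2, 3], "ds2": [10, 20, 30]}

# Hypothetical wiring of the derived data sets.
stages = {
    "ds3": (lambda a, b: [x + y for x, y in zip(a, b)], ["ds1", "ds2"]),  # Process 1
    "ds4": (lambda c: [x * 2 for x in c], ["ds3"]),  # Process 2
    "ds5": (lambda d: sum(d), ["ds4"]),  # Process 3
}
original = run_pipeline(sources, stages)

# Swap Process 2 for Process 5: only one definition changes, and the
# downstream data sets (ds4, ds5) are regenerated by rerunning the pipeline.
modified_stages = dict(stages)
modified_stages["ds4"] = (lambda c: [x * 3 for x in c], ["ds3"])  # Process 5
modified = run_pipeline(sources, modified_stages)

print(original["ds5"], modified["ds5"])
```

Because every data set is derived by code, replacing or adding a process only requires rerunning the pipeline; no intermediate file has to be edited by hand.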
Open software
Why develop open software?
Why develop open software ... as a company?
Challenges
Copyright vs licensing
Software licensing
Some examples ...
Copyright (c) <year> <copyright holders>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Some examples ...
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004

Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified copies of this license document, and changing it is allowed as long as the name is changed.

DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. You just DO WHAT THE FUCK YOU WANT TO.
Open data
Open data platforms
Open data sources
Data licensing
Privacy and anonymization
Open source != open access
Some researchers argue that "open access" is about:
making work reproducible
making work public
making work accessible to a wide audience
Tools: Version control
Idea: Track changes in a code base one by one
Consistent history of code changes (what, when, who)
Possibility to merge developments and revert changes
Collaborative development of program code
Solutions: CVS, Subversion, Mercurial, Git
Platforms: Bitbucket, GitHub, GitLab
[Figure: commit timeline over time with the labels Initial, Alice, Bob, Bugfix, Start feature, Little fix, Finish feature, Merge]
Example: Git
Tools: Unit testing
Idea: Write test code that makes sure that production code works as expected
Most of the code base (classes, functions) should be covered by tests
Tests can be run automatically for the whole project
Unintended side effects of code changes will be detected quickly
Solutions:
pytest (Python)
JUnit (Java)
...
Example: Python / Java
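A minimal sketch of what such a test can look like in Python. The function `mean` and both test names are hypothetical examples, not code from the course. The only convention relied on is that pytest collects functions whose names start with `test_` from files named `test_*.py`.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def test_mean_of_known_values():
    # Production code must return the value we computed by hand.
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_rejects_empty_input():
    # An empty input must fail loudly instead of returning nonsense.
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Saved as, for instance, `test_mean.py`, the tests run with `python -m pytest`. In a real project the production function would live in its own module and be imported by the test file.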
Tools: Continuous integration
Idea: Make merging in version control dependent on tests
[Figure: the same commit timeline (Initial, Alice, Bob, Bugfix, Start feature, Little fix, Finish feature, Merge); the merge happens only if the merged version passes all tests]
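The rule "merge only if the merged version passes all tests" can be sketched as a small gate. This is an illustration of the logic only, not how any particular CI platform is configured. The function name `may_merge` is hypothetical, and the two stand-in commands merely imitate a passing and a failing test run.

```python
import subprocess
import sys
from typing import Sequence


def may_merge(test_command: Sequence[str]) -> bool:
    """Run the test command; allow the merge only if it exits with code 0."""
    completed = subprocess.run(list(test_command))
    return completed.returncode == 0


# Stand-ins for a test suite (assumption: a real project would pass
# something like [sys.executable, "-m", "pytest"] instead).
passing_suite = [sys.executable, "-c", "pass"]
failing_suite = [sys.executable, "-c", "raise SystemExit(1)"]

print("merge allowed:", may_merge(passing_suite))
print("merge allowed:", may_merge(failing_suite))
```

Hosted platforms run this kind of check automatically on every proposed merge, so a failing test blocks the change before it reaches the shared history.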
Tools: Packaging
Idea 1: Package up code in libraries
Packages are published and available on a server
Package manager resolves dependencies and downloads packages
Example: conda / mamba
Tools: Packaging
Idea 2: Define (and package) whole execution environments
Execution environment is set up consistently (versions of libraries, paths, ...)
Either packages of environments or definition files are transferred between users
Example: conda / mamba
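To illustrate what a definition file contains, the following sketch lists the exact versions of all packages installed in the current Python environment, using only the standard library. It is a simplified stand-in under stated assumptions: the function name `pinned_requirements` is hypothetical, and conda provides its own export for this purpose (`conda env export`).

```python
from importlib import metadata
from typing import List


def pinned_requirements() -> List[str]:
    """Return 'name==version' lines for all installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations that lack a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print the pinned list; redirect it to a file to share the definition.
    print("\n".join(pinned_requirements()))
```

A second user who installs exactly these versions gets the same library behaviour, which is the point of transferring a definition file rather than describing the setup in words.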
Tools: Exploration and visualisation
Tools: Automation
Idea: As shown before, automation makes results repeatable and adaptable.
Example: Papermill
Tools: Documentation
Idea: Word documents are static and require a lot of editing once results change or are added. Hence, create documentation and reports automatically.
Example: LaTeX
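A minimal sketch of generating a report automatically: a Python function turns a dictionary of results into a LaTeX document, so the report is regenerated whenever the results change. The names `render_report` and `escape`, the report title and the numbers are hypothetical placeholders, not real results.

```python
from typing import Dict

# Characters that must be escaped in LaTeX text (a deliberately small subset).
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape(text: str) -> str:
    """Escape the special characters listed in LATEX_SPECIALS."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def render_report(title: str, results: Dict[str, float]) -> str:
    """Build a LaTeX document with one table row per result."""
    rows = "\n".join(
        f"{escape(name)} & {value:.2f} \\\\" for name, value in results.items()
    )
    return "\n".join([
        r"\documentclass{article}",
        r"\begin{document}",
        rf"\section*{{{escape(title)}}}",
        r"\begin{tabular}{lr}",
        r"Quantity & Value \\",
        r"\hline",
        rows,
        r"\end{tabular}",
        r"\end{document}",
    ])


# Placeholder values for illustration only.
report = render_report("Traffic counts", {"mean_flow": 412.5, "max_flow": 980.0})
print(report)
```

Writing `report` to a `.tex` file and compiling it as the last step of the analysis keeps the document in sync with the data.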
Mini project
Select a data set that you find interesting and for which you want to do a brief analysis. Some examples will be given, but you can also find inspiration on websites and in newspapers where statistics and data are explained.
Explore the data with the tools of your choice (which should be automatable, e.g. R or Python).
13 October: Time to get feedback, show a preliminary idea and advance the work.
Automate the analysis process to create visualizations of the results as well as a small report.
9 November: Give a short presentation of the mini project.
Grading
Report and presentation = max. 15 pts
Creative problem and solution = max. 5 pts bonus
Grading criteria
Report
Presentation
Some ideas
How has the amount of traffic in Paris changed during COVID-19? How does it change between summer and winter, in general? For cars or bicycles?
Uber Movement: https://movement.uber.com
Paris Open Data
Current road counts: Link
Historical road counts: Link
Velib Open Data with example
Some ideas
How have COVID-19 infections advanced over time? How have vaccinations advanced?
You'll find multiple open data sources for COVID-19 time series when searching on Google.
See here for France, e.g. to show the number of hospitalizations over time.
Some ideas
Visualize election results or predictions, for instance from the French regional elections.
Data could either be visualized as a bar chart by region, or linked with population data (see next slide). Also, maps would be possible, e.g. using tools such as geopandas or descartes in Python.
Some ideas
Visualize population growth, in general, by department, by age class, by socio-professional class, or others.
INSEE provides comprehensive census data in terms of sociodemographics and migration within and across France: Recensement 2017
Contact