Open software and data in research

Sebastian Hörl

23 November 2020

Mertonian norms

Communism: Universal ownership of scientific goods and collective advancement

Universalism: Scientific validity is not related to positionality of the researcher.

Disinterestedness: Research for the common good, not individual / organizational interest.

Organized scepticism: Scientific work is constantly subject to critical evaluation.

Reproducibility

Data

Algorithms

Analysis

Results

Validation / Falsification

Results

Verification

Analysis

Replication

Algorithms

Data

?

?
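
One way to keep the chain from data to results reproducible is to record a fingerprint of the input data and to fix the random seed of the analysis. The following is only a minimal sketch: the function name `run_analysis`, the toy bootstrap analysis, and the sample values are assumptions made for illustration, not something taken from the slides.

```python
import hashlib
import json
import random


def run_analysis(data: list[float], seed: int) -> dict:
    """Toy analysis (hypothetical): bootstrap estimate of the mean with provenance."""
    # Fingerprint the input data so others can check they start from the same dataset
    payload = json.dumps(data).encode("utf-8")
    checksum = hashlib.sha256(payload).hexdigest()

    # A local, seeded generator makes the random resampling repeatable
    rng = random.Random(seed)
    means = []
    for _ in range(1000):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))

    # Publish the provenance (checksum, seed) together with the result
    return {
        "data_sha256": checksum,
        "seed": seed,
        "bootstrap_mean": sum(means) / len(means),
    }


if __name__ == "__main__":
    observations = [12.5, 9.0, 14.2, 11.1, 10.4]  # made-up values
    print(run_analysis(observations, seed=42))
```

Running the function twice with the same data and seed returns an identical dictionary, which is the property a third party needs in order to verify the analysis step.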

Open Software

  • Source code of the software is open and publicly available
     
  • "Free" software
    • Can be used freely (as in freedom) by anybody
       
  • Historically, made available manually ...
  • ... today on platforms such as GitHub or Bitbucket
     
  • Often managed by individual developers, foundations, or companies

Why develop in open source?

  • Transparency: Let everybody know how the software works
     
  • Security / Validity: "Two hundred eyes see more than two"
     
  • Extensibility: Many developers can contribute to improve the software
    • Research: Open many research pathways with novel ideas
       
  • Reproducibility: Research results can be reproduced and reused by others
     
  • Funding: Several agencies have decided to fund only open development with public money

Why develop in open source ... as a company?

  • Developing a product vs. offering services
     
  • Give users possibility to customize the product
  • Offer customized versions and LTS (long-term support) versions
     
  • Steer and manage a community / eco-system

Challenges

  • Long-term stability: What happens after a research project ends?
      Companies, foundations, ...
     
  • Documentation and testing: Is it the highest priority?
      Similar in privately funded software
     
  • Mixing and interconnections of software
      Licensing!
     
  • Legal aspects
      Intellectual property and licensing

Copyright vs licensing

  • Copyright
    • Who created the software?
    • Whose intellectual property is it?
    • This is the person / organization that decides!
       
    • Can you give up copyright? Depends on the country.
      Public domain in the US
       
  • Licensing
    • The copyright holder grants others certain rights to use, modify, etc.
    • However, the copyright remains with the original author

Software licensing

  • Some well-defined and accepted licenses
    • GNU General Public License (GPL)
    • Apache License
    • MIT License
    • BSD License
       
  • Major types of open licenses
    • Copyleft
    • Permissive

Some examples ...

  • GPL (copyleft)
    • You can freely use the code, make changes, and republish the code or the derived software
       
    • If you republish anything, it must be published under the same license or compatible terms
       
    • Effectively, anything derived from GPL software is GPL-licensed again, so the code must stay open and reusable

Some examples ...

  • MIT
Copyright (c) <year> <copyright holders>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Some examples ...

  • WTFPL
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004
 
Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
 
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.

Open data

  • Data is publicly published and made available for reuse
    • by researchers
    • by governments
    • by companies
       
  • Recently became viable due to
    • Larger IT infrastructure to save the data
    • Expertise to set up data provider platforms
    • Standardization of the platforms for public agencies
       
  • Heavily supported by policies
    • Open science policies
    • Open government policies

Examples of platforms

  • UN Open Data
    • https://data.un.org/

  • World Bank
    • https://data.worldbank.org/

  • Europe Open Data
    • https://data.europa.eu

  • INSEE in France
    • https://www.insee.fr

  • Paris Open Data
    • https://opendata.paris.fr/

Examples of data initiatives

  • Yelp Open Data
    • https://www.yelp.com/dataset
       
  • Uber Movement
    • https://movement.uber.com
       
  • OpenStreetMap
    • https://www.openstreetmap.org

 

Data licensing

  • Far fewer widely used standard licenses than for software
  • Many licenses have different names but differ only slightly in their terms
     
  • Problem of compatibility of licenses:
    What can I put on OpenStreetMap?
     
  • Examples
    • Creative Commons (BY / SA / NC / ND)
    • ODbL (used by OpenStreetMap)
    • Etalab (used by public agencies in France)
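
The Creative Commons elements listed above combine into license codes such as "CC BY-SA". As a rough illustration, the sketch below spells out what each element means and rejects one combination that does not exist among the standard licenses (SA together with ND). The function name `describe_cc` and the wording of the descriptions are assumptions for this example; it does not decide whether two licenses are compatible.

```python
# Meaning of the Creative Commons license elements (wording paraphrased)
CC_ELEMENTS = {
    "BY": "attribution: credit the creator",
    "SA": "share-alike: adaptations must be shared under the same terms",
    "NC": "non-commercial: no commercial use",
    "ND": "no derivatives: adaptations may not be shared",
}


def describe_cc(code: str) -> list[str]:
    """Explain a license code such as 'CC BY-NC-SA' (hypothetical helper)."""
    name = code.upper().strip()
    if name.startswith("CC"):
        name = name[2:].strip()
    parts = name.split("-")

    unknown = [part for part in parts if part not in CC_ELEMENTS]
    if unknown:
        raise ValueError(f"unknown license element(s): {unknown}")

    # Share-alike governs adaptations, while no-derivatives forbids sharing them
    if "SA" in parts and "ND" in parts:
        raise ValueError("SA and ND cannot be combined")

    return [CC_ELEMENTS[part] for part in parts]


if __name__ == "__main__":
    for line in describe_cc("CC BY-NC-SA"):
        print(line)
```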

Privacy and anonymization

  • Which data can be put openly online?
     
  • Which approaches ensure anonymity?
      More research needed (e.g. k-anonymity)
     
  • Reflected in different data policies
    • France: Systematic and informed publication of open data
    • Switzerland: More data available, but only under very restricted terms
    • Germany: Virtually no data available
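
To make the idea of k-anonymity concrete: a dataset is k-anonymous if every combination of quasi-identifier values (attributes that could be linked to a person) is shared by at least k records. The sketch below computes that k for a small made-up table; the function name, the column names, and the records are illustrative assumptions only.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values(), default=0)


if __name__ == "__main__":
    # Made-up travel survey records
    trips = [
        {"age_group": "20-29", "home_zone": "A", "mode": "car"},
        {"age_group": "20-29", "home_zone": "A", "mode": "bike"},
        {"age_group": "30-39", "home_zone": "B", "mode": "car"},
        {"age_group": "30-39", "home_zone": "B", "mode": "walk"},
        {"age_group": "30-39", "home_zone": "B", "mode": "car"},
    ]
    print(k_anonymity(trips, ["age_group", "home_zone"]))          # 2
    print(k_anonymity(trips, ["age_group", "home_zone", "mode"]))  # 1
```

Adding more attributes to the quasi-identifier set can only keep k the same or lower it, which is why detailed data is harder to publish openly.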

Open source != open access

  • Lovelace, R., 2020. Open access transport models: A leverage point in sustainable transport planning. Transport Policy 8.
     

  • "Open access" is about

    • making work reproducible

    • making work public

    • making work accessible to a wide audience

Using open data and software in transport research

  • Coming exercise sessions

    • Problems will be shown in detail during the sessions

    • If you want, try to follow the instructions in advance

    • Can you come up with more advanced analysis / fusion with other data?
       

  • Coming presentation

    • Use of open simulator MATSim for automated vehicle simulation

    • Use of open data in France / Paris to create synthetic populations

    • Exercise session: Do you see other use cases?

Thank you!

Questions?
