Tools and Best Practices in Quantitative Research

Mercè Crosas, IQSS, Harvard University
@mercecrosas

XXXVII Jornadas de Economía de la Salud,
Barcelona, 6-8 September

Some Examples

What you can do using quantitative data and methods:

  • quantitative and geospatial data
  • unstructured text as data

Imagine that you have data on all deaths among all Medicare beneficiaries in the US from 2000 to 2012 (~half a billion person-years) and want to model the effect of air pollution levels on mortality, controlling for other factors that also affect mortality (such as smoking and BMI).

The study concludes that PM2.5 levels below the current standard are still harmful.

What is used to compute?

  • Medicare data: 
    • 5 TB
    • Privacy requirements
  • Air pollution grids
    • 50 TB
  • Statistical model from survival analysis (Cox proportional hazards, fitted with an R package)
  • WorldMap (GIS) for geospatial visualizations
  • Massive computations performed on a secure cluster (1.3 years of combined runtime, ~700 CPUs, 24 TB of memory, on the IQSS Research Computing Environment)
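
To make the survival-analysis step concrete, here is a minimal sketch of a one-covariate Cox proportional hazards fit in plain Python. It is not the study's code (the slide says an R package was used); the function name `cox_fit` and the ten-subject data set are hypothetical, chosen only to show how the partial likelihood is maximized.

```python
import math

def cox_fit(times, events, x, iterations=25):
    """Fit a one-covariate Cox model by Newton-Raphson on the partial likelihood.

    times  : follow-up time per subject
    events : 1 if the death was observed, 0 if the subject was censored
    x      : covariate value per subject (for example, an exposure indicator)
    Returns (beta, standard_error). Tied times use the Breslow approximation.
    """
    beta, info = 0.0, 0.0
    for _ in range(iterations):
        grad, info = 0.0, 0.0
        for t_i, d_i, x_i in zip(times, events, x):
            if not d_i:
                continue  # censored subjects contribute only through risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            weights = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(weights)
            s1 = sum(w * x_j for w, x_j in zip(weights, risk))
            s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk))
            grad += x_i - s1 / s0                  # score contribution
            info += s2 / s0 - (s1 / s0) ** 2       # information contribution
        step = grad / info
        beta += step
        if abs(step) < 1e-10:
            break
    # The standard error uses the information from the final iteration.
    return beta, 1.0 / math.sqrt(info)

# Hypothetical toy data: 10 subjects, exposure coded 0/1.
times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
events = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]
exposed = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

beta, se = cox_fit(times, events, exposed)
print("log hazard ratio:", round(beta, 3), "standard error:", round(se, 3))
print("hazard ratio:", round(math.exp(beta), 3))
```

A real analysis at this scale would include many covariates, stratification, and a tested library implementation; the sketch only shows the estimating equations behind the model.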

source: https://www.linkedin.com/pulse/text-mining-its-applications-industry-subhajit-mukherjee

From Text to Quantitative Data

Consilience: a tool that enables you to quickly read, understand, categorize, and derive insights from large quantities of unstructured text.

What is used to compute?

  • On Microsoft Azure cloud
  • Using Java, Scala, Python, R
  • MongoDB (document database)
  • Apache Spark (large-scale data processing engine)
  • Elasticsearch (search and analytics engine)
  • Machine Learning Library (MLlib) for Spark
  • 6 worker VMs with 112 GB memory each (total 672 GB), 16 cores each (total 96 cores)
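
As a toy illustration of turning unstructured text into quantitative data, the sketch below builds a document-term count matrix in plain Python. It is not how Consilience works internally; the function names and the two example sentences are hypothetical, and a production pipeline would use tools such as the Spark MLlib stack listed above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical example documents.
documents = [
    "Air pollution harms health.",
    "Clean air improves health outcomes.",
]

vocabulary, matrix = document_term_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now a numeric vector, so the documents can be clustered, classified, or compared with standard statistical methods.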

Skills, Languages, and Tools

Computer Science

Software Engineering

Statistics

Machine Learning

Domain Expertise

Using Quantitative Methods in Today's Data-Intensive Research

Ista Zahn's Data Science Tools workshop, IQSS: https://rawgit.com/IQSS/workshops/master/DataScienceTools/DataScienceTools.html

Popularity of Tools/Languages

Robert Muenchen: http://r4stats.com/articles/popularity

  • Based on job postings, top: SQL, Python, Java, Hadoop, R
  • Based on scholarly articles, top: SPSS, R, SAS, Stata

R vs. Python Based on Job Postings

[Figure: job postings for Python and R]

The popularity of SPSS, Stata, and SAS is decreasing relative to Python, R, and Julia.

R Packages developed at Harvard's Institute for Quantitative Social Science (IQSS)

zeligproject.org

Everyone's Statistical Software

Best Practices

Gary King, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science, 44, Pp. 341–355. Copy at http://j.mp/2n65duA

 

"Convert the raw results of any specific statistical procedure into expressions that:

1) convey numerically precise measurements of the quantities of greatest substantive interest,
2) include reasonable measurements of uncertainties about those estimates,
3) require little specialized knowledge to understand"
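
One way to follow this advice is to simulate the quantity of interest from the estimated coefficients. The sketch below, in plain Python, turns hypothetical logistic regression estimates into a predicted probability with a 95% interval. The coefficient values, standard errors, and function name are assumptions made for illustration, and drawing the coefficients independently is a simplification; a full analysis would draw them jointly using the estimated covariance matrix.

```python
import math
import random
import statistics

def simulate_probability(intercept, slope, se_intercept, se_slope, x,
                         draws=10000, seed=1):
    """Simulate Pr(y = 1 | x) from a logistic model and summarize the draws.

    Coefficients are drawn from independent normal distributions
    (a simplifying assumption). Returns (mean, lower_95, upper_95).
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        a = rng.gauss(intercept, se_intercept)
        b = rng.gauss(slope, se_slope)
        sims.append(1.0 / (1.0 + math.exp(-(a + b * x))))
    sims.sort()
    lower = sims[int(0.025 * draws)]
    upper = sims[int(0.975 * draws) - 1]
    return statistics.mean(sims), lower, upper

# Hypothetical estimates: intercept -2.0 (SE 0.3), slope 0.05 (SE 0.01), x = 30.
mean, lower, upper = simulate_probability(-2.0, 0.05, 0.3, 0.01, x=30)
print(f"Predicted probability: {mean:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

The printed sentence reports a substantive quantity with its uncertainty, which a reader can interpret without knowing what a logit coefficient is.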

Best practices for reproducibility

  • Share data and code in open trusted repositories
  • Use persistent links from publication to data and code
  • Citing data and code should be standard practice
  • Document data, code, workflows, and computational environment
  • Use an open license for your code and data
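
As a small example of documenting the computational environment, the sketch below records the Python version, the platform, and a SHA-256 checksum for each data file, so that a later run can confirm it used the same inputs. The function names and the output layout are hypothetical; this complements, and does not replace, depositing data and code in a trusted repository.

```python
import hashlib
import json
import platform
import sys

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def describe_run(data_paths):
    """Collect basic environment details and checksums for the given data files."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "data_sha256": {path: sha256_of_file(path) for path in data_paths},
    }

# With no data files listed, only the environment details are reported.
print(json.dumps(describe_run([]), indent=2))
```

Saving this record next to the analysis output gives readers a concrete description of what was run and on which files.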

When writing methods, your code should be:

  1. Informatively documented 
  2. Open source
  3. Comprehensible and automatically tested
  4. Developed using version control
  5. Stored in a public repository 
  6. Clearly citable

Christopher Gandrud, from IQSS Software Best Practices workshop
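
Points 1 and 3 can be combined in a single function: the sketch below documents a small helper with a docstring whose examples double as automatic tests via Python's standard `doctest` module. The helper itself is hypothetical and is not taken from the workshop.

```python
import doctest

def hazard_ratio_percent_change(hazard_ratio):
    """Convert a hazard ratio into a percent change in risk.

    >>> hazard_ratio_percent_change(1.25)
    25.0
    >>> hazard_ratio_percent_change(0.5)
    -50.0
    """
    if hazard_ratio <= 0:
        raise ValueError("hazard ratio must be positive")
    return round((hazard_ratio - 1.0) * 100.0, 6)

if __name__ == "__main__":
    # Runs every docstring example and reports any mismatch.
    doctest.testmod()
```

Running the file checks the documented examples, so the documentation cannot silently drift away from the behavior of the code.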

Thanks

 

@mercecrosas

 

Harvard's Institute for Quantitative Social Science

@IQSS

iq.harvard.edu

AES Panel