Tools and Best Practices in Quantitative Research
Mercè Crosas, IQSS, Harvard University
@mercecrosas
XXXVII Jornadas de Economía de la Salud,
Barcelona, 6-8 September
Some Examples of What You Can Do Using Quantitative Data and Methods
- quantitative and geospatial data
- unstructured text as data
Imagine that you have data on all deaths among Medicare beneficiaries in the US from 2000 to 2012 (~half a million person-years) and want to model the effect of air pollution levels on mortality, controlling for other factors that also affect mortality (such as smoking and BMI).
The study concludes that levels of PM2.5 below the current standard are still harmful.
What is used to compute?
- Medicare data: 5 TB, with privacy requirements
- Air pollution grids: 50 TB
- Statistical model from survival analysis (Cox proportional hazards, R package)
- WorldMap (GIS) for geospatial visualizations
- Massive computations performed on a secure cluster (1.3 years of combined runtime, ~700 CPUs, 24 TB of memory, on the IQSS Research Computing Environment)
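To make the survival-analysis step concrete, here is a minimal sketch of what a Cox proportional hazards fit estimates: one log hazard ratio, found by maximizing the partial likelihood with Newton-Raphson. It is not the study's code (which used an R package on a secure cluster); the function name `cox_fit`, the eight toy subjects, and the single binary "exposed" covariate are all hypothetical, and the sketch assumes no tied event times.

```python
import math


def cox_fit(times, events, x, iterations=25):
    """Estimate one log hazard ratio by maximizing the Cox partial likelihood.

    Illustrative only: a single covariate and no tied event times.
    """
    beta = 0.0
    for _ in range(iterations):
        gradient = 0.0
        hessian = 0.0
        for i, (t_i, died) in enumerate(zip(times, events)):
            if not died:
                continue  # censored subjects contribute only through risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            weights = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(weights)
            s1 = sum(w * x[j] for w, j in zip(weights, risk))
            s2 = sum(w * x[j] ** 2 for w, j in zip(weights, risk))
            gradient += x[i] - s1 / s0
            hessian -= s2 / s0 - (s1 / s0) ** 2
        if hessian == 0:
            break
        beta -= gradient / hessian  # Newton-Raphson update
    return beta


# Hypothetical toy data: follow-up time, death indicator (0 = censored),
# and exposure (1 = high pollution area).
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 0, 1, 1, 0, 1]
exposed = [1, 1, 0, 1, 0, 0, 1, 0]

beta_hat = cox_fit(times, events, exposed)
hazard_ratio = math.exp(beta_hat)
print(f"log hazard ratio: {beta_hat:.3f}, hazard ratio: {hazard_ratio:.3f}")
```

A real analysis at this scale would rely on an established survival package, handle ties, and adjust for many covariates; the sketch only shows the quantity being estimated.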
source: https://www.linkedin.com/pulse/text-mining-its-applications-industry-subhajit-mukherjee
From Text to Quantitative Data
Consilience: a tool that enables you to quickly read, understand, categorize, and derive insights from large quantities of unstructured text.
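As a rough illustration of how unstructured text becomes quantitative data, the sketch below builds a document-term count matrix with the Python standard library. It is not how Consilience works internally (the slides do not say); the three example sentences and the names `tokenize`, `vocabulary`, and `matrix` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical documents standing in for a large text collection.
docs = [
    "Air pollution raises mortality risk.",
    "Mortality risk falls when pollution falls.",
    "Patients report long waiting times.",
]

# One word-count table per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# Shared vocabulary across all documents, in a fixed order.
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are vocabulary terms, cells are counts.
matrix = [[c[term] for term in vocabulary] for c in counts]

for doc, row in zip(docs, matrix):
    print(row, doc)
```

Once text is in this numeric form, standard statistical and machine learning methods (clustering, classification) can be applied to it.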
What is used to compute?
- On Microsoft Azure cloud
- Using Java, Scala, Python, R
- MongoDB (document database)
- Apache Spark (large-scale data processing engine)
- Elasticsearch (search and analytics engine)
- Machine Learning Library (MLlib) for Spark
- 6 worker VMs with 112 GB memory each (total 672 GB), 16 cores each (total 96 cores)
Skills, Languages, and Tools
Computer Science
Software Engineering
Statistics
Machine Learning
Domain Expertise
Using Quantitative Methods in Today's Data-Intensive Research
Ista Zahn's Data Science Tools workshop, IQSS: https://rawgit.com/IQSS/workshops/master/DataScienceTools/DataScienceTools.html
Popularity of Tools/Languages
Robert Muenchen: http://r4stats.com/articles/popularity
Based on job postings, top: SQL, Python, Java, Hadoop, R
Based on scholarly articles, top: SPSS, R, SAS, Stata
R vs Python, based on job postings
Popularity of SPSS, Stata, SAS is decreasing compared to Python, R, and Julia.
R Packages developed at Harvard's Institute for Quantitative Social Science (IQSS)
zeligproject.org
Everyone's Statistical Software
Best Practices
Gary King, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science, 44, Pp. 341–355. Copy at http://j.mp/2n65duA
"Convert the raw results of any specific statistical procedure into expressions that:
1) convey numerically precise measurements of the quantities of greatest substantive interest,
2) include reasonable measurements of uncertainties about those estimates,
3) require little specialized knowledge to understand"
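One way to act on this advice is simulation: draw coefficients from their estimated sampling distribution, convert each draw into the quantity of interest, and report a point estimate with an interval. The sketch below does this for a logistic model. The coefficient values and standard errors are hypothetical, and treating the two coefficients as independent is a simplifying assumption (a full treatment would draw from a multivariate normal using the estimated covariance matrix).

```python
import math
import random


def inv_logit(z):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(2017)  # fixed seed so the run is reproducible

# Hypothetical logistic regression output: (coefficient, standard error).
estimates = {"intercept": (-1.0, 0.2), "exposure": (0.8, 0.3)}

draws = []
for _ in range(5000):
    b0 = rng.gauss(*estimates["intercept"])
    b1 = rng.gauss(*estimates["exposure"])
    # Quantity of interest: change in predicted probability
    # when exposure goes from 0 to 1.
    draws.append(inv_logit(b0 + b1) - inv_logit(b0))

draws.sort()
point = sum(draws) / len(draws)
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

print(f"Exposure changes the probability by {point:.2f} "
      f"(95% interval {lower:.2f} to {upper:.2f})")
```

The printed sentence is a statement a non-specialist can read: a change in probability with its uncertainty, instead of a raw logit coefficient.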
Best practices for reproducibility
- Share data and code in open trusted repositories
- Use persistent links from publication to data and code
- Citing data and code should be standard practice
- Document data, code, workflows, and computational environment
- Use an open license for your code and data
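Documenting the computational environment can start small. The sketch below builds a manifest that records the Python version, the platform, and a SHA-256 checksum for each file, so that others can check they are running the same code on the same data. The function name `build_manifest`, the file names, and the in-memory file contents are hypothetical; in practice the bytes would be read from the real files and the manifest deposited alongside them.

```python
import hashlib
import json
import platform


def sha256_of(content: bytes) -> str:
    """Return the SHA-256 checksum of a file's contents as hex."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(files):
    """Describe the environment and the exact files used in an analysis.

    `files` maps a file name to its contents as bytes.
    """
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "files": {name: sha256_of(content)
                  for name, content in sorted(files.items())},
    }


# Hypothetical stand-ins for a script and a data file.
manifest = build_manifest({
    "analysis.py": b"print('hello')\n",
    "data.csv": b"id,value\n1,2\n",
})
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

A manifest like this complements, but does not replace, depositing the data and code in a trusted repository with a persistent identifier.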
When writing methods, your code should be:
- Informatively documented
- Open source
- Comprehensible and automatically tested
- Developed using version control
- Stored in a public repository
- Clearly citable
Christopher Gandrud, from IQSS Software Best Practices workshop
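As a small sketch of "informatively documented" and "automatically tested", the example below pairs a documented function with unit tests from Python's standard `unittest` module. The function `rate_per_100k` and its test cases are hypothetical and are not taken from the workshop.

```python
import unittest


def rate_per_100k(events, population):
    """Return the number of events per 100,000 people.

    Raises ValueError if `population` is not positive, so that a bad
    denominator fails loudly instead of producing a misleading rate.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return events / population * 100_000


class RatePer100kTest(unittest.TestCase):
    def test_known_value(self):
        # 5 events among 200,000 people is 2.5 per 100,000.
        self.assertAlmostEqual(rate_per_100k(5, 200_000), 2.5)

    def test_rejects_empty_population(self):
        with self.assertRaises(ValueError):
            rate_per_100k(1, 0)
```

Saved in a file, the tests can be run with `python -m unittest`, and a continuous integration service can run them on every commit to the version-controlled repository.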
Thanks
@mercecrosas
Harvard's Institute for Quantitative Social Science
@IQSS
iq.harvard.edu