Social and Political Data Science: Introduction

Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Introduction 

Ackoff, R.L., 1989. From data to wisdom. Journal of Applied Systems Analysis, 16(1), pp.3-9.

  • 1700s Agricultural Revolution 

  • 1780 Industrial Revolution

  • 1940 Information Revolution

  • 1950s Digital Revolution

  • Knowledge Revolution

  • Data Revolution

Grolemund and Wickham: Data analytics: 

  • Hypothesis generation

  • Hypothesis confirmation

Which one goes first?

Trevor Hastie and Robert Tibshirani

Judea Pearl and Dana MacKenzie

Statistical Modeling:
The Two Cultures 

Leo Breiman 2001: Statistical Science 

One assumes that the data are generated by a given stochastic data model.
The other uses algorithmic models and treats the data mechanism as unknown.

  • Data Model: small data

  • Algorithmic Model: complex, big data

Theory:
Data Generation Process

Data are generated in many ways. Picture this: an independent variable x goes in one side of the box (call it nature for now) and a dependent variable y comes out the other side.


Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

Read this as: the response variable is a function of a series of predictor (independent) variables, plus random noise (normally distributed errors) and parameters.
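
In symbols, the data model can be sketched as follows. The notation (\(y_i\), \(x_i\), \(\beta\), \(\varepsilon_i\), \(\sigma^2\)) and the linear special case are illustrative assumptions, not taken from Breiman's text:

\[ y_i = f(x_i, \varepsilon_i; \beta), \qquad i = 1, \dots, N \]

A familiar special case is linear regression with \(p\) predictors and independent normal errors:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \]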


The values of the parameters are estimated from the data, and the model is then used for information and/or prediction.
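
A minimal sketch of this workflow, assuming a one-predictor linear data model. The function names `simulate` and `ols`, the parameter values, and the seed are all hypothetical choices for illustration:

```python
import random

def simulate(n, beta0, beta1, sigma, seed=0):
    """Draw n independent (x, y) pairs from y = beta0 + beta1*x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def ols(xs, ys):
    """Closed-form least-squares estimates for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# "Nature" is assumed to follow the data model with these made-up parameters.
xs, ys = simulate(n=500, beta0=2.0, beta1=0.5, sigma=1.0)

# Step 1: estimate the parameters from the data.
b0_hat, b1_hat = ols(xs, ys)
print(f"intercept estimate: {b0_hat:.2f}, slope estimate: {b1_hat:.2f}")

# Step 2: use the fitted model for prediction at a new x.
prediction = b0_hat + b1_hat * 4.0
print(f"predicted y at x = 4: {prediction:.2f}")
```

The estimates should land close to the values used in the simulation, which is what makes the fitted parameters informative when the assumed data model is right.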


 Algorithmic Modeling

The analysis in this culture treats the inside of the box as complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y.

The goal is to find an algorithm that accurately predicts y.
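
One possible illustration, assuming k-nearest-neighbor averaging as the algorithm (the slides do not name a specific method). The function `knn_predict` and the training data are hypothetical:

```python
def knn_predict(train_x, train_y, x_new, k=3):
    """Predict y at x_new as the mean response of the k nearest training points."""
    # Rank training points by distance to x_new; no functional form is assumed.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x_new))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Made-up training data: "nature" here happens to be y = x**2,
# but the algorithm never uses that fact.
train_x = [float(x) for x in range(11)]
train_y = [x ** 2 for x in train_x]

pred = knn_predict(train_x, train_y, x_new=4.2, k=3)
print(pred)  # average of the responses observed at x = 3, 4 and 5
```

Nothing about the mechanism inside the box is estimated here; the only thing judged is how well the output predicts y.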


Supervised Learning vs. Unsupervised Learning

Source: https://www.mathworks.com

Machine Learning and Conventional statistical methods

Source: Attewell, Paul A. & Monaghan, David B. 2015. Data Mining for the Social Sciences: an Introduction, Table 2.1, p.  27


  • Statistics: testing hypotheses

  • Machine learning: finding the right hypothesis

  • Overlap:
    Decision trees (C4.5 and CART)
    Nearest-neighbor methods

  • Bridging the two:
    Most machine learning algorithms employ statistical techniques

Prediction trade-off

  • Accuracy 

  • Accuracy vs. Interpretability

  • On Interpretability:

    • Linear models are easy to interpret

    • Splines and loess are not

    • Decision on what to include?

  • Good fit versus over-fit or under-fit

    • How do we know when the fit is just right?

    • fit vs. model fit in regression

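
One way to see over-fit and under-fit is to compare error on the training data with error on fresh data. The sketch below assumes a k-nearest-neighbor smoother and a made-up truth, y = sin(x) plus noise; all names and settings are hypothetical:

```python
import math
import random

def knn_predict(train, x_new, k):
    """Average the responses of the k training points closest to x_new."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x_new))
    return sum(y for _, y in ranked[:k]) / k

def mse(train, data, k):
    """Mean squared error of the k-NN fit (built on train) evaluated on data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(1)

def draw(n):
    """Draw n points from the assumed truth y = sin(x) + normal noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        points.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return points

train, test = draw(60), draw(60)

# k = 1 memorizes the training data (most flexible), k = 60 predicts the
# overall mean everywhere (least flexible), k = 7 sits in between.
for k in (1, 7, 60):
    print(k, round(mse(train, train, k), 3), round(mse(train, test, k), 3))
```

With k = 1 the training error is exactly zero, yet the error on new data is typically worse than for a moderate k; with k = 60 both errors are large. A fit is "just right" when the error on data not used for fitting is lowest.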

Prediction trade-off (cont'd)

  • Parsimony versus all-in model

    • "less explains more" vs. "black box"


If we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately; interpretability is not a concern.

--> Most flexible f

Parametric vs. Non-parametric models

  • Sample and population

  • Generalization

  • Representation

Supervised vs. Unsupervised Machine Learning

  • With vs. without known \(y\)

  • From a statistical point of view, unsupervised machine learning:

    • Identify patterns in the data
    • Seek information about \(y\) or the dependent variable
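
A small sketch of learning without a known \(y\), assuming k-means clustering with two clusters on one variable. The function `two_means` and the data values are made up for illustration:

```python
def two_means(values, iterations=10):
    """Split 1-D values into two clusters (k-means with k = 2)."""
    centers = [min(values), max(values)]  # simple starting guess
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach each value to its closest center.
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Update step: move each center to the mean of its members.
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

# No outcome variable is supplied, only the measurements themselves.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, labels = two_means(values)
print(centers, labels)
```

The algorithm recovers two groups from the pattern in the data alone. A supervised method would instead be handed the group labels as \(y\) and learn to predict them.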

  • Outcome measurement \(Y\) (also called dependent variable, response, target).

  • Vector of \(p\) predictor measurements \(X\) (also called inputs, regressors, covariates, features, independent variables).

Regression problem
\(Y\) is quantitative (income, crime rate) 

Classification problem
\(Y\) is qualitative (voted or not, survival, Democrat) 

We have training data \((x_1,y_1),...,(x_N,y_N)\). These are observations (examples, instances) of these measurements.
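
To make the two problem types concrete, here is a sketch of the simplest possible baseline for each, ignoring the predictors entirely. The data values, variable names and helper functions are hypothetical:

```python
from collections import Counter
from statistics import mean

# Made-up training data as (x_i, y_i) pairs; x is years of education.
regression_data = [(12, 32.0), (16, 45.0), (18, 51.0), (14, 38.0)]   # y: income in $1000s
classification_data = [(12, "voted"), (16, "did not vote"),
                       (18, "voted"), (14, "voted")]                 # y: turnout

def baseline_regression(pairs):
    """Quantitative Y: predict the mean of the observed responses."""
    return mean(y for _, y in pairs)

def baseline_classification(pairs):
    """Qualitative Y: predict the most frequent observed class."""
    return Counter(y for _, y in pairs).most_common(1)[0][0]

print(baseline_regression(regression_data))          # a number
print(baseline_classification(classification_data))  # a category label
```

Any model that uses \(X\) should beat these baselines on new data; the type of \(Y\) determines which kind of prediction and which kind of error measure applies.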

Linear vs. Non-linear models

  • Simplicity vs. Overfitting

  • Modeling methods

  • Dimensionality

Statistical Learning methods: 

Source: ISLR Figure 2.7, p. 25

The main goal of science is to “discover explanatory principles” of how a system actually works, and the “right approach” to achieve that goal is to:


“let the theory guide the data”

- Noam Chomsky, father of modern linguistics

“... we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have:


the unreasonable effectiveness of data.”

- Peter Norvig, Director of Research at Google

“... the bulk of human knowledge is organized around causal, not probabilistic relationships, and the grammar of probability calculus is insufficient for capturing those relationships… It is for this reason that I consider myself only a half-Bayesian.”

- Judea Pearl, author of The Book of Why

Study the science of art. Study the art of Science. Develop your senses - especially learn how to see. Realize that everything connects to everything else.

- Leonardo da Vinci (1452 - 1519)

Common ground:

 

 

Statistical learning methods without understanding the variables tend to yield better predictions than the theoretical approach that attempts to model how the variables relate to each other.


Everyone is a teacher because the ability to teach anything connects to the ability to learn what is being taught.

- John Sibert

Learn like you are to teach.

Q & A

Question: Why R?

Answer: There are many software options, including Python, RapidMiner, Weka, SAS JMP and SPSS. R is the most accessible and is not proprietary to any specific method.

Question: Can I take other courses?

Answer: Yes. Recommended: GISC6323 Machine Learning for Socio-Economic and Geo-Referenced Data by Dr. Michael Tiefelsdorf. Online: DataCamp.

Question: Is advanced math a prerequisite?

Answer: No, but it will help. Recommended: A Mathematics Course for Political and Social Research, by Will H. Moore and David A. Siegel

Question: Why only focus on prediction? What about inference?

Answer: New developments in data science are actually putting more emphasis on inference. This course is designed to bridge the two.