Social and Political Data Science: Introduction

Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas


Ackoff, R.L., 1989. From data to wisdom. Journal of applied systems analysis, 16(1), pp.3-9.

  • 1700s Agricultural Revolution 

  • 1780 Industrial Revolution

  • 1940 Information Revolution

  • 1950s Digital Revolution

  • Knowledge Revolution

  • Data Revolution

Grolemund and Wickham on data analytics:

  • Hypothesis generation

  • Hypothesis confirmation

Which one goes first?

Trevor Hastie and Robert Tibshirani

Judea Pearl and Dana MacKenzie

Statistical Modeling: The Two Cultures

Leo Breiman 2001: Statistical Science 

  • Data Model (small data): one culture assumes that the data are generated by a given stochastic data model.

  • Algorithmic Model (complex, big data): the other uses algorithmic models and treats the data mechanism as unknown.

Data Generation Process

Data are generated in many ways. Picture this: an independent variable x goes in one side of a box (call it nature for now) and a dependent variable y comes out the other side.

Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

That is, the response variable is a function of a set of predictor (independent) variables, random noise (for example, normally distributed errors), and parameters.

The values of the parameters are estimated from the data, and the model is then used for information and/or prediction.
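A minimal Python sketch of the data-model workflow, under assumptions not in the slides: f is taken to be linear, the noise is normal, and the "true" parameter values (2.0 and 0.5) are made up for the simulation. The helper names simulate and fit_ols are hypothetical.

```python
import random

def simulate(n, intercept, slope, noise_sd, seed=0):
    """Independent draws from y = intercept + slope * x + Normal(0, noise_sd)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_ols(xs, ys):
    """Estimate intercept and slope by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# "Nature" is simulated here with assumed parameters (illustrative values only).
xs, ys = simulate(n=500, intercept=2.0, slope=0.5, noise_sd=1.0)

# Step 1: estimate the parameters from the data.
intercept_hat, slope_hat = fit_ols(xs, ys)
print("estimated intercept:", round(intercept_hat, 3))
print("estimated slope:", round(slope_hat, 3))

# Step 2: use the fitted model for prediction at a new x.
prediction = intercept_hat + slope_hat * 4.0
print("predicted y at x = 4:", round(prediction, 3))
```

The estimates are interpretable (information about how x relates to y) only to the extent that the assumed model resembles the real mechanism inside the box.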

 Algorithmic Modeling

The analysis in this culture considers the inside of the box complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y.

The goal is to find an algorithm that accurately predicts y.
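A minimal Python sketch of the algorithmic approach, assuming k-nearest neighbors as the algorithm (one of many possible choices, not named in the slides). The simulated mechanism (a sine curve plus noise) and the names knn_predict and mean_squared_error are hypothetical; the algorithm never sees the mechanism and is judged only by how well it predicts held-out y.

```python
import math
import random

def knn_predict(train_x, train_y, x_new, k=5):
    """Predict y at x_new as the mean response of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Simulated "nature": the algorithm is not told this mechanism.
rng = random.Random(1)
xs = [rng.uniform(0, 2 * math.pi) for _ in range(400)]
ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]

# Hold out the last 100 observations to measure predictive accuracy.
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]

knn_pred = [knn_predict(train_x, train_y, x) for x in test_x]
knn_mse = mean_squared_error(test_y, knn_pred)

# Baseline for comparison: always predict the training mean of y.
baseline = sum(train_y) / len(train_y)
baseline_mse = mean_squared_error(test_y, [baseline] * len(test_y))

print("kNN test MSE:", round(knn_mse, 3))
print("baseline test MSE:", round(baseline_mse, 3))
```

No parameters of the mechanism are estimated or interpreted here; the held-out error is the only yardstick.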

Supervised Learning vs. Unsupervised Learning
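A minimal Python sketch of the contrast, assuming the usual definitions: supervised learning trains on known responses (labels), while unsupervised learning looks for structure in x alone. The two simulated groups and the names fit_centroids, classify and two_means are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

# Simulated one-dimensional data from two groups (illustrative values only).
rng = random.Random(2)
xs = [rng.gauss(0.0, 0.5) for _ in range(50)] + [rng.gauss(5.0, 0.5) for _ in range(50)]
labels = ["a"] * 50 + ["b"] * 50

# Supervised: labels are available, so learn one centroid per label.
def fit_centroids(xs, labels):
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: mean(members) for label, members in groups.items()}

def classify(centroids, x):
    """Assign x to the label whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Unsupervised: no labels; find two cluster centers from x alone (2-means).
def two_means(xs, iterations=20):
    centers = [min(xs), max(xs)]
    for _ in range(iterations):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        centers = [mean(cluster) for cluster in clusters]
    return sorted(centers)

centroids = fit_centroids(xs, labels)
centers = two_means(xs)
print("supervised centroids:", centroids)
print("label predicted for x = 4.6:", classify(centroids, 4.6))
print("unsupervised cluster centers:", centers)
```

In this well-separated example both routes recover similar centers, but only the supervised one can attach the names "a" and "b" to them.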


Everyone is a teacher because the ability to teach anything connects to the ability to learn what is being taught.

- John Sibert

Learn like you are to teach.

Q & A

Question: Why R?

Answer: There are many software and platform options, including Python, Weka, SAS JMP and SPSS. R is the most accessible and is not proprietary or tied to a specific method. Coupled with other features and systems (e.g., visualization and parallel processing), it supports data programming and presentation in one coherent ecosystem.

Question: Can I take other courses?

Answer: Yes. Recommended: GISC6323 Machine Learning for Socio-Economic and Geo-Referenced Data by Dr. Michael Tiefelsdorf. Online: DataCamp.

Question: Is advanced math a prerequisite?

Answer: No, but it will help. Recommended: A Mathematics Course for Political and Social Research, by Will H. Moore and David A. Siegel.

Question: Why only focus on prediction? What about inference?

Answer: New developments in data science are actually putting more emphasis on inference. This course is designed to bridge the two.