Data Science For All: New Visions in Data Science Revolution

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Introduction to course

Objectives:

By the end of this module, students will be able to:

1. Be familiar with the scope and methods in Data Science
2. Understand statistical concepts in small data and big data research
3. Be knowledgeable about data collection, data production and open data
4. Recognize the limitations of Data Science
5. Be introduced to the latest trends in Data Science Revolution

"Knowledge itself is Power."

$Power=f(Size_{Knowledge},Veracity_{Knowledge},Speed_{Knowledge})$

Ackoff, R.L., 1989. From data to wisdom. Journal of Applied Systems Analysis, 16(1), pp.3-9.

What is Data Science?

1. Science of Data

2. Understand Data Scientifically

How are data generated?

The size of the digital universe will double every two years at least.

- InsideBigdata.com

Data Literacy

1. Data generating process
2. Graphic grammar
3. Statistical judgement

Data Literacy

1. Data generating process
   1. How data are generated
   2. Distribution
   3. Missing values
   4. Wrong data
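
A minimal sketch of what checking for missing values and wrong data can look like in practice. The `ages` column, the `audit` helper, and the plausible range of 0 to 120 are all hypothetical choices made for illustration, not part of the course material.

```python
import statistics

# Hypothetical age column: None marks a missing value, and -4 and 212
# are implausible entries standing in for "wrong data".
ages = [34, 29, None, 41, -4, 38, None, 212, 25, 47]

def audit(values, low, high):
    """Split a column into valid entries, a missing count, and out-of-range entries."""
    missing = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    wrong = [v for v in present if not (low <= v <= high)]
    valid = [v for v in present if low <= v <= high]
    return valid, missing, wrong

# Assumed plausible range for a human age; adjust to the variable at hand.
valid, missing, wrong = audit(ages, low=0, high=120)

print("missing:", missing)
print("wrong:", wrong)
print("median of valid ages:", statistics.median(valid))
```

Reporting the missing and out-of-range entries separately, rather than silently dropping them, keeps the data generating process visible to whoever reads the summary.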

Data Literacy

1. Graphic grammar
   1. Bad charts deliver incorrect messages
   2. Poor design
   3. Color
   4. Label
   5. Scale

Data Literacy

1. Statistical understanding
   1. Size does (not) matter
   2. Representativeness does
   3. Forecast/prediction-minded
   4. Explanation
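
The point about size versus representativeness can be illustrated with a small simulation. This is a sketch under stated assumptions: the population, the score variable, and the rule that only people scoring above 50 respond are all hypothetical.

```python
import random
import statistics

random.seed(0)

# Hypothetical population of 100,000 people; the target is the mean score.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Large but unrepresentative sample: only people scoring above 50 respond.
big_biased = [x for x in population if x > 50][:20_000]

# Small but representative sample: a simple random sample of 500.
small_random = random.sample(population, 500)

# Absolute error of each sample mean relative to the population mean.
err_biased = abs(statistics.fmean(big_biased) - true_mean)
err_random = abs(statistics.fmean(small_random) - true_mean)

print(f"n = {len(big_biased):>6}, biased sample error: {err_biased:.2f}")
print(f"n = {len(small_random):>6}, random sample error: {err_random:.2f}")
```

Under these assumptions the sample that is forty times larger is further from the truth, because adding more observations does not remove selection bias.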

Data Literacy

1. Why do we need numeric data?
2. History of data

Statistical Modeling: The Two Cultures

Leo Breiman 2001: Statistical Science

Data Model: assumes that the data are generated by a given stochastic data model. (Small data)
Algorithmic Model: uses algorithmic models and treats the data mechanism as unknown. (Complex, big data)

Theory: Data Generation Process

Data are generated in many ways. Picture this: an independent variable x goes in one side of a box (we call it nature for now), and a dependent variable y comes out the other side.

Theory: Data Generation Process

Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

Reading: the response variable is a function of a series of predictor (independent) variables, plus random noise (normally distributed errors) and other parameters.

Theory: Data Generation Process

Data Model

The values of the parameters are estimated from the data, and the model is then used for information and/or prediction.
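
As a minimal sketch of the data modeling culture, the example below assumes a linear data model, draws data from it, estimates the parameters by ordinary least squares, and then uses the fitted model for prediction. The specific intercept, slope, and noise level are hypothetical values chosen for illustration, and `fit_line` is a helper written for this example.

```python
import random

random.seed(1)

# Assumed data model (hypothetical values): y = 2.0 + 0.5 * x + noise,
# with noise drawn independently from a normal distribution.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 2.0, 0.5, 1.0

x = [random.uniform(0, 10) for _ in range(1_000)]
y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + random.gauss(0, NOISE_SD) for xi in x]

def fit_line(x, y):
    """Ordinary least squares for one predictor (closed-form solution)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Estimate the parameters from the data.
intercept, slope = fit_line(x, y)
print(f"estimated intercept: {intercept:.2f}, slope: {slope:.2f}")

# Use the fitted model for prediction at a new predictor value.
prediction = intercept + slope * 4.0
print(f"predicted response at x = 4.0: {prediction:.2f}")
```

The estimates are only as trustworthy as the assumed model: if nature does not generate the data this way, the parameters still get estimated but lose their interpretation.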

Theory: Data Generation Process

Algorithmic Modeling

The analysis in this culture considers the inside of the box complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y.

The goal is to find an algorithm that accurately predicts y.
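
A minimal sketch of the algorithmic approach, using k-nearest-neighbors regression as one possible choice of algorithm (it is an illustration, not a method named in the source). The `nature` function is hypothetical and stands in for the unknown box: the analyst only sees its (x, y) output and judges the algorithm by held-out predictive accuracy.

```python
import math
import random

random.seed(2)

# Hypothetical "nature": the analyst never inspects this function,
# only the (x, y) pairs it produces.
def nature(x):
    return math.sin(x) + random.gauss(0, 0.1)

xs = [random.uniform(0, 6) for _ in range(400)]
data = [(x, nature(x)) for x in xs]
train, test = data[:300], data[300:]

def knn_predict(train, x_new, k=5):
    """Predict y as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, test, k=5):
    """Average squared prediction error on data not used for fitting."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return sum(errors) / len(errors)

mse = mean_squared_error(train, test)
print(f"held-out mean squared error: {mse:.3f}")
```

No parameters of a stochastic model are estimated here; the only criterion is how well f(x) predicts responses it has not seen.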

Theory: Data Generation Process

Algorithmic Modeling

Supervised Learning vs. Unsupervised Learning
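
The contrast can be sketched on a toy example. With labels available, a supervised method learns from them (here a nearest-centroid classifier); without labels, an unsupervised method has to discover the groups itself (here a basic two-cluster k-means loop). The data, labels, and helper names are hypothetical, and both methods are simple stand-ins chosen for brevity.

```python
import statistics

# Hypothetical one-dimensional measurements forming two groups.
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Supervised: labels are given, so learn one centroid per known class.
def fit_centroids(values, labels):
    return {
        lab: statistics.fmean([v for v, l in zip(values, labels) if l == lab])
        for lab in set(labels)
    }

def classify(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda lab: abs(centroids[lab] - x))

# Unsupervised: no labels, so find two group centers with a basic 2-means loop.
# Assumes the data contain at least two distinct values.
def two_means(values, iterations=10):
    low, high = min(values), max(values)
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = statistics.fmean(near_low), statistics.fmean(near_high)
    return low, high

centroids = fit_centroids(values, labels)
print("supervised prediction for 4.5:", classify(centroids, 4.5))
print("unsupervised cluster centers:", two_means(values))
```

On this toy data both routes recover the same two groups; the difference is that only the supervised version can attach the names "a" and "b" to them.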

Source: https://www.mathworks.com

Hans Rosling

Swedish physician and statistician

• Founded Gapminder Foundation
• Visualize historical data on public health and poverty
