Data Science For All: New Visions in Data Science Revolution

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas


  1. Introduction to course

  2. Data Science Fundamentals

  3. Two cultures of Statistics

  4. Big data and small data: actions, transactions and interactions

  5. Data Made, Data Found

  6. Open data and API

Introduction to course

Objectives:
 

By the end of this module, students will be able to:

  1. Be familiar with the scope and methods in Data Science
  2. Understand statistical concepts in small data and big data research
  3. Be knowledgeable about data collection, data production and open data
  4. Recognize the limitations of Data Science
  5. Be introduced to the latest trends in Data Science Revolution

- Sir Francis Bacon

"ipsa scientia potestas est"

"Knowledge itself is Power."

Power = f(Size_{Knowledge}, Veracity_{Knowledge}, Speed_{Knowledge})

Ackoff, R.L., 1989. From data to wisdom. Journal of applied systems analysis, 16(1), pp.3-9.

"Knowledge is data."

"Data is power."

  • 1700s Agricultural Revolution 

  • 1780 Industrial Revolution

  • 1940 Information Revolution

  • 1950s Digital Revolution

  • Knowledge Revolution

  • Data Revolution

Data Science Fundamentals

  • What is data?

  • What is data science?

  • How are data generated?

  • Data literacy

What is Data?

Data is everything.


 

  • Data is ever growing...

    • Moore's Law

    • Parkinson's Law

 

 

What is Big Data?

Big data is data whose volume is too large to fit on one computer, which comes in a wide variety of types, locations, formats and forms, and which is created very fast (velocity) (Doug Laney 2001).

What is Big Data?

Burt Monroe (2012)

5Vs of Big data 

  • Volume

  • Variety

  • Velocity

  • Vinculation

  • Validity 

  • Prediction-explanation gap

  • Induction-deduction gap

  • Bigness-representativeness gap

  • Data access gap

What is Data Science?

  1. Science of Data

  2. Understand Data Scientifically

Data Science Keywords

  • Data management

  • Data analytics

  • Data scientists

  • Data curation

  • Modeling

  • CRMs 

How are data generated?

  • Computers

  • Web

  • Mobile devices

  • IoT (Internet of Things)

  • Further extension of human users (e.g. AI, avatars)

How are data generated?

"Data Lake" Ubiquitous

A data lake is a massive repository that holds data in its rawest form, pending processing.

Data Literacy

  1. Data generating process
  2. Graphic grammar
  3. Statistical judgement

 

Data Literacy

  1. Data generating process
    1. How data are generated
    2. Distribution
    3. Missing values
    4. Wrong data

 

Data Literacy

  2. Graphic grammar
    1. Bad charts deliver incorrect messages
    2. Poor design
    3. Color
    4. Label
    5. Scale

Data Literacy

  3. Statistical understanding
    1. Size does (not) matter
    2. Representativeness does
    3. Forecast/prediction minded
    4. Explanation

Data Literacy

  1. Why do we need numeric data?
  2. History of data

Darkest hour: Churchill and typist

Two cultures of Statistics

  • Data model

  • Algorithm model

Statistical Modeling:
The Two Cultures 

Leo Breiman 2001: Statistical Science 

  • Data Model (small data): assumes that the data are generated by a given stochastic data model.

  • Algorithmic Model (complex, big data): uses algorithmic models and treats the data mechanism as unknown.

Theory:
Data Generation Process

Data are generated in many ways. Picture this: an independent variable x goes in one side of a box (we call it nature for now) and a dependent variable y comes out the other side.

Theory:
Data Generation Process

Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

That is, the response variable is a function of a series of predictor (independent) variables, plus random noise (normally distributed errors) and other parameters.
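One concrete instance of the general form above is the simple linear regression model. The linear form and the symbols \beta_0, \beta_1 and \sigma^2 are illustrative assumptions, not taken from the slides:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
\qquad i = 1, \dots, n
```

Here y_i is the response variable, x_i the predictor variable, \varepsilon_i the random noise, and (\beta_0, \beta_1, \sigma^2) the parameters.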

Theory:
Data Generation Process

Data Model

The values of the parameters are estimated from the data, and the model is then used for information and/or prediction.
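A minimal sketch of this workflow, assuming a linear data model with normally distributed noise. The "true" parameter values, the sample size and the function names are hypothetical choices for illustration. Data are drawn from the assumed model, the parameters are estimated by ordinary least squares, and the fitted model is used for prediction.

```python
import random

def simulate(n, beta0, beta1, sigma, seed=0):
    """Draw n independent (x, y) pairs from y = beta0 + beta1 * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def fit_ols(xs, ys):
    """Estimate intercept and slope by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    """Use the fitted model to predict the response at a new x."""
    return intercept + slope * x

# Hypothetical "true" parameters; in real research these are unknown.
xs, ys = simulate(n=500, beta0=2.0, beta1=0.5, sigma=1.0)
b0_hat, b1_hat = fit_ols(xs, ys)
print(f"estimated intercept={b0_hat:.2f}, slope={b1_hat:.2f}")
print(f"prediction at x=4: {predict(b0_hat, b1_hat, 4.0):.2f}")
```

Because the data really do come from the assumed model here, the estimates land close to the values used in the simulation. With real data that match cannot be taken for granted.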

Theory:
Data Generation Process

 Algorithmic Modeling

The analysis in this approach considers the inside of the box complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y.

The goal is to find an algorithm that accurately predicts y.
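A minimal sketch of the algorithmic approach, using k-nearest-neighbors regression as one possible choice of f(x). The algorithm, the simulated "nature" and the function names are illustrative assumptions. No model of the box is written down, and the algorithm is judged only by how well it predicts held-out data.

```python
import math
import random

def knn_predict(train, x, k=5):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mean_squared_error(train, test, k=5):
    """Average squared prediction error on held-out data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in test) / len(test)

# Hypothetical "nature": the algorithm never sees this formula, only (x, y) pairs.
rng = random.Random(1)
data = [(x, math.sin(x) + rng.gauss(0, 0.1))
        for x in (rng.uniform(0, 6) for _ in range(300))]
train, test = data[:200], data[200:]

mse = mean_squared_error(train, test, k=5)
print(f"held-out mean squared error: {mse:.3f}")
```

The split into training and test data is the key design choice: accuracy is measured on observations the algorithm has not seen.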

Theory:
Data Generation Process

 Algorithmic Modeling

Supervised Learning vs. Unsupervised Learning

Source: https://www.mathworks.com
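The contrast can be sketched on a toy example. The data, labels and function names below are hypothetical, and nearest-centroid classification and two-cluster k-means are only one simple choice for each side. The supervised method learns from known labels, while the unsupervised method has to find the groups by itself.

```python
from statistics import mean

def nearest_centroid_fit(xs, labels):
    """Supervised: use the known labels to compute one centroid per class."""
    classes = sorted(set(labels))
    return {c: mean(x for x, lab in zip(xs, labels) if lab == c) for c in classes}

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))

def two_means(xs, iterations=10):
    """Unsupervised: split unlabeled 1-D data into two clusters (k-means, k=2)."""
    lo, hi = min(xs), max(xs)  # start the two centers at the extremes
    for _ in range(iterations):
        left = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        right = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(left), mean(right)
    return lo, hi

# Hypothetical toy data: heights in cm, with labels for the supervised case.
heights = [150, 152, 155, 158, 176, 180, 182, 185]
labels = ["short", "short", "short", "short", "tall", "tall", "tall", "tall"]

centroids = nearest_centroid_fit(heights, labels)
print(nearest_centroid_predict(centroids, 160))  # prediction learned from labels
print(two_means(heights))                        # cluster centers found without labels
```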

Let the dataset change your mindset.

 

- Hans Rosling

Hans Rosling

Swedish physician and statistician

  • Founded Gapminder Foundation
  • Visualize historical data on public health and poverty

 

- Hal Varian

The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.

Data Made, Data Found

  • Small data

  • Big data

Big data and small data:

[Figure: diagram of individuals, labeled with actions, interactions and transactions]

Open data and API

  • What is open data?

  • Application Programming Interface (API)

    • Google

    • Twitter

    • Government
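A minimal sketch of how a program typically talks to a web API: build a URL with query parameters, request it, and parse the JSON that comes back. The endpoint, the parameter names and the sample response are hypothetical and do not belong to any of the providers listed above. Real APIs usually also require an access key and have their own documented parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_url(base_url, **params):
    """Append query parameters to an API endpoint."""
    return f"{base_url}?{urlencode(params)}"

def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

def extract_field(records, field):
    """Pull one field out of a list of JSON records, skipping missing values."""
    return [r[field] for r in records if field in r]

# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.org/v1/observations", city="Taipei", year=2020)
print(url)

# A made-up canned response stands in for fetch_json(url) so the sketch runs offline.
sample = json.loads(
    '[{"month": 1, "temp_c": 16.5}, {"month": 2, "temp_c": 17.1}, {"month": 3}]'
)
print(extract_field(sample, "temp_c"))
```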

Data Generation

Data Methods

  1. Survey

  2. Experiments

  3. Qualitative Data

  4. Text Data

  5. Web Data

  6. Machine Data

  7. Complex Data

    1. Network Data

    2. Multiple-source linked Data

[Figure: braces group the data methods listed above into "Made Data" and "Found Data"]

Quick Analytics:
Taiwan Climate

Spatial Data: United States

JavaScript: D3 Library

Sentiment Analysis