Data Acquisition

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Data Programming with R

Overview

In this module, we will help you:

Understand the data generation process in the big data age
Learn how to collect web data and social data
Illustration: Open data
- collecting stock data
- collecting COVID data
Illustration: API

What is data?

Ackoff, R.L., 1989. From data to wisdom. Journal of applied systems analysis, 16(1), pp.3-9.

What is data?

Kinds of Data
1. Quantitative vs. Qualitative
2. Structured vs. Semi/unstructured
3. Measurement
  - Nominal/ordinal/interval/ratio

What is data?

Data generation
1. Made data vs. Found data
2. Structured vs. Semi/unstructured
3. Primary vs. secondary data
4. Derived data
  - Metadata, paradata

What is Big Data?

Big data refers to data of huge volume that cannot be stored on one computer, with wide variety in data types, locations, formats and form, and that is created very fast (velocity) (Doug Laney 2001).

What is Big Data?

Burt Monroe (2012)

5Vs of Big data

Volume
Variety
Velocity
Vinculation
Validity

The story of Google Flu Trends

Using big data from search queries, Google Flu Trends (GFT) predicted the rate of flu-like illness in a population.

The findings were published in the top journal Nature in 2008. However, GFT soon faltered: it missed the peak of the 2013 flu season by 140 percent.

The story of Google Flu Trends

Lazer, Kennedy, King and Vespignani (2014)

Traditional "small data" often offer information that is not contained (or containable) in big data, and "by combining GFT and lagged [traditional] CDC data, as well as dynamically recalibrating GFT... can substantially improve on the performance of GFT or the CDC alone." (Lazer et al. 2014, Science)

Google should have the greatest power in data access.

Why would it fail?

Why would it not fail yet?

Power=f(Data_{Size},Data_{Veracity},Data_{Speed})

Power=f(Data_{Veracity},Data_{Speed},Data_{Size})

Size still matters, but not first.

Data Methods

Survey
Experiments
Qualitative Data
Text Data
Web Data
Machine Data
Complex Data
1. Network Data
2. Multiple-source linked Data

[Diagram: braces group the methods above into Made Data and Found Data]

Data Methods

Small data, or made data, emphasize design
Big data, or found data, focus on algorithms

How are data generated?

Computers
Web
Mobile devices
IoT (Internet of Things)
Further extension of human users (e.g. AI, avatars)

Web data

How do we take advantage of web data?

Purpose of web data
Generation process of web data
What is data about data (metadata)?
Why do data scientists need to collect web data?

Data file formats

CSV (comma-separated values)
- CSVY with metadata (YAML)
JSON (JavaScript Object Notation)
XML (Extensible Markup Language)
Text (ASCII)
Tab-delimited data
Proprietary formats
- Stata
- SPSS
- SAS
- Database

YAML (originally "Yet Another Markup Language", now "YAML Ain't Markup Language") is a data-oriented, human-readable language mostly used for configuration files.
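
As a minimal sketch of how two of the plain-text formats above differ only in their delimiter, the same small table can be read with base R from comma-separated and tab-delimited text. The records and the names csv_text and tsv_text are hypothetical, invented for illustration.

```r
# Hypothetical records, invented for illustration only.
csv_text <- "country,year,cases
A,2020,10
B,2021,25"

tsv_text <- "country\tyear\tcases
A\t2020\t10
B\t2021\t25"

# read.csv() expects commas; read.delim() expects tabs.
# The text= argument parses a string instead of a file path.
from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)
from_tsv <- read.delim(text = tsv_text, stringsAsFactors = FALSE)

# Show the parsed table and check that both formats carry the same content.
print(from_csv)
print(all.equal(from_csv, from_tsv))
```

For a file on disk or on the web, the path or URL would go in the first argument instead of text=. Proprietary formats such as Stata, SPSS and SAS usually need an add-on package (for example haven) rather than base R.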

Open data

Open data refers to data typically offered by governments (e.g. Census), organizations or research institutions (e.g. ICPSR, Johns Hopkins Coronavirus Resource Center). Some require an application for access; others are freely accessible (usually via websites or GitHub).

Open data

Since open data are provided by government agencies or research institutions, these data files are often:

Structured
Well documented
Ready for data/research functions

API

API stands for Application Programming Interface. It is a web service that allows interactions with, and retrieval of, structured data from a company, organization or government agency.
Example:
- Social media (e.g. Facebook, YouTube, Twitter)
- Government agency (e.g. Congress)

APIs can take many different forms and be of varying quality and usefulness.
A RESTful API (REST: Representational State Transfer) transfers data using web protocols such as HTTP.
Example:
- Crossref API
  http://api.crossref.org/works/10.1093/nar/gni170
- Taiwan Legislative Yuan API
  https://www.ly.gov.tw/WebAPI/LegislativeBill.aspx?from=1050201&to=1050531&proposer=&mode=json
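
The Legislative Yuan example above shows the typical shape of a RESTful request: an endpoint followed by name=value query parameters. A minimal sketch in base R, assuming a hypothetical helper named build_url, reassembles that URL from its parts:

```r
# Build a REST-style request URL from an endpoint and named query parameters.
# build_url() is a hypothetical helper written for this illustration.
build_url <- function(endpoint, params) {
  # Percent-encode each value; an empty value stays empty.
  values <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), values, sep = "=", collapse = "&")
  paste0(endpoint, "?", query)
}

# Parameters copied from the Legislative Yuan example URL.
ly_url <- build_url(
  "https://www.ly.gov.tw/WebAPI/LegislativeBill.aspx",
  list(from = "1050201", to = "1050531", proposer = "", mode = "json")
)
print(ly_url)

# Assuming the jsonlite package is installed and the service is reachable,
# the JSON response could then be parsed with: jsonlite::fromJSON(ly_url)
```

Keeping the parameters in a named list makes it easy to change one of them, such as the date range, without editing the URL string by hand.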

API

Like open data, data available through API are generally:

Structured
Somewhat documented
Not necessarily fully open
Subject to the discretion of data providers
E.g. Not all variables are available, rules may change without announcements, etc.

For found data that are not available via API or open access, one can use non-API methods of collection. These methods include scraping, which simulates web browsing through automated scrolling and parsing to collect data. Such data are usually non-structured and often noisy. Researchers also have little control over the data generation process and sampling design.

Non-API methods

Non-API data are generally:

Non-structured
Noisy
Undocumented with no or little information on sampling
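
The parsing step of scraping can be sketched in base R: given the HTML of a page, pull out the pieces of interest. The HTML fragment below is hypothetical and stands in for a downloaded page. A regular expression is used only to keep the sketch dependency-free; real pages are usually parsed with an HTML parser such as the rvest package.

```r
# A hypothetical HTML fragment standing in for a downloaded page.
html <- '<ul>
  <li><a href="https://example.org/a">First</a></li>
  <li><a href="https://example.org/b">Second</a></li>
</ul>'

# Find every href="..." attribute, then strip the attribute wrapper.
matches <- regmatches(html, gregexpr('href="[^"]*"', html))[[1]]
links <- sub('^href="', "", sub('"$', "", matches))

print(links)
```

Because the page author controls the markup, a pattern like this can break whenever the site changes, which is one reason scraped data tend to be noisy.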

Illustration: Collecting stock data

This workshop demonstrates how to collect stock data using:

an R package called quantmod
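
A minimal sketch of what a quantmod download might look like, not the workshop's own script. It assumes the quantmod package is installed and Yahoo Finance is reachable; fetch_prices, the ticker "AAPL" and the start date are illustrative choices.

```r
# Download daily prices for one ticker and return them as an xts object.
# Assumes the quantmod package is installed and Yahoo Finance is reachable.
fetch_prices <- function(symbol, from = "2023-01-01") {
  quantmod::getSymbols(symbol, src = "yahoo", from = from, auto.assign = FALSE)
}

# Run the download only in an interactive session with quantmod available,
# so the script does not fail offline.
if (interactive() && requireNamespace("quantmod", quietly = TRUE)) {
  aapl <- fetch_prices("AAPL")       # "AAPL" is an illustrative ticker
  print(head(quantmod::Cl(aapl)))    # Cl() extracts the closing-price column
}
```

Setting auto.assign = FALSE makes getSymbols() return the data instead of creating a variable named after the ticker, which is easier to use inside a function.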

Link to RStudio Cloud:

https://rstudio.cloud/project/4631380

- Requires a GitHub account and an RStudio account

Link to class GitHub:

https://github.com/datageneration/dataprogrammingwithr

Illustration: Collecting COVID data

This workshop demonstrates how to collect COVID data using:

API methods
1. Johns Hopkins University Center for Systems Science and Engineering (CSSE) (map | GitHub)
2. Our World in Data (website | GitHub)
3. New York Times (GitHub)
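
Sources like these publish their data as CSV files on GitHub, so collection can be as simple as reading a URL. The sketch below uses base R only. The helper read_covid_csv is hypothetical, and the New York Times file location is an assumption that should be checked against the repository.

```r
# Read a COVID time series from a CSV source (a URL or a local file path)
# and convert the date column to R's Date class.
# Assumes the source has a column named "date" in YYYY-MM-DD format.
read_covid_csv <- function(source) {
  covid <- read.csv(source, stringsAsFactors = FALSE)
  covid$date <- as.Date(covid$date)
  covid
}

# Assumed location of the New York Times national file; verify it in the
# repository before relying on it.
nyt_url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv"

# Download only in an interactive session, so the script does not fail offline.
if (interactive()) {
  us <- read_covid_csv(nyt_url)
  print(tail(us))
}
```

Converting the date column up front means later steps, such as plotting daily deaths over time, can treat it as a proper time axis.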

Data: Total cases per million

Data: Daily COVID deaths

Data: Death data (Asia)

Data: Death data (Europe)

Data: COVID cases~predictors

Automated Machine Learning

Illustration: Scraping Twitter data

This workshop demonstrates how to collect Twitter data using:

API method (with Twitter developer account)
Non-API method (using Python-based twint)

Notebook
