Data Acquisition

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Data Programming with R

Overview

In this module, we will help you:

  • Understand the data generation process in the big data age

  • Learn how to collect web data and social data

  • Illustration: Open data

    • collecting stock data

    • collecting COVID data

  • Illustration: API

What is data?

Ackoff, R.L., 1989. From data to wisdom. Journal of applied systems analysis, 16(1), pp.3-9.

What is data?

  1. Kinds of Data

    1. Quantitative vs. Qualitative

    2. Structured vs. Semi/unstructured

    3. Measurement

      • Nominal/ordinal/interval/ratio
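
The four measurement levels map onto R data types roughly as follows. This is a minimal sketch; all variable names and values are made up for illustration.

```r
# Nominal: categories with no inherent order
party <- factor(c("Dem", "Rep", "Ind", "Dem"))

# Ordinal: categories with a meaningful order
interest <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

# Interval: differences are meaningful, zero is arbitrary (e.g. degrees Celsius)
temp_c <- c(20, 25, 30)

# Ratio: a true zero exists, so ratios are meaningful (e.g. income)
income <- c(30000, 60000)

is.ordered(party)            # nominal factors carry no order
interest[1] < interest[2]    # comparisons only work on ordered factors
income[2] / income[1]        # "twice as much" makes sense for ratio data
```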

What is data?

  1. Data generation

    1. Made data vs. Found data

    2. Structured vs. Semi/unstructured

    3. Primary vs. secondary data

    4. Derived data

      1. metadata, paradata

What is Big Data?

Big data is data of huge volume that cannot fit on one computer. It comes in a wide variety of data types, locations, formats and forms, and it is created very fast (velocity) (Doug Laney 2001).

What is Big Data?

Burt Monroe (2012)

5Vs of Big data 

  • Volume

  • Variety

  • Velocity

  • Vinculation

  • Validity 

The story of Google Flu Trends

Using big data from search queries, Google Flu Trends (GFT) predicted the rate of flu-like illness in a population.

The findings were published in the top journal Nature in 2008. However, GFT soon failed: it missed the peak of the 2013 flu season by 140 percent.

The story of Google Flu Trends

Lazer, Kennedy, King and Vespignani (2014)

Traditional "small data" often offer information that is not contained (or containable) in big data, and "by combining GFT and lagged [traditional] CDC data, as well as dynamically recalibrating GFT... can substantially improve on the performance of GFT or the CDC alone." (Lazer et al. 2014, Science)

Google should have the greatest power in data access.

Why would it fail?

Why would it not fail yet?

Power = f(Data_{Size}, Data_{Veracity}, Data_{Speed})

Power = f(Data_{Veracity}, Data_{Speed}, Data_{Size})

Size still matters, but not first.

Data Methods

  1. Survey

  2. Experiments

  3. Qualitative Data

  4. Text Data

  5. Web Data

  6. Machine Data

  7. Complex Data

    1. Network Data

    2. Multiple-source linked Data

[Figure: braces group the methods above into Made Data and Found Data]

Data Methods

  1. Small data, or Made data, emphasize design

  2. Big data, or Found data, focus on algorithms

How are data generated?

  • Computers

  • Web

  • Mobile devices

  • IoT (Internet of Things)

  • Further extension of human users (e.g. AI, avatars)

Web data

How do we take advantage of web data?

  1. Purpose of web data

  2. Generation process of web data

  3. What is data of data?

  4. Why do data scientists need to collect web data?

Data file formats

  • CSV (comma-separated values)

    • CSVY with metadata (YAML)
  • JSON (JavaScript Object Notation)

  • XML (Extensible Markup Language)

  • Text (ASCII)

  • Tab-delimited data

  • Proprietary formats

    • Stata
    • SPSS
    • SAS
    • Database

YAML (Yet Another Markup Language, or YAML Ain't Markup Language) is a data-oriented, human-readable language mostly used for configuration files.
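
To make the formats concrete, the sketch below reads the same two records as CSV, tab-delimited text and JSON. The records are hypothetical and written inline so that no external file is needed. JSON parsing assumes the add-on package jsonlite is installed, which is one common choice rather than the only one.

```r
# Hypothetical records, written inline so the example needs no external files
csv_text <- "country,cases\nUS,100\nUK,50"
tsv_text <- "country\tcases\nUS\t100\nUK\t50"

csv_df <- read.csv(text = csv_text)      # comma-separated values
tsv_df <- read.delim(text = tsv_text)    # tab-delimited data

# JSON is not handled by base R; jsonlite is assumed to be available here
if (requireNamespace("jsonlite", quietly = TRUE)) {
  json_text <- '[{"country":"US","cases":100},{"country":"UK","cases":50}]'
  json_df <- jsonlite::fromJSON(json_text)   # array of objects -> data frame
}

str(csv_df)   # inspect column names and types
```

Stata, SPSS and SAS files are usually read with an add-on package as well, for example haven (read_dta(), read_sav(), read_sas()).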

Open data

Open data refers to data usually offered by governments (e.g. Census), organizations or research institutions (e.g. ICPSR, Johns Hopkins Coronavirus Resource Center). Some sources require an application for access, while others are freely accessible (usually via websites or GitHub).

Open data

Since open data are provided by government agencies or research institutions, these data files are often:

  • Structured

  • Well documented

  • Ready for data/research functions

API

  • API stands for Application Programming Interface. It is a web service that allows interactions with, and retrieval of, structured data from a company, organization or government agency.

  • Examples:

    • Social media (e.g. Facebook, YouTube, Twitter)

    • Government agency (e.g. Congress)

API

Like open data, data available through API are generally:

  • Structured

  • Somewhat documented

  • Not necessarily fully open

  • Subject to the discretion of data providers

  • E.g. Not all variables are available, rules may change without announcements, etc.
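
A typical API workflow is to build a request URL from a base endpoint plus query parameters, then parse the structured (often JSON) response into a data frame. The sketch below is illustrative only: the endpoint, parameters and response are hypothetical, and the response is hand-written so the code runs offline. It assumes the jsonlite package.

```r
library(jsonlite)

# Hypothetical endpoint and parameters, for illustration only
base_url <- "https://api.example.org/v1/bills"
params <- list(congress = 117, format = "json")

# Build the query string: key=value pairs joined by "&"
query <- paste(names(params), unlist(params), sep = "=", collapse = "&")
request_url <- paste0(base_url, "?", query)

# In practice the response would come from fromJSON(request_url);
# a hand-written response stands in here so the sketch runs offline
response <- '{"results":[{"id":"hr1","title":"Example Act"},
                         {"id":"hr2","title":"Another Act"}]}'
bills <- fromJSON(response)$results   # array of objects becomes a data frame
```

Real APIs usually also require an access key and enforce rate limits, which is part of what "subject to the discretion of data providers" means in practice.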

Non-API methods

For found data that are not available via API or open access, one can use non-API methods of collection. These methods include scraping, which simulates web browsing through automated scrolling and parsing to collect data. Such data are usually non-structured and often noisy. Researchers also have little control over the data generation process and sampling design.

Non-API methods

Non-API data are generally:

  • Non-structured

  • Noisy

  • Undocumented, with little or no information on sampling
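
The parsing step of scraping can be sketched with the rvest package (an assumption; other tools work too). The page source below is hypothetical and supplied as a string so the example runs offline. A real scraper would pass a URL to read_html() and would have to cope with far messier markup.

```r
library(rvest)

# Hypothetical page source; a real scraper would call read_html() on a URL
html <- '<html><body>
  <div class="post"><span class="user">ann</span><p>First post</p></div>
  <div class="post"><span class="user">bob</span><p>Second post</p></div>
</body></html>'

page  <- read_html(html)
posts <- html_elements(page, "div.post")   # one node per post

# Pull one field per post using CSS selectors
scraped <- data.frame(
  user = html_text2(html_element(posts, "span.user")),
  text = html_text2(html_element(posts, "p"))
)
```

Because the selectors depend on the page layout, a small change to the site can silently break the scraper, one reason non-API data tend to be noisy.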

Illustration: Collecting stock data 

This workshop demonstrates how to collect stock data using:

  1. an R package called quantmod
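
A minimal sketch of what collecting stock data with quantmod might look like. The ticker, start date and Yahoo Finance source are illustrative assumptions, not the workshop's actual settings, and get_prices() is a hypothetical helper name.

```r
library(quantmod)

# Hypothetical helper: download daily prices for one ticker as an xts object.
# The default start date and the Yahoo Finance source are illustrative choices.
get_prices <- function(symbol, from = "2022-01-01") {
  getSymbols(symbol, src = "yahoo", from = from, auto.assign = FALSE)
}

# Needs an internet connection, so it is only run in an interactive session
if (interactive()) {
  aapl <- get_prices("AAPL")
  print(head(Cl(aapl)))   # closing prices
  chartSeries(aapl)       # price chart
}
```

Setting auto.assign = FALSE makes getSymbols() return the data instead of creating an object in the workspace, which keeps the helper free of side effects.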

Link to RStudio Cloud:

https://rstudio.cloud/project/4631380

- Requires a GitHub account and an RStudio account

Link to class GitHub:

https://github.com/datageneration/dataprogrammingwithr

 

Illustration: Collecting COVID data 

This workshop demonstrates how to collect COVID data using:

  1. API methods
    1. Johns Hopkins University Center for Systems Science and Engineering (CSSE) (map | GitHub)
    2. Our World in Data (website | GitHub)
    3. New York Times (GitHub)
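
These sources publish their data as CSV files, which R can read directly from a URL with read.csv(). The sketch below uses a small hypothetical data frame in place of a download, with made-up column names and counts, to show how a measure such as total cases per million is derived.

```r
# Hypothetical counts, for illustration only (not real figures)
covid <- data.frame(
  location    = c("A", "B"),
  total_cases = c(5000, 120000),
  population  = c(1e6, 6e7)
)

# Cases per million = cases / population * 1,000,000
covid$cases_per_million <- covid$total_cases / covid$population * 1e6

# With a real source, the data frame would instead come from something like
# covid <- read.csv("<raw URL of the published CSV file>")
covid
```

Normalizing by population is what makes countries of very different sizes comparable on one chart.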

Data: Total cases per million

Data: Daily COVID deaths

Data: Death data (Asia)

Data: Death data (Europe)

Data: COVID cases~predictors

Automated Machine Learning 

Illustration: Scraping Twitter data 

This workshop demonstrates how to collect Twitter data using:

  1. API method (with Twitter developer account)
  2. Non-API method (using Python-based twint)

 

Analytics using Twitter data
