Introduction to Data Science

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

NCHU+UT Dallas Data Science Short Course Series

Speaker bio.

NCHU-UTD
Dual Degree Program in Data Science

Illustration: Collecting stock data 

Data: Daily COVID deaths

Wordcloud using YouTube data

Automated Machine Learning 

Analytics using Twitter data

Overview:

The certificate program is composed of short courses introducing students to data science and its applications.  Each course is delivered in three hours, giving an overview and survey of a subfield of data science with illustrations and hands-on practice.   Students should follow the pre-class instructions to prepare materials and their own devices before coming to class.

Who is this course for?

  • Regular track:  students with any level of background (including none) in data programming and statistics

  • Advanced Progress (AP) track:  students with some to intermediate background in data programming and statistical modeling

Course structure

  • In 30-minute segments

    • Lecture:

    • Workshop: Data Programming

    • Lecture 

    • Workshop: Data Collection

    • Lecture 

    • Workshop: Data Visualization and Modeling (Machine Learning)

Pre-class preparation

  • Bring your own device (Windows 10 or macOS; no tablets)
  • All software/applications used in this class are open source
  • Programming in cloud platforms (RStudio Cloud, Google Colab)
  • Recommended accounts: GitHub 

In the beginning.....

This introductory course is an overview of Data Science.  Students will learn:

 

  1. What is Data Science?
  2. What is Big Data?
  3. How to equip yourself as a data scientist
  4. Tools for professional data scientists

Prepare for class

Recommended software and IDEs

  1. R version 4.x (https://cran.r-project.org)
  2. RStudio version 2023.06.x (https://posit.co/download/rstudio-desktop/)

 

Cloud websites/accounts:

  1. GitHub account (https://github.com)
  2. RStudio Cloud account (https://rstudio.cloud)

 

Optional software and IDEs:

Text editor of your own choice (e.g. Visual Studio Code, Sublime Text, Brackets)

Ask me anything!

Overview:

  1. Why Data Science?  Why now?

  2. Data fluency (vs. Data literacy)

  3. Types of Data Science

  4. Data Science Roadmap

  5. Data Programming

  6. Data Acquisition

  7. Data Visualization

Why Data Science? Why now?

Data fluency

Hugo Bowne-Anderson. 2019. "What 300 L&D leaders have learned about building data fluency"

Data fluency

Everybody has the data skills and literacy to understand and carry out data-driven documents and tasks

Danger of immature data fluency

Types of Data Science

  1. Business intelligence (Descriptive analytics)
  2. Machine learning (Predictive analytics)
  3. Decision making (Prescriptive analytics)

Rogati's AI hierarchy of needs

Data Science Roadmap

  1. Introduction - Data theory

  2. Data methods

  3. Statistics

  4. Programming

  5. Data Visualization

  6. Information Management

  7. Data Curation

  8. Spatial Models and Methods

  9. Machine Learning

  10. NLP/Text mining

What is data?

What is data?

  1. Data generation

    1. Made data vs. Found data

    2. Structured vs. Semi/unstructured

    3. Primary vs. secondary data

    4. Derived data

      1. metadata, paradata

What is Big Data?

Big data is data of such huge volume that it cannot fit on one computer, has great variety in data types, locations, formats, and forms, and is created at very high speed (velocity) (Doug Laney 2001).

What is Big Data?

Burt Monroe (2012)

5Vs of Big data 

  • Volume

  • Variety

  • Velocity

  • Vinculation

  • Validity 

  • Programming is the practice of using a programming language to design, perform, and evaluate tasks on a computer.  These tasks include:
    • Computation
    • Data collection
    • Data management
    • Data visualization
    • Data modeling
  • In this course, we focus on data programming, which emphasizes programs dealing with and evolving with data.

What is programming?

Data programming


  • Understand the differences between apparently similar constructs in different languages
  • Be able to choose a suitable programming language for each application
  • Enhance fluency in existing languages and ability to learn new languages
  • Application development

Why learn programming languages?

- Maribel Fernandez 2014

Language implementations

Compilation

Interpretation

  • Machine language
    • Assembly language
    • C

Low-level languages

  • BASIC
    • REALbasic
    • Visual Basic
  • C++
  • Objective-C
    • Mac
  • C#
    • Windows
  • Java

Systems languages

  • Perl
  • Tcl
  • JavaScript
  • Python

Scripting languages

DRY – Don’t Repeat Yourself

Write a function!

Function example

# Create preload function
# Check if a package is installed.
# If yes, load the library
# If no, install package and load the library

preload <- function(x) {
  x <- as.character(x)
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
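
A quick usage example (any CRAN package name works here; these two are just illustrations):

preload("ggplot2")   # installs ggplot2 if missing, then loads it
preload("quantmod")  # same for quantmod, used in the stock data workshop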

Why learning programming?

learning how to program can significantly enhance how social scientists can think about their studies, and especially those premised on the collection and analysis of digital data.

   

- Brooker 2019

Chances are the language you learn today will not be the language you'll be using tomorrow.

What is R?

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

- Venables, Smith and the R Core team

  • array
  • interpreted
  • impure
  • interactive mode
  • list-based
  • object-oriented (prototype-based)
  • scripting

R

What is R?

  • The R statistical programming language is a free, open source package based on the S language developed by John Chambers.

  • Some history of R and S

  • S was further developed into R by Robert Gentleman (Canada) and Ross Ihaka (NZ)

 

Source: Nick Thieme. 2018. R Generation: 25 years of R https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2018.01169.x 

What is R?

 

What is R?

It is:

  • Large, probably one of the largest, based on user-written add-ons/procedures

  • Object-oriented

  • Interactive

  • Multiplatform: Windows, Mac, Linux

What is R?

According to John Chambers (2009), six facets of R :

  1. an interface to computational procedures of many kinds;

  2. interactive, hands-on in real time;

  3. functional in its model of programming;

  4. object-oriented, “everything is an object”;

  5. modular, built from standardized pieces; and,

  6. collaborative, a world-wide, open-source effort.

 

Why R?

  • A programming platform environment

  • Allows development of software/packages by users

  • The CRAN package repository features 12,108 available packages (as of 1/31/2018)

  • Graphics!!!

  • Comparing R with other software?

 

Getting the software

 

Workshop I: Data Programming

RStudio

RStudio is a user interface for the statistical programming software R.

  • Object-based environment

  • Window system

  • Point and click operations

  • Coding recommended                                   

  • Expansions and development

Posit Cloud:
https://posit.cloud/content/6625059

RStudio

  • Script window: stores a document of commands you used in R to reference later or repeat analyses
  • Environment: lists all of the objects
  • Console: output appears here; the > sign means R is ready to accept commands
  • Plot/Help: plots appear in this window; you can resize the window if plots appear too small or do not fit


R Programming Basics

  • R code can be entered at the command line directly or saved to a script, which can be run later

  • Commands are separated either by a ; or by a newline.

  • R is case sensitive.

  • The # character at the beginning of a line signifies a comment, which is not executed.

  • Help can be accessed by preceding the name of the function with ? (e.g. ?plot).
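
A minimal sketch of these basics at the console (the object name x is arbitrary):

# Lines starting with # are comments and are not executed
x <- c(1, 2, 3)        # assign a numeric vector to x
mean(x); sd(x)         # two commands on one line, separated by ;
# Mean(x) would fail: R is case sensitive, so Mean is not mean
?plot                  # open the help page for plot()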

Importing data

  • Can import from SPSS, Stata and text data files
    Use a package called foreign:
    First, run install.packages("foreign") and library(foreign), then you can use the following code to import data:

     

mydata <- read.csv("path", sep = ",", header = TRUE)
mydata.spss <- read.spss("path", to.data.frame = TRUE)   # foreign::read.spss
mydata.dta <- read.dta("path")                           # foreign::read.dta

Importing data

Note:

  • R is absolutely case-sensitive

  • On Windows, use forward slashes or doubled backslashes in file paths (e.g. "C:/data" or "C:\\data")

  • Read data directly from GitHub:

happy <- read.csv("https://raw.githubusercontent.com/kho7/SPDS/master/R/happy.csv")

Accessing variables

To select a column use:

mydata$column

For example:
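
For instance, with the happy data imported above (the column name age is hypothetical; run names(happy) to see the actual columns):

names(happy)        # list the column names
happy$age           # select one column (hypothetical name)
summary(happy$age)  # summarize it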

Manipulating variables

Recoding variables

For example:

# recode() here comes from the car package: install.packages("car"); library(car)
mydata$Age.rec <- recode(mydata$Age, "18:19='18to19'; 20:29='20to29'; 30:39='30to39'")

Getting started

  • Start with a project

  • Why?
    • File management
    • History
    • Version control using git or svn
    • Read Jenny Bryan's advice
      • Start with a project and stick with it
      • Use the here package
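
A minimal sketch with the here package (assuming the project contains a data/ folder; folder and file names are hypothetical):

# install.packages("here")
library(here)
here()                                        # prints the project root directory
happy <- read.csv(here("data", "happy.csv"))  # build paths relative to the project root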

"Beware of bugs in the above code; I have only proved it correct, not tried it."

 

- Donald Knuth, author of The Art of Computer Programming

Source: https://www.frontiersofknowledgeawards-fbbva.es/version/edition_2010/

Overview

In this module, we will help you:

  • Understand the data generation process in the big data age

  • Learn how to collect web data and social data

  • Illustration: Open data

    • collecting stock data

    • collecting COVID data

  • Illustration: API

Data Methods

  1. Survey

  2. Experiments

  3. Qualitative Data

  4. Text Data

  5. Web Data

  6. Machine Data

  7. Complex Data

    1. Network Data

    2. Multiple-source linked Data

(Diagram: the data methods above grouped into Made data vs. Found data)

Data Methods

  1. Small data (Made data) emphasize design

  2. Big data (Found data) focus on algorithms

How are data generated?

  • Computers

  • Web

  • Mobile devices

  • IoT (Internet of Things)

  • Further extension of human users (e.g. AI, avatars)

Web data

How do we take advantage of the web data?

  1. Purpose of web data

  2. Generation process of web data

  3. What are data about data (metadata)?

  4. Why do data scientists need to collect web data?

Data file formats

  • CSV (comma-separated values)

    • CSVY with metadata (YAML)
  • JSON (JavaScript Object Notation)

  • XML (Extensible Markup Language)

  • Text (ASCII)

  • Tab-delimited data

  • Proprietary formats

    • Stata
    • SPSS
    • SAS
    • Database

YAML (Yet Another Markup Language, or YAML Ain't Markup Language) is a data-oriented, human-readable language mostly used for configuration files.
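
A short sketch of reading some of these formats in R (jsonlite, yaml, and haven are common choices; the file names are hypothetical):

library(jsonlite)                       # JSON
library(yaml)                           # YAML
library(haven)                          # Stata/SPSS/SAS proprietary formats

csv_df   <- read.csv("data.csv")        # comma-separated values
json_df  <- fromJSON("data.json")       # JSON -> list or data frame
config   <- read_yaml("config.yaml")    # YAML -> list
stata_df <- read_dta("data.dta")        # Stata file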

Open data

Open data refers to data usually offered by governments (e.g. the Census), organizations, or research institutions (e.g. ICPSR, Johns Hopkins Coronavirus Resource Center). Some may require an application for access and others may be open for free access (usually via websites or GitHub).

Open data

Since open data are provided by government agencies or research institutions, these data files are often:

  • Structured

  • Well documented

  • Ready for data/research functions

API

  • API stands for Application Programming Interface. It is a web service that allows interactions with, and retrieval of, structured data from a company, organization or government agency.

  • Example:

    • Social media (e.g. Facebook, YouTube, Twitter)

    • Government agency (e.g. Congress)

API

API

Like open data, data available through API are generally:

  • Structured

  • Somewhat documented

  • Not necessarily fully open

  • Subject to the discretion of data providers

  • e.g. not all variables may be available, and rules may change without announcement
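
A minimal sketch of calling an API from R with the httr and jsonlite packages (the endpoint and key below are placeholders, not a real service):

library(httr)
library(jsonlite)

resp <- GET("https://api.example.com/v1/data",          # placeholder endpoint
            query = list(format = "json", key = "YOUR_API_KEY"))
stop_for_status(resp)                                    # stop on HTTP errors
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(parsed)                                              # inspect the returned structure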

For the type of found data not available via API or open access, one can use non-API methods to collect it.  These methods include scraping, which simulates web browsing through automated crawling and parsing to collect data. Such data are usually unstructured and oftentimes noisy, and researchers have little control over the data generation process and sampling design.

Non-API methods

Non-API methods

Non-API data are generally:

  • Non-structured

  • Noisy

  • Undocumented, with little or no information on sampling
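
A minimal scraping sketch with the rvest package (the URL and CSS selector are placeholders; check a site's terms and robots.txt before scraping):

library(rvest)

page  <- read_html("https://example.com/news")            # download and parse the page
heads <- html_text2(html_elements(page, "h2.headline"))   # placeholder CSS selector
head(heads)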

Illustration: Collecting stock data 

This workshop demonstrates how to collect stock data using:

  1. an R package called quantmod
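
A minimal quantmod sketch (the ticker and date range are examples; DPR_stockdata.R in the class GitHub is the full workshop version):

library(quantmod)

getSymbols("TSM", src = "yahoo",
           from = "2020-01-01", to = "2023-01-01")  # creates an xts object named TSM
head(TSM)
chartSeries(TSM, theme = "white")                   # quick candlestick chart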


Link to RStudio Cloud:

https://posit.cloud/content/6625059

- Need a GitHub and RStudio Account

Link to class GitHub:

https://github.com/datageneration/nchu

Workshop II: Data collection

Assignment 1

  1. Install R and RStudio

  2. Download the R program from the class GitHub (under codes)

    1. DPR_stockdata.R

  3. Can you download TSM's (台積電) stock data for the last three years?

  4. Plot the TSM data for the last three years using the sample code (plot and ggplot2)

Illustration: Collecting COVID data 

This workshop demonstrates how to collect COVID data using:

  1. API methods
    1. Johns Hopkins University Center for Systems Science and Engineering (CSSE) (map | GitHub)
    2. Our World in Data (website | GitHub)
    3. New York Times (GitHub)
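
A minimal sketch of pulling the Our World in Data file straight from GitHub (the URL reflects the repository layout at the time of writing and may have moved):

owid_url <- "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
covid <- read.csv(owid_url)
tw <- subset(covid, location == "Taiwan")           # filter one country
plot(as.Date(tw$date), tw$new_deaths, type = "l",
     xlab = "Date", ylab = "Daily COVID deaths")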

Data: Total cases per million

Data: Daily COVID deaths

Data: Death data (Asia)

Data: Death data (Europe)

Data: COVID cases~predictors

Data: COVID cases~predictors

Automated Machine Learning 

Automated Machine Learning 

Automated Machine Learning 

Assignment 2

  1. Download the R program from the class GitHub (under codes)

    1. DPR_coviddata1.R

  2. Can you download Taiwan and Germany COVID data for the last three years?

  3. Plot the data using the sample code (use the plot and ggplot2 functions)

Assignment 3

  1. Download the R program from the class GitHub (under codes)

    1. DPR_caret01.R

  2. Can you predict the chance of Tsai winning by including the additional variable "indep" (support for Taiwan's independence)?

  3. What is the new accuracy?  Better or worse? 
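
A hedged sketch of the kind of caret workflow Assignment 3 builds on (the data set, outcome, and predictor names here are hypothetical; DPR_caret01.R has the actual ones):

library(caret)

set.seed(42)
mydata$vote_tsai <- factor(mydata$vote_tsai)                 # hypothetical two-level outcome
idx <- createDataPartition(mydata$vote_tsai, p = 0.8, list = FALSE)
train_df <- mydata[idx, ]
test_df  <- mydata[-idx, ]

fit  <- train(vote_tsai ~ age + party_id + indep,            # add "indep" as the assignment asks
              data = train_df, method = "glm")               # logistic regression for a 2-level factor
pred <- predict(fit, newdata = test_df)
confusionMatrix(pred, test_df$vote_tsai)                     # compare accuracy with and without indep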

Assignment 4 (optional for AP)

  1. Download the R program from the class GitHub (under codes)

    1. DPR_tuber01.R

  2. Can you download channel and video data from “中天新聞” and “關鍵時刻”?

  3. Can you create wordclouds for selected videos from each channel?

Illustration: Scraping YouTube data 

This workshop demonstrates how to collect YouTube data using:

  1. API method (with Google developer account)
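
A minimal sketch with the tuber package, which wraps the YouTube Data API (the client ID/secret come from your Google developer account; the video ID is a placeholder; DPR_tuber01.R is the class version):

library(tuber)

yt_oauth(app_id = "YOUR_CLIENT_ID", app_secret = "YOUR_CLIENT_SECRET")  # opens a browser to authorize
stats    <- get_stats(video_id = "XXXXXXXXXXX")         # views, likes, comments for one video
comments <- get_all_comments(video_id = "XXXXXXXXXXX")  # comment data frame, usable for wordclouds
str(comments)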

Illustration: Collecting YouTube data 

This workshop demonstrates how to collect YouTube data using Google API:

Link to RStudio Cloud:

https://rstudio.cloud/project/4631380

- Need a GitHub and RStudio Account

Link to class GitHub:

https://github.com/datageneration/nchu

Wordcloud using YouTube data

Wordcloud using YouTube data

Illustration: Scraping Twitter data 

This workshop demonstrates how to collect Twitter data using:

  1. API method (with Twitter developer account)
  2. Non-API method (using Python-based twint)
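
For the API route in R, one common option (not named in the slides) is the rtweet package; note that Twitter/X API access rules have changed repeatedly, so a developer token and possibly a paid tier are required. The search term is a placeholder:

library(rtweet)

tweets <- search_tweets("#datascience", n = 500, include_rts = FALSE)  # recent tweets via the API
head(tweets$text)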

 

Analytics using Twitter data

Analytics using Twitter data

Ask me anything!