Teaching Data Science with AI

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Academia Sinica Workshop

Speaker bio.

NCHU-UTD
Dual Degree Program in Data Science

UTD Partnerships in Taiwan (EPPS)

  • NCHU: MPA DDP
  • NCCU: Diplomacy (in progress)
  • NTU SPE: Student Exchange Mobility

Data: Daily COVID deaths

Wordcloud using YouTube data

Automated Machine Learning 

Analytics using Twitter data

Illustration: Collecting stock data 

Overview:

The rapid evolution of data science and artificial intelligence (AI) has reshaped the research landscape, creating new opportunities for innovation in teaching research methods. This workshop, titled “Teaching Data Science with AI,” aims to equip educators with the knowledge and tools to integrate data science and AI elements into their courses effectively.

In the beginning.....

This introductory course is an overview of Data Science.  Students will learn:

 

  1. What is Data Science?
  2. What is Big Data?
  3. How to equip yourself to become a data scientist
  4. Tools for professional data scientists

Prepare for class

Recommended software and IDEs

  1. R version 4.x (https://cran.r-project.org)
  2. RStudio version 2024.12.1+563 (https://posit.co/download/rstudio-desktop/)
  3. Python 3.10.x (Anaconda recommended)

Cloud websites/accounts:

  1. GitHub account (https://github.com)
  2. Posit Cloud (formerly RStudio Cloud) account (https://rstudio.cloud)

Optional software and IDEs:

Text editor of own choice (e.g. Visual Studio Code)

Ask me anything!

Overview:

  1. Why Data Science?  Why now?

  2. Data fluency (vs. Data literacy)

  3. Types of Data Science

  4. Data Science Roadmap

  5. Data Programming

  6. Data Acquisition

  7. Data Visualization

Why Data Science? Why now?

Data fluency

Hugo Bowne-Anderson. 2019. "What 300 L&D leaders have learned about building data fluency"

Data fluency

Everybody has the data skills and literacy to understand data-driven documents and carry out data-driven tasks.

Danger of immature data fluency

Types of Data Science

  1. Business intelligence (Descriptive analytics)
  2. Machine learning (Predictive analytics)
  3. Decision making (Prescriptive analytics)

Rogati AI hierarchy of needs

Data Science Roadmap

  1. Introduction - Data theory

  2. Data methods

  3. Statistics

  4. Programming

  5. Data Visualization

  6. Information Management

  7. Data Curation

  8. Spatial Models and Methods

  9. Machine Learning

  10. NLP/Text mining

What is data?

  1. Data generation

    1. Made data vs. Found data

    2. Structured vs. Semi/unstructured

    3. Primary vs. secondary data

    4. Derived data

      1. metadata, paradata

What is Big Data?

Big data is data whose volume is too large to fit on one computer, that comes in a wide variety of types, locations, formats and forms, and that is created very quickly (velocity) (Doug Laney 2001).

What is Big Data?

Burt Monroe (2012)

5Vs of Big data 

  • Volume

  • Variety

  • Velocity

  • Vinculation

  • Validity 

  • Programming is the practice of using a programming language to design, perform and evaluate tasks using a computer.  These tasks include:
    • Computation
    • Data collection
    • Data management
    • Data visualization
    • Data modeling
  • In this course, we focus on data programming, which emphasizes programs dealing with and evolving with data.

What is programming?

Data programming

  • Understand the differences between apparently similar constructs in different languages
  • Be able to choose a suitable programming language for each application
  • Enhance fluency in existing languages and ability to learn new languages
  • Application development

Why learn programming languages?

- Maribel Fernandez 2014

Language implementations

Compilation

Interpretation

  • Machine language
    • Assembly language
    • C

Low-level languages

  • BASIC
    • REALbasic
    • Visual Basic
  • C++
  • Objective-C
    • Mac
  • C#
    • Windows
  • Java

Systems languages

  • Perl
  • Tcl
  • JavaScript
  • Python

Scripting languages

DRY – Don’t Repeat Yourself

Write a function!

Function example

# Create preload function
# Check if a package is installed.
# If yes, load the library
# If no, install package and load the library

preload <- function(x) {
  x <- as.character(x)
  # require() returns FALSE (with a warning) if the package is not installed
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "https://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}

Why learn programming?

learning how to program can significantly enhance how social scientists can think about their studies, and especially those premised on the collection and analysis of digital data.

   

- Brooker 2019: 

Chances are the language you learn today will quite likely not be the language you'll be using tomorrow.

What is R?

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

- Venables, Smith and the R Core team

  • array
  • interpreted
  • impure
  • interactive mode
  • list-based
  • object-oriented (prototype-based)
  • scripting

R

What is R?

  • The R statistical programming language is a free, open source package based on the S language developed by John Chambers.

  • Some history of R and S

  • S was further developed into R by Robert Gentleman (Canada) and Ross Ihaka (NZ)

 

Source: Nick Thieme. 2018. R Generation: 25 years of R https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2018.01169.x 

What is R?

It is:

  • Large, probably one of the largest by number of user-written add-ons/procedures

  • Object-oriented

  • Interactive

  • Multiplatform: Windows, Mac, Linux

What is R?

According to John Chambers (2009), R has six facets:

  1. an interface to computational procedures of many kinds;

  2. interactive, hands-on in real time;

  3. functional in its model of programming;

  4. object-oriented, “everything is an object”;

  5. modular, built from standardized pieces; and,

  6. collaborative, a world-wide, open-source effort.

 

Why R?

  • A programming platform environment

  • Allows development of software/packages by users

  • As of 1/31/2018, the CRAN package repository featured 12,108 available packages.

  • Graphics!!!

  • Comparing R with other software?

 

Getting the software

 

What is Python?

  • Interpreted, high-level programming language

  • Invented by Dutch programmer Guido van Rossum

  • Named after the TV Show Monty Python's Flying Circus

  • Open-source programming language

 

 

 

  • Python history blog (http://python-history.blogspot.com)
  • First implementation: 1989
  • Developed at Centrum Wiskunde & Informatica (CWI) in the Netherlands
  • Managed by the not-for-profit Python Software Foundation, launched in March 2001
  • The foundation is responsible for various processes within the Python community, including developing the core Python distribution and organizing developer conferences such as PyCon

History

  • Flexibility and simplicity - easy to learn.

  • Community - provides a more standard programming language.

  • Suitability - higher level of abstraction than alternative languages traditionally used.

  • Multi-platform - Windows, macOS and Linux.

  • Libraries - modules that can be used to extend the basic features of the language.

  • Free ... and stable.

Why Python?

Hunt 2019

  • Borrow ideas from elsewhere whenever it makes sense.
  • “Things should be as simple as possible, but no simpler.” (Einstein)
  • Do one thing well (The "UNIX philosophy").
  • Don’t fret too much about performance--plan to optimize later when needed.
  • Don’t fight the environment and go with the flow.
  • Don’t try for perfection because “good enough” is often just that.
  • (Hence) it’s okay to cut corners sometimes, especially if you can do it right later.

Design Philosophy

  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
  • Special cases aren't special enough to break the rules.
  • Although practicality beats purity.
  • Errors should never pass silently.
  • Unless explicitly silenced.
  • In the face of ambiguity, refuse the temptation to guess.
  • There should be one-- and preferably only one --obvious way to do it.
  • Although that way may not be obvious at first unless you're Dutch.
  • Now is better than never.
  • Although never is often better than right now.
  • If the implementation is hard to explain, it's a bad idea.
  • If the implementation is easy to explain, it may be a good idea.
  • Namespaces are one honking great idea -- let's do more of those!

“A large complex system should have multiple levels of extensibility. This maximizes the opportunities for users, sophisticated or not, to help themselves.”

- Guido van Rossum

Resources

  • Python.org
  • Anaconda distribution
    • Miniconda gives the Python interpreter, with a command-line tool called conda which operates as a cross-platform package manager geared toward Python packages
    • Anaconda includes both Python and conda, and additionally bundles a suite of other pre-installed packages geared toward scientific computing such as Jupyter Notebook, Spyder and Orange.

Installations

  • Command line
  • IPython
  • IDEs

Running Python

IDEs

Choice of Integrated Development Environment matters!

There are plenty of IDEs available for Python programming and development.  To name a few:

  1. PyCharm
  2. Visual Studio Code
  3. IDLE
  4. Jupyter Notebook
  5. Google Colab (notebook on Google cloud)
  6. Spyder
  7. Rodeo (RStudio look-alike)
  8. RStudio (with reticulate)

Interpreter

Python is a scripting language, which means that your code is executed by a Python interpreter rather than compiled ahead of time into a standalone executable.

Therefore, choosing which interpreter to use for a project is an important decision.

Interpreter

Python 2.7 long ran many production applications.  However, it reached end of life on January 1, 2020 and no longer receives updates.

If you are new, you may still meet 2.7 in legacy code.  Yet, use 3.x!!

Python 2.7 and 3.x

  • Python 2 was launched in 2000 and still appears in legacy code.

  • Python 3 was launched in 2008 and is not backward compatible with Python 2.7.

  • Older versions of macOS had Python 2.7 built in; it was removed in macOS 12.3.

  • Recommended: Install Python 3 alongside any existing 2.7 rather than replacing it.

  • Check version:

python -V
python --version
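
The same check can be made from inside a script, which is useful when code must refuse to run under Python 2. This is a minimal sketch using only the standard library; the helper name require_python3 is hypothetical and introduced here for illustration.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Return the version as 'major.minor', or raise if it is older than 3."""
    major, minor = version_info[0], version_info[1]
    if major < 3:
        raise RuntimeError("Python 3 is required, found %d.%d" % (major, minor))
    return "%d.%d" % (major, minor)


# Report the interpreter that is running this script
print("Running Python", require_python3())
```

Passing a tuple such as (2, 7, 18) to the helper shows the failure path without needing an old interpreter.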

Python must-haves

  • NumPy - manipulation of homogeneous array-based data

  • Pandas - manipulation of heterogeneous and labeled data

  • SciPy - for common scientific computing tasks

  • Matplotlib - data visualizations

  • Scikit-Learn - machine learning

Workshop: Data Programming with GenAI

RStudio

RStudio is a user interface for the statistical programming software R.

  • Object-based environment

  • Window system

  • Point and click operations

  • Coding recommended                                   

  • Expansions and development

Posit Cloud:
https://posit.cloud/content/6625059

RStudio

Script window: You can store a document of the commands you used in R to reference later or repeat analyses.
Environment: Lists all of the objects.
Console: Output appears here. The > sign means R is ready to accept commands.
Plot/Help: Plots appear in this window. You can resize the window if plots appear too small or do not fit.

R Programming Basics

  • R code can be entered into the command line directly or saved to a script, which can then be run

  • Commands are separated either by a ; or by a newline.

  • R is case sensitive.

  • The # character starts a comment, which runs to the end of the line and is not executed.

  • Help can be accessed by preceding the name of the function with ? (e.g. ?plot).

Importing data

  • Can import from SPSS, Stata and text data files
    Use a package called foreign:
    First, install.packages("foreign") and load it with library(foreign); then you can use the following code to import data:

mydata <- read.csv("path", sep = ",", header = TRUE)
# read.spss and read.dta do not take sep or header arguments
mydata.spss <- read.spss("path", to.data.frame = TRUE)
mydata.dta <- read.dta("path")

Importing data

Note:

  • R is absolutely case-sensitive

  • On Windows, backslashes in a path must be doubled (e.g. "C:\\data\\file.csv"); forward slashes also work

  • Read data directly from GitHub:

happy <- read.csv("https://raw.githubusercontent.com/kho7/SPDS/master/R/happy.csv")

Accessing variables

To select a column use:

mydata$column

For example:

Manipulating variables

Recoding variables

For example (recode() here is from the car package):

mydata$Age.rec <- recode(mydata$Age, "18:19='18to19'; 20:29='20to29'; 30:39='30to39'")

Getting started

  • Start with a project

  • Why?
    • File management
    • History
    • Version control using git or svn
    • Read Jenny Bryan's advice
      • Start with a project and stick with it
      • Use the here package

"Beware of bugs in the above code; I have only proved it correct, not tried it."

 

- Donald Knuth, author of The Art of Computer Programming

Source: https://www.frontiersofknowledgeawards-fbbva.es/version/edition_2010/

Overview

In this module, we will help you:

  • Understand the data generation process in the big data age

  • Learn how to collect web data and social data

  • Illustration: Open data

    • collecting stock data

    • collecting COVID data

  • Illustration: API

Data Methods

  1. Survey

  2. Experiments

  3. Qualitative Data

  4. Text Data

  5. Web Data

  6. Machine Data

  7. Complex Data

    1. Network Data

    2. Multiple-source linked Data

Made Data

Found Data

Data Methods

  1. Small data or Made data emphasize design

  2. Big data or Found data focus on algorithms

How are data generated?

  • Computers

  • Web

  • Mobile devices

  • IoT (Internet of Things)

  • Further extension of human users (e.g. AI, avatars)

Web data

How do we take advantage of web data?

  1. Purpose of web data

  2. Generation process of web data

  3. What is data of data?

  4. Why do data scientists need to collect web data?

Data file formats

  • CSV (comma-separated values)

    • CSVY with metadata (YAML)
  • JSON (JavaScript Object Notation)

  • XML (Extensible Markup Language)

  • Text (ASCII)

  • Tab-delimited data

  • Proprietary formats

    • Stata
    • SPSS
    • SAS
    • Database

YAML (Yet Another Markup Language, or YAML Ain't Markup Language) is a data-oriented, human-readable language mostly used for configuration files.
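
To make the difference between two of these formats concrete, the sketch below reads a small CSV text and writes the same records as JSON. It uses only the Python standard library (YAML is left out because the standard library has no YAML parser). The sample records and column names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample records, for illustration only
csv_text = "country,year,cases\nTaiwan,2021,100\nGermany,2021,250\n"

# CSV: every value is read as text, so numeric columns must be converted
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row["year"] = int(row["year"])
    row["cases"] = int(row["cases"])

# JSON: numbers stay numbers when the data are written and read back
json_text = json.dumps(rows, indent=2)
restored = json.loads(json_text)

print(rows[0])
print(restored == rows)
```

The point of the round trip is that CSV carries no type information, while JSON keeps the distinction between text and numbers.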

Open data

Open data refers to the type of data usually offered by governments (e.g. Census), organizations or research institutions (e.g. ICPSR, Johns Hopkins Coronavirus Resource Center). Some may require an application for access and others may be open for free access (usually via websites or GitHub).

Open data

Since open data are provided by government agencies or research institutions, these data files are often:

  • Structured

  • Well documented

  • Ready for data/research functions

API

  • API stands for Application Programming Interface. It is a web service that allows interactions with, and retrieval of, structured data from a company, organization or government agency.

  • Examples:

    • Social media (e.g. Facebook, YouTube, Twitter)

    • Government agency (e.g. Congress)

API

Like open data, data available through API are generally:

  • Structured

  • Somewhat documented

  • Not necessarily fully open

  • Subject to the discretion of data providers

  • E.g. Not all variables are available, rules may change without announcements, etc.
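
A typical API interaction has two halves: building a request URL with query parameters, and parsing the structured (often JSON) response. The sketch below shows both without touching the network. The endpoint, the parameter names and the response body are all hypothetical; each real API documents its own.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; real APIs document their own URLs and parameters
BASE_URL = "https://api.example.org/v1/videos"


def build_request_url(base_url, **params):
    """Append query parameters (e.g. a search term and an API key) to the endpoint."""
    return base_url + "?" + urlencode(params)


def extract_titles(response_text):
    """Pull one field out of a JSON response; missing fields become None."""
    payload = json.loads(response_text)
    return [item.get("title") for item in payload.get("items", [])]


url = build_request_url(BASE_URL, q="data science", key="YOUR_API_KEY")
# Made-up response body standing in for what a server might return
sample_response = '{"items": [{"title": "Intro", "views": 10}, {"views": 3}]}'

print(url)
print(extract_titles(sample_response))
```

Using .get() rather than direct indexing reflects the caveat above: a provider may omit or rename variables without notice, so the code should tolerate missing fields.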

For found data not available via API or open access, one can use non-API methods.  These methods include scraping, which simulates web browsing through automated scrolling and parsing to collect data. These data are usually non-structured and often noisy.  Researchers also have little control over the data generation process and sampling design.

Non-API methods

Non-API data are generally:

  • Non-structured

  • Noisy

  • Undocumented, with little or no information on sampling
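
The parsing step of scraping can be sketched with the Python standard library alone. The example below pulls headline text out of a made-up HTML fragment and ignores the surrounding noise. The class name HeadlineParser and the page content are hypothetical; real pages need their own selection rules, and a scraper should respect a site's terms of use.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Text outside <h2> (ads, navigation, etc.) is skipped
        if self.in_h2:
            self.headlines[-1] += data


# Made-up page fragment standing in for downloaded HTML
page = "<div><h2> First story </h2><p>ad</p><h2>Second <b>story</b></h2></div>"

parser = HeadlineParser()
parser.feed(page)
parser.close()
headlines = [h.strip() for h in parser.headlines]
print(headlines)
```

Even in this tiny case the extracted text needs cleaning (stray whitespace, nested tags), which is the noise the list above refers to.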

Illustration: Collecting stock data 

This workshop demonstrates how to collect stock data using:

  1. an R package called quantmod

Link to RStudio Cloud:

https://posit.cloud/content/6625059

- Need a GitHub and RStudio Account

Link to class GitHub:

https://github.com/datageneration/nchu

Workshop II: Data collection

Assignment 1

  1. Install R and RStudio

  2. Download the R program from the class GitHub (under codes)

    1. DPR_stockdata.R

  3. Can you download stock data for TSM (TSMC, 台積電) for the last three years?

  4. Plot the TSM data for the last three years using the sample code (plot and ggplot2)

Illustration: Collecting COVID data 

This workshop demonstrates how to collect COVID data using:

  1. API methods
    1. Johns Hopkins University Center for Systems Science and Engineering (CSSE) (map | GitHub)
    2. Our World in Data (website | GitHub)
    3. New York Times (GitHub)

Data: Total cases per million

Data: Daily COVID deaths

Data: Death data (Asia)

Data: Death data (Europe)

Data: COVID cases~predictors

Automated Machine Learning 

Assignment 2

  1. Download the R program from the class GitHub (under codes)

    1. DPR_coviddata1.R

  2. Can you download COVID data for Taiwan and Germany for the last three years?

  3. Plot the data using the sample code (use plot and ggplot2 functions)

Assignment 3

  1. Download the R program from the class GitHub (under codes)

    1. DPR_caret01.R

  2. Can you predict the chance of Tsai winning when including the additional variable "indep" (support for Taiwan's independence)?

  3. What is the new accuracy?  Is it better or worse?

Assignment 4 (optional for AP)

  1. Download the R program from the class GitHub (under codes)

    1. DPR_tuber01.R

  2. Can you download channel and video data from “中天新聞” and “關鍵時刻”?

  3. Can you create word clouds for a selected video from each channel?

Illustration: Scraping YouTube data 

This workshop demonstrates how to collect YouTube data using:

  1. API method (with Google developer account)

Illustration: Collecting YouTube data 

This workshop demonstrates how to collect YouTube data using Google API:

Link to RStudio Cloud:

https://rstudio.cloud/project/4631380

- Need a GitHub and RStudio Account

Link to class GitHub:

https://github.com/datageneration/nchu

Wordcloud using YouTube data

Illustration: Scraping Twitter data 

This workshop demonstrates how to collect Twitter data using:

  1. API method (with Twitter developer account)
  2. Non-API method (using Python-based twint)

 

Analytics using Twitter data

Ask me anything!

Academia Sinica: Teaching Data Science with AI

By Karl Ho
