ISDSA Workshop: Gentle Introduction to Machine Learning 1/4

Karl Ho is:
- Associate Professor of Instruction at University of Texas at Dallas (UTD) School of Economic, Political and Policy Sciences (EPPS)
- Co-founder of the UTD Social Data Analytics and Research program (SDAR)
- Founder of DataGeneration.org
- Author of Data Programming
- Co-Principal Investigator of the Hong Kong Election Study project
- Website: karlho.com (talks, lectures, publications)


Overview

What is Machine Learning?
Statistics and Machine Learning
Statistical modeling: the two cultures
Conventional statistical methods and machine learning
Statistical/Machine Learning methods
- Supervised Learning
- Unsupervised Learning
- Deep Learning

What is machine learning?

The ultimate goal of data modeling is to explain and predict a variable of interest using data. Machine learning pursues this goal with computer algorithms, which make the predictions and solve the problem at hand.

According to Carnegie Mellon computer science professor Tom M. Mitchell,

"Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience."

What is machine learning?

To statisticians, the "improve through experience" part is the process of validation or cross-validation. Learning happens through repeated exercises: a model is fit and evaluated again and again on different portions of the data.
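As a rough illustration of cross-validation, the sketch below splits simulated data into k folds, fits a one-predictor least-squares line on the remaining folds, and scores it on the held-out fold. The data, the helper names (`fit_line`, `k_fold_mse`) and the choice of k = 5 are assumptions made for this example, not part of the workshop material.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def k_fold_mse(xs, ys, k=5, seed=0):
    """Return the held-out mean squared error for each of the k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Estimate on the training folds only
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Evaluate on data the model has not seen
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        scores.append(mean(errors))
    return scores


# Simulated data (an assumption for illustration): y = 2 + 3x + noise
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

fold_mse = k_fold_mse(xs, ys, k=5)
print("MSE per fold:", [round(m, 2) for m in fold_mse])
print("Cross-validated MSE:", round(mean(fold_mse), 2))
```

Averaging the per-fold errors gives an estimate of how the model would perform on new data, which is the sense in which the procedure measures "improvement through experience".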

What is machine learning?

Machine learning involves having computer or statistics programs run repeated estimations, much as humans learn from experience to improve their actions and decisions. This is called the training process in machine learning.
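One way to picture training as repeated estimation is gradient descent: the program starts from a poor guess and revises its estimates a little on every pass over the data. The following is a minimal sketch under assumed settings (a toy data set, a learning rate of 0.05, 2000 passes); the function name `train` is hypothetical.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = a + b * x by gradient descent on mean squared error."""
    a, b = 0.0, 0.0  # initial guess
    n = len(xs)
    history = []  # loss recorded before each update
    for _ in range(epochs):
        residuals = [(a + b * x) - y for x, y in zip(xs, ys)]
        history.append(sum(r * r for r in residuals) / n)
        # Gradients of the mean squared error with respect to a and b
        grad_a = 2 * sum(residuals) / n
        grad_b = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        # Move the estimates a small step in the direction that lowers the loss
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b, history


# Toy data generated from y = 1 + 2x (an assumption for illustration)
xs = [0, 1, 2, 3, 4, 5]
ys = [1 + 2 * x for x in xs]

a, b, history = train(xs, ys)
print("intercept:", round(a, 3), "slope:", round(b, 3))
print("loss at start:", round(history[0], 3), "loss at end:", history[-1])
```

The recorded loss shrinks as the passes accumulate, which mirrors the idea of learning from experience.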

Statistics and Machine Learning

Leo Breiman

1928 – 2005

Source: https://en.wikipedia.org/wiki/Leo_Breiman

Statistical Modeling: The Two Cultures
CART (Classification and Regression Trees)

What can Machine Learning do and do better?

Machine Learning can do:

Prediction
Classification
Give useful information for problem solving and decision making

Machine Learning vs. Conventional statistical methods

Source: Attewell, Paul A. & Monaghan, David B. 2015. Data Mining for the Social Sciences: an Introduction, Table 2.1, p. 27

Statistics: testing hypotheses (hypothesis confirmation)
Machine learning: finding the right hypothesis (hypothesis formation)
Overlap:
- Decision trees (C4.5 and CART)
- Nearest-neighbor methods
Bridging the two:
- Most machine learning algorithms employ statistical techniques

Machine Learning vs. Conventional statistical methods

Supervised vs. Unsupervised Machine Learning

With vs. without a known $y$
From a statistical point of view, unsupervised machine learning:
- Identifies patterns in the data
- Seeks information about $y$, the dependent variable
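The contrast can be sketched on simulated one-dimensional data. With known labels $y$, a supervised summary simply estimates one mean per class; without labels, a small k-means routine has to discover the two groups from the values alone. The data and the helper names (`class_means`, `k_means_1d`) are assumptions for this illustration.

```python
import random
from statistics import mean


def class_means(values, labels):
    """Supervised: use the known labels y to estimate one mean per class."""
    return {y: mean(v for v, lab in zip(values, labels) if lab == y)
            for y in set(labels)}


def k_means_1d(values, k=2, iterations=20, seed=0):
    """Unsupervised: group values into k clusters without using any labels."""
    centers = random.Random(seed).sample(values, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest current center
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Recompute centers; keep the old one if a cluster is empty
        centers = [mean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)


# Simulated data (an assumption): two groups centered near 0 and 10
rng = random.Random(1)
values = ([rng.gauss(0, 1) for _ in range(50)]
          + [rng.gauss(10, 1) for _ in range(50)])
labels = ["low"] * 50 + ["high"] * 50

supervised = class_means(values, labels)  # uses y
centers = k_means_1d(values, k=2)         # never sees y
print("class means (with y):", supervised)
print("cluster centers (without y):", centers)
```

When the groups are well separated, as here, the cluster centers found without labels land close to the class means computed with them.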

Illustration: Dimensionality of data

Source: https://medium.com/analytics-vidhya/classifying-malignant-or-benignant-breast-cancer-using-svm-fe36f139dd21

Illustration: Support Vector Machines

Source: https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93

Machine learning methods

Supervised Learning	Supervised Learning	Unsupervised Learning	Reinforcement Learning
Regression	Classification	Clustering	Q-Learning
Linear regression	Logistic regression	- K-Means Clustering	State Action Reward State Action (SARSA)
Polynomial regression	K-Nearest Neighbors	- Hierarchical Clustering	Deep Q-Network
Support vector regression	Support Vector Machines	Dimensionality Reduction	Markov Decision Processes
Ridge Regression	Kernel Support Vector Machines	Principal Component Analysis	Deep Deterministic Policy Gradient (DDPG)
Lasso	Naïve Bayes	Linear Discriminant Analysis	
ElasticNet	Decision Tree	Kernel PCA	
Decision tree	Random forest		
Random forest			

Machine learning methods

Text as data
- Natural Language Processing (NLP)
- e.g., speech data, tweets, social media posts
Images and videos
Spatial data via remote sensing or Light Detection and Ranging (LIDAR)
The size and complexity of the data generation process warrant machine-assisted processing and analytics