Social and Political Data Science: Introduction

Knowledge Mining 

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Introduction 

Illustration: Dimensionality of data

Illustration: Support Vector Machines

Overview

  • What is Knowledge Mining?

    • Knowledge Discovery + Data Mining

    • Statistics and Machine Learning

  • Statistical modeling: the two cultures

  • Conventional statistical methods and machine learning

  • Statistical/Machine Learning methods

What is Knowledge Mining?

  • Knowledge Discovery + Data Mining

  • Wei et al. (2003):

    • Knowledge discovery refers to the overall process of discovering useful knowledge from data

    • Data mining refers to the extraction of patterns from data.

What is Knowledge Mining?

  • Knowledge discovery can be performed on structured databases

Source: Wei, Chih-Ping, Selwyn Piramuthu, and Michael J. Shaw. "Knowledge discovery and data mining." In Handbook on Knowledge Management, pp. 157-189. Springer, Berlin, Heidelberg, 2003.

What is Knowledge Mining?

  • Data mining generally refers to the methods or techniques used to identify patterns in data.

  • These methods can be broadly grouped into several categories:

    • classification

    • clustering

    • dependency analysis

    • text mining

What is Knowledge Mining?

  • Both classes of methods are subsumed under Machine Learning now

  • More generally under Unsupervised Machine Learning

    • Pattern recognition

    • Group/class identification

    • Hypothesis formation (vs. Hypothesis confirmation)
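As an illustration of group identification without labels, the following is a minimal sketch of k-means clustering on one-dimensional data. The function names, the evenly spaced starting centroids, and the toy data are assumptions made for this example, not part of the original slides.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points]


def k_means_1d(points, k=2, n_iter=20):
    """Cluster 1-D points into k groups (requires k >= 2).

    Assumption for this sketch: centroids start evenly spaced between the
    minimum and maximum of the data, which keeps the result deterministic.
    """
    lo, hi = min(points), max(points)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster received no points.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    # Hypothetical measurements forming two loose groups.
    data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
    print(k_means_1d(data, k=2))
```

No outcome variable is supplied: the algorithm forms the groups from the data alone, which is why clustering is described as hypothesis formation rather than hypothesis confirmation.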

 

Bonferroni's principle

(roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

 

- Rajaraman A., Leskovec J. and Ullman J.  Mining of Massive Datasets
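Bonferroni's principle can be illustrated with a small simulation. The sketch below generates an outcome and many candidate predictors that are all pure noise, then counts how many candidates appear "interesting" by correlation alone. The sample size, number of candidates, and the 0.5 threshold are arbitrary assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_patterns(n_obs=20, n_candidates=1000, threshold=0.5, seed=42):
    """Count noise predictors whose |correlation| with a noise outcome exceeds threshold."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(candidate, outcome)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # Every "pattern" found here is spurious by construction.
    for m in (10, 100, 1000):
        print(m, "candidates searched ->", count_spurious_patterns(n_candidates=m))
```

Because nothing in the simulated data is related to anything else, every hit is a false discovery; searching more candidates with the same small sample can only produce more of them.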

Today we live in a data rich, information driven, knowledge strained, and wisdom scant world.

- Graham Williams 2021

- Rajaraman A., Leskovec J. and Ullman J.  Mining of Massive Datasets

Data mining overlaps with:

  • Databases: Large-scale data, simple queries
  • Machine learning: Small data, Complex models
  • CS Theory: (Randomized) Algorithms 

Different cultures:

  • To a Database person, data mining is an extreme form of analytic processing – queries that examine large amounts of data
    • Result is the query answer
  • To a Machine Learning person, data-mining is the inference of models
    • Result is the parameters of the model
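The contrast between the two cultures can be sketched on the same toy table. In the hedged example below, the records, field names, and both functions are hypothetical: the "database" view returns a query answer, while the "machine learning" view returns fitted model parameters.

```python
# Hypothetical records; ages in years, income in thousands.
records = [
    {"age": 20, "income": 30.0},
    {"age": 30, "income": 40.0},
    {"age": 40, "income": 50.0},
    {"age": 50, "income": 60.0},
]


def query_mean_income(rows, min_age):
    """Database view: the result is the answer to a query."""
    selected = [r["income"] for r in rows if r["age"] >= min_age]
    return sum(selected) / len(selected)


def fit_line(rows):
    """Machine learning view: the result is the parameters of a model.

    Fits income = intercept + slope * age by ordinary least squares.
    """
    xs = [r["age"] for r in rows]
    ys = [r["income"] for r in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope


if __name__ == "__main__":
    print("query answer:", query_mean_income(records, min_age=40))
    print("model parameters:", fit_line(records))
```

The query describes the rows that exist; the fitted parameters can also be applied to ages that are not in the table.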

Data Mining and Machine Learning

  • association rules
  • recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection), boosted trees, forests, and bootstrap forests
  • multi-layer neural network models and “deep learning” methods
  • naive Bayes classifiers and Bayesian networks
  • clustering methods, including hierarchical, k-means, nearest neighbor, linear and nonlinear manifold clustering
  • support vector machines
  • “soft modeling” or partial least squares latent variable modeling

Data Mining methods

What is machine learning?

Field of study that gives computers the ability to learn without being explicitly programmed.

- Arthur Samuel 1959

A computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program.

Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.

What is machine learning?

The ultimate goal of data modeling is to explain and predict the variable of interest using data. Machine learning pursues this goal with computer algorithms, in particular to make predictions and solve problems.

According to Carnegie Mellon Computer Science professor Tom M. Mitchell,

"Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience." 

 

What is machine learning?

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”

- Tom Mitchell. 1997. Machine Learning, McGraw Hill.

Study the past if you would define the future.

- Confucius
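Mitchell's definition can be made concrete with a small simulation. In this sketch, the task T is classifying one-dimensional points into two classes, the experience E is the number of labelled training examples, and the performance measure P is accuracy on fresh data. The two Gaussian classes and the nearest-class-mean classifier are assumptions chosen for simplicity.

```python
import random


def sample(rng, n_per_class):
    """Draw labelled points: class 0 ~ N(0, 1), class 1 ~ N(2, 1)."""
    return [(rng.gauss(2.0 * label, 1.0), label)
            for label in (0, 1) for _ in range(n_per_class)]


def train(examples):
    """Experience E: estimate the mean of each class from labelled examples."""
    means = []
    for label in (0, 1):
        xs = [x for x, y in examples if y == label]
        means.append(sum(xs) / len(xs))
    return means


def predict(means, x):
    """Task T: assign x to the class with the nearer estimated mean."""
    return 0 if abs(x - means[0]) <= abs(x - means[1]) else 1


def accuracy(means, examples):
    """Performance P: share of examples classified correctly."""
    return sum(predict(means, x) == y for x, y in examples) / len(examples)


def average_accuracy(n_per_class, n_trials=200, seed=1):
    """Average test accuracy over repeated train/test draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        model = train(sample(rng, n_per_class))
        total += accuracy(model, sample(rng, 100))
    return total / n_trials


if __name__ == "__main__":
    for n in (1, 5, 50, 200):
        print(n, "examples per class:", round(average_accuracy(n), 3))
```

Averaging over many trials is a design choice here: a single small training set can get lucky, so the improvement with experience is clearest on average.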

What is machine learning?

Machine learning is a science of the artificial. The field's main objects of study are artifacts, specifically algorithms that improve their performance with experience.

- Langley, 1996

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

- Alpaydin, 2004
 

What is machine learning?

Machine learning is an area of artificial intelligence concerned with the study of computer algorithms that improve automatically through experience. In practice, this involves creating programs that optimize a performance criterion through the analysis of data.

- Sewell, 2006

What is machine learning?

To statisticians, the “improve through experience” part is the process of validation or cross validation. Learning can be done through repeated exercises to understand data.
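A minimal sketch of k-fold cross-validation, written from scratch with the standard library, may help make the idea concrete. The function names, the choice of a straight-line model, and the simulated data are assumptions for illustration only.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n observations."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def cross_val_mse(xs, ys, k=5):
    """Mean squared prediction error on held-out folds."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in test)
    return sum(errors) / len(errors)


if __name__ == "__main__":
    rng = random.Random(3)
    xs = [float(x) for x in range(20)]
    ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
    print("5-fold CV mean squared error:", cross_val_mse(xs, ys))
```

Each observation is used for testing exactly once and is never part of the training set that predicts it, so the reported error estimates performance on data the model has not seen.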

What is machine learning?

Machine learning involves having computer or statistical programs perform repeated estimations, much as humans learn from experience and improve their actions and decisions. This is called the training process in machine learning.

Statistics, Knowledge Mining and Machine Learning

Statistics refresher

Statistics:

  • Find and test data (data production and collection)
    • Made data (surveys, experiments, interviews) based on theory and hypotheses
    • Found data (web data, social data, machine generated data) from all sources
  • Make data ready for analysis (data management)
  • Explore data (means, variances, distribution) - Descriptive statistics
  • Explain data (correlation, cross-tabulation, regression) - Inferential statistics

     

Parametric vs. Non-parametric models

  • Sample and population

  • Generalization

  • Representation

Leo Breiman

What can Machine Learning do and do better?

Machine Learning can do:

  • Prediction
  • Classification
  • Give useful information for problem solving and decision making

Machine Learning and Conventional statistical methods

  • Statistics: testing hypotheses

  • Machine learning: finding the right hypothesis

  • Overlap:
    Decision trees (C4.5 and CART)
    Nearest-neighbor methods

  • Bridging the two:
    Most machine learning algorithms employ statistical techniques

Statistical Modeling: The Two Cultures

Leo Breiman 2001: Statistical Science

  • Data Model (small data): one culture assumes that the data are generated by a given stochastic data model.

  • Algorithmic Model (complex, big data): the other uses algorithmic models and treats the data mechanism as unknown.

Theory:
Data Generation Process

Data are generated in many ways. Picture this: an independent variable x goes into one side of a box (call it nature for now) and a dependent variable y comes out the other side.

Theory:
Data Generation Process

Data Model

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:

Response variable = f(predictor variables, random noise, parameters)

That is, the response variable is a function of a series of predictor (independent) variables, random noise (for example, normally distributed errors), and parameters.
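The data modeling culture can be sketched as a two-step exercise: assume a form for f, then estimate its parameters from data. The example below assumes a linear f with normally distributed noise and simulates the data itself, so the true parameter values (intercept 1.0, slope 2.0) are hypothetical choices made for this illustration.

```python
import random


def simulate(n, intercept, slope, sigma, seed=0):
    """Draw n observations from y = intercept + slope * x + N(0, sigma) noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def estimate_ols(xs, ys):
    """Estimate (intercept, slope) by ordinary least squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


if __name__ == "__main__":
    xs, ys = simulate(n=500, intercept=1.0, slope=2.0, sigma=1.0)
    b0, b1 = estimate_ols(xs, ys)
    print("estimated intercept:", round(b0, 3), "estimated slope:", round(b1, 3))
```

The estimates are informative only to the extent that the assumed form of f matches how the data were actually generated, which is the premise this culture takes as given.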

Theory:
Data Generation Process

Data Model

The values of the parameters are estimated from the data and the model then used for information and/or prediction.

Theory:
Data Generation Process

 Algorithmic Modeling

This approach treats the inside of the box as complex and unknown. It seeks a function f(x), an algorithm that operates on x to predict the responses y.

The goal is to find an algorithm that accurately predicts y.
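A sketch of the algorithmic approach: the example below makes no assumption about the form of f and judges a k-nearest-neighbors predictor purely by its accuracy on held-out data. The sine-shaped data, the choice k = 5, and the mean-only baseline are assumptions for illustration; k-nearest neighbors stands in for any predictive algorithm.

```python
import math
import random


def make_data(n, rng):
    """Simulated 'nature': y depends on x nonlinearly, with noise."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k=5):
    """Predict y at x as the average response of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(preds, ys):
    """Mean squared prediction error."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def compare(seed=0):
    """Return held-out MSE for k-NN and for a baseline that always predicts the training mean."""
    rng = random.Random(seed)
    train_x, train_y = make_data(200, rng)
    test_x, test_y = make_data(100, rng)
    knn_preds = [knn_predict(train_x, train_y, x) for x in test_x]
    baseline = sum(train_y) / len(train_y)
    return mse(knn_preds, test_y), mse([baseline] * len(test_y), test_y)


if __name__ == "__main__":
    knn_mse, base_mse = compare()
    print("k-NN test MSE:", round(knn_mse, 3), "baseline test MSE:", round(base_mse, 3))
```

No parameters of a data-generating model are estimated or interpreted here; the algorithm is evaluated only on how well it predicts responses it has not seen.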

Theory:
Data Generation Process

 Algorithmic Modeling

Source: https://www.mathworks.com

Machine Learning and Conventional statistical methods

Source: Attewell, Paul A. & Monaghan, David B. 2015. Data Mining for the Social Sciences: an Introduction, Table 2.1, p.  27

Machine learning methods

  • Supervised Learning

    • Regression: linear regression, polynomial regression, support vector regression, ridge regression, lasso, ElasticNet, decision tree, random forest

    • Classification: logistic regression, K-Nearest Neighbors, Support Vector Machines, Kernel Support Vector Machines, Naïve Bayes, decision tree, random forest

  • Unsupervised Learning

    • Clustering: K-Means clustering, hierarchical clustering

    • Dimensionality Reduction: Principal Component Analysis, Linear Discriminant Analysis, Kernel PCA

  • Reinforcement Learning

    • Q-Learning

    • State Action Reward State Action (SARSA)

    • Deep Q-Network

    • Markov Decision Processes

    • Deep Deterministic Policy Gradient (DDPG)

Machine Learning and Conventional statistical methods

Wickham and Grolemund: Data analytics: 

  • Hypothesis generation

  • Hypothesis confirmation

Which one goes first?

Trevor Hastie and Robert Tibshirani

Bradley Efron and Trevor Hastie

Judea Pearl and Dana MacKenzie

Everyone is a teacher because the ability to teach anything connects to the ability to learn what is being taught.

- John Sibert

Learn like you are to teach.

Q & A

Question: Why R?

Answer: There are many software and platform options, including Python, Weka, SAS JMP and SPSS. R is the most accessible and is not proprietary or tied to a specific method. Coupled with other features and systems (e.g. visualization and parallel processing), it facilitates data programming and presentation in a coherent ecosystem.

Question: Can I take other courses?

Answer: Yes. Recommended: GISC6323 Machine Learning for Socio-Economic and Geo-Referenced Data by Dr. Michael Tiefelsdorf. Online: DataCamp.

Question: Is advanced math a prerequisite?

Answer: No, but it will help. Recommended: A Mathematics Course for Political and Social Research, by Will H. Moore and David A. Siegel.

Question: Why focus only on prediction? What about inference?

Answer: New developments in data science are actually putting more emphasis on inference. This course is designed to bridge the two.
