Data Science

What?

 

Why?

How?

{context}

{what}

Data Science has emerged as a successful field because of the growth of data. Industry-specific knowledge alone is no longer sufficient to manage, analyze, or make predictions from the available information.

DJ Patil, U.S. Chief Data Scientist

Healthcare

Measurement and Evaluation

Policing

Education

Data Science: a set of methods for answering questions and making decisions based on heterogeneous data

Borrowed from Joshua Blumenstock

{why?}

"If we can collect enough data about medical treatments and use that data effectively, we’ll be able to predict more accurately which treatments will be effective for which patient, and which treatments won’t."

{why not?}

{how}

{what do we do to data?}

{nothing}

Above all else, show the data

{clean}

{model}

{predict}

Machine learning and statistics are sets of tools used to ask questions about data. They leverage mathematical concepts and computational power to make inferences about relationships or predictions about unobserved contexts.

In general, you are faced with a tradeoff between:

prediction accuracy

model interpretability

more statistics

more machine learning

(this slide would make some people very mad)

Statistics

machine learning

Machine learning

statistics

Some other thoughts

Machine learning is statistics on a mac

"Machine learning is statistics minus any checking of models and assumptions" - Brian D. Ripley

All valid tools to choose from, but you must select the right tool for the task

Simple to use, difficult to use well

Today we'll consider a common machine learning task: classification.

We'll attempt to determine if an instance (observation) is a member of a particular class.

In other words, we'll be predicting a categorical variable.

Table columns: outlook, temp., humidity, windy, skip class

Let's say I want to predict if a student will come to class...

outcome: skip class

attributes or features: outlook, temp., humidity, windy

each row is an instance


Write 3 rules to classify observations as skipping/attending class 

(if FEATURE(s) is VALUE, OUTCOME is VALUE)
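One possible answer to the exercise, sketched in R. The data frame `students`, its values, and the function `classify_by_rules` are hypothetical and made up for illustration; they are not the table from the slide, and the three rules are only one of many reasonable choices.

```r
# Hypothetical toy data: illustrative values only, not the table from the slide
students <- data.frame(
  outlook  = c("sunny", "sunny", "overcast", "rainy", "rainy", "overcast"),
  temp     = c("hot", "mild", "hot", "mild", "cool", "cool"),
  humidity = c("high", "normal", "high", "high", "normal", "normal"),
  windy    = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Three hand-written rules of the form
# "if FEATURE(s) is VALUE, OUTCOME is VALUE", checked in order
classify_by_rules <- function(outlook, humidity, windy) {
  if (outlook == "sunny" && humidity == "normal") return("skip")  # rule 1
  if (outlook == "rainy" && windy) return("skip")                 # rule 2
  if (outlook == "overcast") return("attend")                     # rule 3
  "attend"  # default when no rule fires
}

# Apply the rules to every instance (row)
students$predicted <- mapply(classify_by_rules,
                             students$outlook,
                             students$humidity,
                             students$windy,
                             USE.NAMES = FALSE)
print(students)
```

Note that the order of the rules matters, and a default is needed for instances that no rule covers.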


this is a bit cumbersome....

this is awesome!

Node tests an attribute

Terminal node (leaf) assigns a classification
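A minimal sketch of that structure in R, assuming a tree stored as nested lists. The tree itself, the names `tree` and `classify`, and the example instance are hypothetical; this is not how rpart represents trees internally.

```r
# A node tests one attribute and holds one branch per value;
# a leaf is just the predicted class (a character string).
# Hypothetical tree: structure and values are illustrative only.
tree <- list(
  attribute = "outlook",
  branches = list(
    sunny = list(
      attribute = "humidity",
      branches = list(high = "attend", normal = "skip")
    ),
    overcast = "attend",
    rainy = list(
      attribute = "windy",
      branches = list(`TRUE` = "skip", `FALSE` = "attend")
    )
  )
)

classify <- function(node, instance) {
  # Terminal node (leaf): assign the classification
  if (is.character(node)) return(node)
  # Internal node: test the attribute and follow the matching branch
  value <- as.character(instance[[node$attribute]])
  classify(node$branches[[value]], instance)
}

instance <- list(outlook = "rainy", temp = "mild",
                 humidity = "high", windy = TRUE)
print(classify(tree, instance))
```

Each call either returns a class (at a leaf) or moves one level down the tree, so classifying an instance takes at most as many tests as the tree is deep.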

but how do we do it?

pick attributes that produce the most "pure" branches

repeat....

repeat....
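One common way to measure how "pure" a branch is uses Gini impurity (0 means every instance in the branch has the same class). The sketch below assumes that measure and a hypothetical toy data frame; `gini` and `split_impurity` are illustrative helper names, not library functions.

```r
# Hypothetical toy data: illustrative values only
students <- data.frame(
  outlook = c("sunny", "sunny", "overcast", "overcast", "rainy", "rainy"),
  windy   = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  skip    = c("yes", "yes", "no", "no", "yes", "no")
)

# Gini impurity of a set of labels: 1 minus the sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted average impurity of the branches produced by splitting on `attribute`
split_impurity <- function(data, attribute, outcome) {
  groups <- split(data[[outcome]], data[[attribute]])
  weights <- sapply(groups, length) / nrow(data)
  sum(weights * sapply(groups, gini))
}

# Score each candidate attribute and pick the one with the purest branches
impurities <- sapply(c("outlook", "windy"), split_impurity,
                     data = students, outcome = "skip")
print(impurities)
best <- names(which.min(impurities))
print(best)
```

The "repeat" step then applies the same scoring within each branch of the chosen attribute, until the branches are pure or no attributes remain.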

Classification in R (INFO 201 Plug)

# One of many libraries for classification / ML
library(rpart)

# Read in data
homes <- read.csv('part_1_data.csv')

# Use rpart to fit a model: predict `in_sf` using all variables
basic_fit <- rpart(in_sf ~ ., data = homes, method="class")
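Once a model is fit, a natural next step is to predict classes and compare them with the observed outcome. Since `part_1_data.csv` is not available here, this sketch assumes the `kyphosis` data set that ships with rpart as a stand-in; the variable names are illustrative.

```r
library(rpart)

# `kyphosis` ships with rpart; it stands in for the homes data here
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

# Predicted class for each instance (here, the same rows used for fitting)
predicted <- predict(fit, newdata = kyphosis, type = "class")

# Confusion matrix: observed outcome versus predicted class
confusion <- table(actual = kyphosis$Kyphosis, predicted = predicted)
print(confusion)

# Share of instances classified correctly (training accuracy)
accuracy <- mean(predicted == kyphosis$Kyphosis)
print(accuracy)
```

Accuracy measured on the rows used for fitting tends to be optimistic; holding out a separate test set gives a fairer estimate.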

Assignments

Project Proposals due Friday, anytime

Data visualization post due Monday, before class

Lab feedback session during your next lab
