Data Science
What?
Why?
How?
{context}
{what}
Data Science has emerged as a successful field because of the growth of data. Industry specific knowledge is no longer sufficient to manage, analyze, or predict values using the available information.
DJ Patil, U.S. Chief Data Scientist
Healthcare
Measurement and Evaluation
Policing
Education
Data Science: a set of methods for answering questions and making decisions based on heterogeneous data
Borrowed from Joshua Bloomenstock
{why?}
"If we can collect enough data about medical treatments and use that data effectively, we’ll be able to predict more accurately which treatments will be effective for which patient, and which treatments won’t."
{why not?}
{how}
{what do we do to data?}
{nothing}
Above all else, show the data
{clean}
{model}
{predict}
Machine learning and statistics are a set of tools used to ask questions about data. They leverage mathematical concepts and computational abilities to make inferences about relationships, or make predications about unobserved contexts.
In general, you are faced with a tradeoff between:
prediction accuracy
model interpretability
more statistics
more machine learning
(this slide would make some people very mad)
Statistics
machine learning
Machine learning
statistics
Some other thoughts
Machine learning is statistics on a mac
machine learning is statistics minus any checking of models and assumptions -Brian D. Ripley
All valid tools to choose from, but you must select the right tool for the task
Simple to use, difficult to use well
Today we'll be consider a common machine learning task: classification.
We'll attempt to determine if an instance (observation) is a member of a particular class.
In other words, we'll be predicting a categorical variable.
outlook
temp.
humidity
windy
skip class
Let's say I want to predict if a student will come to class...
outlook
temp.
humidity
windy
skip class
Let's say I want to predict if a student will come to class...
outcome
outlook
temp.
humidity
windy
skip class
Let's say I want to predict if a student will come to class...
outcome
attributes or features
each row is an instance
outlook
temp.
humidity
windy
skip class
Write 3 rules to classify observations as skipping/attending class
(if FEATURE(s) is VALUE, OUTCOME is VALUE)
outcome
attributes or features
this is a bit cumbersome....
this is awesome!
Node tests an attribute
Terminal node (leaf) assigns a classification
but how do we do it?
pick attributes that produce the most "pure" branches
repeat....
repeat....
Classification in R (INFO 201 Plug)
# One of many libraries for classification / ML
library(rpart)
# Read in data
homes <- read.csv('part_1_data.csv')
# Use rpart to fit a model: predict `in_sf` using all variables
basic_fit <- rpart(in_sf ~ ., data = homes, method="class")
Assignments
Project Proposals due Friday, anytime
Data visualization post due Monday, before class
Lab feedback session during your next lab
data-science
By Michael Freeman
data-science
- 2,322