Data Wrangling II
Outline
Review
Grouped operations
Joining data.frames
{review}
Steps for analysis
Articulate question of interest
Translate your question into code
Execute your program
{dplyr}
DPLYR's Data Manipulation Grammar
Select the columns of interest
Filter down to rows of interest
Mutate new columns
# Arguments are data.frame, then comma separated column names
my_cols <- select(df, col1, col2, col3)
# Arguments are data.frame, then comma separated boolean operators
my_rows <- filter(df, col1 > col2, col2 < col3, col4 == "hello")
# Arguments are data.frame, then comma separated sorting columns
sorted_df <- arrange(df, col1, desc(col2))
Arrange your data by a column's values
# Arguments are data.frame, then comma separated new columns
new_df <- mutate(df, combined = col1 + col2, diff = col1 - col2)
credit: Nathan Stephens, Rstudio
# Select storm and pressure columns from storms dataframe
storms <- select(storms, storm, pressure)
credit: Nathan Stephens, Rstudio
# Filter down storms to storms with name Ana or Alberto
storms <- filter(storms, storm %in% c('Ana', 'Alberto')
credit: Nathan Stephens, Rstudio
# Add ratio and inverse ratio columns
storms <- mutate(storms, ratio = pressure/wind, inverse = 1/ratio
credit: Nathan Stephens, Rstudio
# Arrange storms by wind
storms <- arrange(storms, wind)
credit: Nathan Stephens, Rstudio
Additional funcitonality
Select helper functions
# Make a data.frame
students <- data.frame(
names=c('Mason', 'Tabi', 'Bryce'),
math_exam1 = c(91, 82, 93),
math_exam2 = c(88, 79, 77),
spanish_exam1 = c(79, 88, 92),
spanish_exam2 = c(99, 92, 92)
)
# Select students + math grades
math_grades <- select(students, names, math_exam1, math_exam2)
# Better yet!
math_grades <- select(students, names, contains("math"))
# See also: starts_with, ends_with, matches
Additional funcitonality
Using select_ (or filter_, mutate_, etc.)
# Make a data.frame
students <- data.frame(
names=c('Mason', 'Tabi', 'Bryce'),
math_exam1 = c(91, 82, 93),
math_exam2 = c(88, 79, 77),
spanish_exam1 = c(79, 88, 92),
spanish_exam2 = c(99, 92, 92)
)
# Why is this useful?
exam_of_interest <- 'math_exam1'
exam_grades <- select_(students, 'names', exam_of_interest)
{exercise 1}
today's exercises inspired by Intro to dplyr documentation
Summarise
Great for calculating summaries
# Compute values of interest
summarise(students,
mean_math1 = mean(math_exam1),
mean_math2 = mean(math_exam2),
mean_math_scores=mean((math_exam1 + math_exam2) / 2)
)
Calculate one value from a set of values
# Compute values of interest
summarise(students,
num_students = n()
)
A nifty trick
Grouped operations
Group data by column values
Ask questions that compare computed values
Data needs to be in the right format
Data shapes
Wide data
Long data
Long data will let you group by column and compare computed values
Reshaping
# Reshape students from wide to long format (just FYI)
varying <- c('math_exam1', 'spanish_exam1', 'math_exam2', 'spanish_exam2')
students_long <- reshape(students,
timevar='exam',
idvar='names',
v.names = 'score',
times=varying,
varying=varying,
direction="long") %>% arrange(names, score)
credit: Nathan Stephens, Rstudio
# Group the pollution data.frame by city for comparison
pollution <- group_by(pollution, city)
credit: Nathan Stephens, Rstudio
# Group the pollution data.frame by city for comparison
pollution <- group_by(pollution, city) %>%
summarise(
mean = mean(amount, na.rm = TRUE),
sum = sum(amount, na.rm = TRUE),
n = n()
)
Grouped data
Which student has the highest average?
Which exam was most difficult?
Which student has the highest math average?
Which Spanish exam had the lowest average score?
{exercise 2}
Joins
Joins
Why is this airport information stored in a separate table?
Joins
Allow you to combine columns from multiple data sources
Foundation of relational databases
Specify the identifying columns shared by the columns
Many types of joins (easy to confuse)
Left-joins
Join two data.frames by shared identifier(s)
# These could be different!
joined_x_y <- left_join(x, y, by='identifier')
joined_y_x <- left_join(x, y, by='identifier')
Order matters!
Returns all rows in x, and all columns for x and y
# Join x and y by 'identifier'
joined <- left_join(x, y, by='identifier')
Left-joins
# Student majors
majors <- data.frame(
student_id=c(1, 2, 3),
major=c('sociology', 'math', 'biology')
)
# Student contact info
contact_info <- data.frame(
student_id=c(1,2),
cell=c('382-842-5873', '593-254-5834')
)
# Left join
joined <- left_join(majors, contact_info, by='student_id')
# Order matters!
joined <- left_join(contact_info, majors, by='student_id')
{exercise 3}
Assignments
Assignment-4: Data wrangling (due Wed. 2/3)
data-wrangling-2
By Michael Freeman
data-wrangling-2
- 1,613