Data Wrangling I

Outline

Mapping questions to operations

Operations with data.frames

The dplyr package

{mapping questions to operations}

Steps for analysis

Articulate question of interest

Translate your question into code

Execute your program

Vehicle data

Write down 5 questions you have about this dataset

Questions

Which model car had the highest highway MPG?

What were the makes of the top 5 MPG cars in 1997?

What type of fuel is used by the car with the lowest city MPG?

What is the class of 2-Wheel Drive vehicles that get > 20 miles/gallon?

Select a column

Which model car had the highest highway MPG?

What were the makes of the top 5 MPG cars in 1997?

What type of fuel is used by the car with the lowest city MPG?

What is the class of 2-Wheel Drive vehicles that get > 20 miles/gallon?

Filter rows 

Which model car had the highest highway MPG?

What were the makes of the top 5 MPG cars in 1997?

What type of fuel is used by the car with the lowest city MPG?

What is the class of 2-Wheel Drive vehicles that get > 20 miles/gallon?

A Grammar of Data Manipulation

Select particular columns

Filter down to specific rows

Arrange (sort) your dataset by values

Mutate your dataframe to add a column

Summarise your dataframe (calculate summary information, such as the mean)
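The five operations above can be sketched in base R before reaching for a package. This is only an illustration: the `cars_df` data frame and its column names are made up here and are not the vehicle dataset from the slides.

```r
# Hypothetical toy data (not the course's vehicle dataset)
cars_df <- data.frame(
  model = c("A", "B", "C", "D"),
  year  = c(1996, 1997, 1997, 1996),
  cty   = c(18, 25, 21, 30),
  hwy   = c(26, 33, 29, 38),
  stringsAsFactors = FALSE
)

# Select particular columns
model_hwy <- cars_df[, c("model", "hwy")]

# Filter down to specific rows (cars from 1997)
cars_97 <- cars_df[cars_df$year == 1997, ]

# Arrange (sort) by highway MPG, highest first
sorted_cars <- cars_df[order(cars_df$hwy, decreasing = TRUE), ]

# Mutate: add a column holding the average of city and highway MPG
cars_df$avg_mpg <- (cars_df$cty + cars_df$hwy) / 2

# Summarise: calculate the mean highway MPG
mean_hwy <- mean(cars_df$hwy)
```

Each step works, but the repeated `cars_df$` and bracket indexing make the intent harder to read, which is the gap the verbs in the next section are meant to fill.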

{exercise 1: operations with data.frames}

{dplyr}

DPLYR

"A grammar of data manipulation"

Provides verbs for common tasks

More readable, efficient code

Written by Hadley Wickham

Common Verbs

Select the columns of interest

Filter down to rows of interest

Arrange your data by a column's values

Mutate new columns

# Arguments are the data.frame, then comma-separated column names
my_cols <- select(df, col1, col2, col3)
# Arguments are the data.frame, then comma-separated logical conditions
my_rows <- filter(df, col1 > col2, col2 < col3, col4 == "hello")
# Arguments are the data.frame, then comma-separated sorting columns
sorted_df <- arrange(df, col1, desc(col2))
# Arguments are the data.frame, then comma-separated new columns
new_df <- mutate(df, combined = col1 + col2, diff = col1 - col2)
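The verb calls above use a placeholder `df`. A minimal runnable sketch, assuming the dplyr package is installed and using a made-up data frame whose column names are purely illustrative, might look like this:

```r
library(dplyr)

# Hypothetical data frame; names and values are illustrative only
df <- data.frame(
  name = c("a", "b", "c", "d"),
  col1 = c(5, 2, 9, 4),
  col2 = c(3, 7, 1, 4),
  stringsAsFactors = FALSE
)

# select: keep only the name and col1 columns
my_cols <- select(df, name, col1)

# filter: keep rows where every condition is TRUE
# (comma-separated conditions are combined with AND)
my_rows <- filter(df, col1 > col2, col2 < 4)

# arrange: sort rows by col1, smallest first
sorted_df <- arrange(df, col1)

# mutate: add new columns computed from existing ones
new_df <- mutate(df, combined = col1 + col2, diff = col1 - col2)
```

None of these calls modify `df` in place; each returns a new data frame, which is why the results are assigned to new variables.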

credit: Nathan Stephens, RStudio

# Select storm and pressure columns from storms dataframe
storms <- select(storms, storm, pressure)


# Filter down storms to storms with name Ana or Alberto
storms <- filter(storms, storm %in% c('Ana', 'Alberto'))


# Add ratio and inverse ratio columns
storms <- mutate(storms, ratio = pressure/wind, inverse = 1/ratio)


# Arrange storms by wind
storms <- arrange(storms, wind)

An example

Some sample data

Who got raises?

# Create a vector of 100 employees ("Employee 1", "Employee 2", ...)
employees <- paste('Employee', 1:100)

# Create a vector of 2014 salaries using the runif function
salaries_2014 <- runif(100, 40000, 50000)

# Create a vector of 2015 salaries that are typically higher than in 2014
salaries_2015 <- salaries_2014 + runif(100, -5000, 10000)

# Create a data.frame 'salaries' by combining these vectors 
salaries <- data.frame(employees, salaries_2014, salaries_2015)
# Mutate to flag which employees got a raise (TRUE/FALSE)
salaries <- mutate(salaries, raise = salaries_2015 > salaries_2014)
# Filter down to the employees who got raises
filter(salaries, raise == TRUE)
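As a possible extension of the salary example, the sketch below also records how large each raise was and sorts by it. It assumes dplyr is installed; the `set.seed()` call, the `raise_amount` column, and the `got_raises` and `raise_summary` names are additions for illustration and are not part of the original slides.

```r
library(dplyr)

# Fix the random seed so the simulated salaries are reproducible
set.seed(1)

# Rebuild the sample data from the slides
employees <- paste('Employee', 1:100)
salaries_2014 <- runif(100, 40000, 50000)
salaries_2015 <- salaries_2014 + runif(100, -5000, 10000)
salaries <- data.frame(employees, salaries_2014, salaries_2015)

# Add the size of each raise (hypothetical column) alongside the TRUE/FALSE flag
salaries <- mutate(
  salaries,
  raise_amount = salaries_2015 - salaries_2014,
  raise = raise_amount > 0
)

# Keep only employees who got a raise, largest raise first
got_raises <- arrange(filter(salaries, raise), desc(raise_amount))

# Summarise: how many employees got a raise, and the mean raise among them
raise_summary <- summarise(
  got_raises,
  n_raises = n(),
  mean_raise = mean(raise_amount)
)
```

Because `raise` is already a logical column, `filter(salaries, raise)` behaves the same as `filter(salaries, raise == TRUE)`.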

{exercise 2}

Chaining methods

What we've been doing

# What is the class of the vehicle with the best hwy mpg in 1996?
best_car_96 <- filter(vehicles, 
                    year == 1996, 
                    hwy == max(hwy[year == 1996])
               )
class_name <- select(best_car_96, class)

Chaining methods

Nesting functions

# What is the class of the vehicle with the best hwy mpg in 1996?
class_name <- select(
                 filter(vehicles, 
                    year == 1996, 
                    hwy == max(hwy[year == 1996])
                 ), 
                class
              )


Chaining methods

The pipe operator

%>%


Chaining methods

The pipe operator

# What is the class of the vehicle with the best hwy mpg in 1996?
class_name <- filter(vehicles, 
                    year == 1996, 
                    hwy == max(hwy[year == 1996])
               ) %>%
               select(class)

The pipe passes the result of one function in as the first argument to the next function
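A chain can also start from the data frame itself, so that every step reads top to bottom. The sketch below assumes dplyr is installed and uses a small made-up `vehicles` data frame with `class`, `year`, and `hwy` columns as a stand-in for the real dataset; the values are invented.

```r
library(dplyr)

# Hypothetical stand-in for the vehicles data (values are invented)
vehicles <- data.frame(
  class = c("Compact", "SUV", "Midsize", "Compact"),
  year  = c(1996, 1996, 1996, 1997),
  hwy   = c(31, 22, 35, 40),
  stringsAsFactors = FALSE
)

# Start the chain with the data frame; each step's result becomes
# the first argument of the next function
class_name <- vehicles %>%
  filter(year == 1996) %>%
  filter(hwy == max(hwy)) %>%
  select(class)

# The same question answered with nested calls, for comparison
class_name_nested <- select(
  filter(vehicles, year == 1996, hwy == max(hwy[year == 1996])),
  class
)
```

Splitting the filter into two piped steps means the second `max(hwy)` only sees 1996 rows, so the `hwy[year == 1996]` subsetting from the nested version is no longer needed.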

{exercise 3}

Assignments

Assignment-4: Data wrangling (due Wed. 2/3)

data-wrangling-1

By Michael Freeman
