Abdullah Fathi
A wrong data class prevents calculations from being performed
Missing data prevents functions from working properly
Outliers corrupt the output and produce bias
The size of the data requires too much computation
data.frame
data.table
tibble (created with data_frame() in older versions of the tibble package)
MCAR - Missing Completely At Random
MAR - Missing At Random
MNAR - Missing Not At Random
Deleting
Suitable when the share of NAs is < 5% and they are MCAR
Hot Deck Imputation
Imputation with mean
Replacing NAs with the mean results in less bias
Interpolation
Replacement is identified via a pre-defined algorithm
Use a function with a built-in NA handling feature
Use a basic function dedicated to NA handling
Use an advanced imputation tool
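The first two options above can be sketched in base R. This is a minimal illustration on a made-up vector `x`; the names `x_deleted` and `x_imputed` are assumptions for this example, and an advanced imputation tool is not shown.

```r
# Hypothetical vector with two missing values
x <- c(4, 8, NA, 6, NA, 2)

# 1. Function with built-in NA handling: mean() accepts na.rm
mean_built_in <- mean(x, na.rm = TRUE)

# 2. Basic functions dedicated to NA handling
n_missing <- sum(is.na(x))            # count the NAs
x_deleted <- as.numeric(na.omit(x))   # deleting: drop the NAs

# Imputation with the mean: replace each NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
```

Deleting shortens the vector, while mean imputation keeps its length, which matters when the vector is a column of a larger table.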
An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism - Hawkins
An outlier is an observation that is significantly different from the generating mechanism or a statistical process - Statistician
Even a couple of outliers can completely change the result
Leave them as they are
Delete them
Exchange them with another value
Deleting and exchanging methods have their drawbacks with small samples
Assumption: The data point is faulty
An outlier totally out of scale might be a wrong measurement
Outliers can be valid measurements and they might reveal hidden potentials
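As a rough sketch of the three options (leave, delete, exchange), the following base R code flags outliers with the common 1.5 x IQR rule. The rule and the sample vector are assumptions chosen for illustration; the notes above do not prescribe a specific detection method.

```r
# Hypothetical measurements with one value far out of scale
x <- c(10, 12, 11, 13, 12, 95)

# Flag values more than 1.5 * IQR outside the quartiles (assumed rule)
q <- unname(quantile(x, c(0.25, 0.75)))
fence <- 1.5 * IQR(x)
is_outlier <- x < q[1] - fence | x > q[2] + fence

x_left      <- x                                            # leave them as they are
x_deleted   <- x[!is_outlier]                               # delete them
x_exchanged <- replace(x, is_outlier, median(x_deleted))    # exchange with the median
```

With only six observations, deleting or exchanging a single point changes the sample noticeably, which is the drawback with small samples mentioned above.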
Does the tidyverse offer a solution to all data science tasks and challenges?
Importing Data
Data Cleaning
Data Visualization
Custom Functions
%$%
Alternative to attach()
%<>%
Alternative to assign()
%T>%
Inserting an intermediary step
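A short sketch of the three magrittr operators listed above, assuming the magrittr package is installed. It uses the built-in `mtcars` dataset; the variable names are made up for this example.

```r
library(magrittr)

# %$% exposes the columns of the left-hand side to the expression on the right,
# so no attach() is needed
r <- mtcars %$% cor(mpg, wt)

# %<>% pipes the value into the function and assigns the result back to v
v <- c(9, 16, 25)
v %<>% sqrt()

# %T>% runs an intermediary step (here print) and passes the original value on
total <- c(1, 2, 3) %T>% print() %>% sum()
```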
A "data.frame" and a new column can be of different length -> recycling
A "tibble" and the newly provided data have to be of equal length -> no recycling
Exception: single value
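The difference can be checked with a small experiment. This sketch assumes a recent tibble version (3.0 or later), where assigning a column of incompatible length raises an error; the error is captured with `tryCatch()` so the script keeps running.

```r
library(tibble)

# data.frame: a length-2 vector is recycled to fill 4 rows
df <- data.frame(id = 1:4)
df$flag <- c("a", "b")

# tibble: a single value is the exception and is repeated for every row
tb <- tibble(id = 1:4)
tb$flag <- "a"

# tibble: a length-2 vector for 4 rows is rejected (assumes tibble >= 3.0)
recycle_error <- tryCatch(
  { tb$pair <- c("a", "b"); FALSE },
  error = function(e) TRUE
)
```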
Raw Data
Tidy Data
Different data types and classes require specific cleaning methods
Each variable has its own column
Each observation has its own row
Each observational unit forms one table
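To illustrate the three rules above, here is a minimal sketch that reshapes a made-up wide table into tidy form with `tidyr::pivot_longer()`. The table `raw` and its column names are hypothetical.

```r
library(tidyr)

# Raw data: the years are spread across column names instead of being a variable
raw <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(15, 25),
  check.names = FALSE
)

# Tidy data: one column per variable, one row per observation
tidy <- pivot_longer(
  raw,
  cols = c("2019", "2020"),
  names_to = "year",
  values_to = "value"
)
```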
Character/String: Text data or data R considers text ("string")
Library "dplyr"
We may have many sources of input data, and at some point we need to combine them. A join with dplyr adds variables to the right of the original dataset. The beauty of dplyr is that it handles four types of joins, similar to SQL
left_join()
right_join()
inner_join()
full_join()
install.packages("dplyr")
left_join(df_primary, df_secondary, by = 'ID')
right_join(df_primary, df_secondary, by = 'ID')
left_join(df_primary, df_secondary, by = c('ID', 'year'))
We can have multiple keys in our dataset, such as a year or a list of products bought by the customer. In that case, pass every key to the by argument, as in the call above.
inner_join(df_primary, df_secondary, by = 'ID')
When we know that the two datasets will not match completely, we can return only the rows that exist in both datasets
full_join(df_primary, df_secondary, by = 'ID')
The full_join() function keeps all observations and fills the values that have no match with NA.
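The four joins can be compared side by side on two small tables. The contents of `df_primary` and `df_secondary` below are made up for illustration; only the `ID` key comes from the calls above.

```r
library(dplyr)

# Hypothetical inputs: ID 3 exists only in the primary table, ID 4 only in the secondary
df_primary   <- data.frame(ID = c(1, 2, 3), y = c("a", "b", "c"))
df_secondary <- data.frame(ID = c(1, 2, 4), z = c(10, 20, 40))

left  <- left_join(df_primary, df_secondary, by = "ID")   # keeps IDs 1, 2, 3
right <- right_join(df_primary, df_secondary, by = "ID")  # keeps IDs 1, 2, 4
inner <- inner_join(df_primary, df_secondary, by = "ID")  # keeps IDs 1, 2
full  <- full_join(df_primary, df_secondary, by = "ID")   # keeps IDs 1, 2, 3, 4
```

Rows without a partner in the other table receive NA in the columns that come from that table, for example `z` for ID 3 in `left`.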
The function summarise() is compatible with subsetting.
Another useful function to aggregate the variable is sum().
Spread in the data is computed with the standard deviation or sd() in R
Access the minimum and the maximum of a vector with the functions min() and max().
Counting observations by group is always a good idea. With dplyr, we can aggregate the number of occurrences with n()
Select the first, last or nth position of a group
The function nth() is complementary to first() and last(). We can access the nth observation within a group by passing the index to return
The function n() returns the number of observations in the current group. A function closely related to n() is n_distinct(), which counts the number of unique values
A summary statistic can be computed across multiple groups
Before we intend to do an operation, we can filter the dataset
We need to remove the grouping with ungroup() before changing the level of the computation
The syntax of summarise() is basic and consistent with the other verbs included in the dplyr library
summarise(df, variable_name=condition)
# arguments:
# - `df`: Dataset used to construct the summary statistics
# - `variable_name=condition`: Formula to create the new variable
group_by works perfectly with all the other verbs (i.e. mutate(), filter(), arrange(), ...)
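The verbs described above can be combined in one pipeline. This is a sketch on the built-in `mtcars` dataset; the filter threshold and the output column names are arbitrary choices for illustration.

```r
library(dplyr)

# Filter first, then group, then summarise each group
by_cyl <- mtcars %>%
  filter(hp > 60) %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg   = mean(mpg),        # basic
    sd_mpg     = sd(mpg),          # variation
    min_hp     = min(hp),          # range
    max_hp     = max(hp),
    first_mpg  = first(mpg),       # position
    second_mpg = nth(mpg, 2),
    n_cars     = n(),              # count
    n_gears    = n_distinct(gear)
  )

# Remove the grouping before computing at the level of the whole dataset
with_means <- mtcars %>%
  group_by(cyl) %>%
  mutate(group_mean = mean(mpg)) %>%
  ungroup() %>%
  mutate(overall_mean = mean(mpg))
```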
| Objective | Function | Description |
|---|---|---|
| Basic | mean() | Average of vector x |
| | median() | Median of vector x |
| | sum() | Sum of vector x |
| Variation | sd() | Standard deviation of vector x |
| | IQR() | Interquartile range of vector x |
| Range | min() | Minimum of vector x |
| | max() | Maximum of vector x |
| | quantile() | Quantile of vector x |
| Position | first() | Use with group_by(). First observation of the group |
| | last() | Use with group_by(). Last observation of the group |
| | nth() | Use with group_by(). nth observation of the group |
| Count | n() | Use with group_by(). Count the number of rows |
| | n_distinct() | Use with group_by(). Count the number of distinct observations |
Graphs are an incredible tool to simplify complex analysis
Graphs are the third part of the process of data analysis. The first part is about data extraction, and the second part deals with cleaning and manipulating the data. Finally, we need to visualize our results graphically.
ggplot2 is very flexible and incorporates many themes and plot specifications at a high level of abstraction. With ggplot2, we can't plot 3-dimensional graphics or create interactive graphics
ggplot(data, mapping=aes()) +
geometric object
# arguments:
# data: Dataset used to plot the graph
# mapping: Control the x and y-axis
# geometric object: The type of plot you want to show. The most common objects are:
# - Point: `geom_point()`
# - Bar: `geom_bar()`
# - Line: `geom_line()`
# - Histogram: `geom_histogram()`
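A concrete instance of the template above, as a minimal sketch on the built-in `mtcars` dataset. The choice of `wt` and `mpg` is only for illustration.

```r
library(ggplot2)

# data = mtcars, mapping = weight on the x-axis and mileage on the y-axis,
# geometric object = points (a scatter plot)
p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()
```

Printing `p` in an interactive session draws the plot. Swapping `geom_point()` for another geometric object changes the plot type while the data and mapping stay the same.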
One way to make our data less sensitive to outliers is to rescale it, for example with a log transformation
We can add another level of information to the graph. We can plot the fitted value of a linear regression.
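Both ideas can be sketched together: the variables are rescaled with `log()` inside `aes()`, and `geom_smooth(method = "lm")` adds the fitted line of a linear regression. The log transformation and the `mtcars` columns are assumptions chosen for this example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = log(wt), y = log(mpg))) +
  geom_point() +
  # Fitted values of a linear regression of log(mpg) on log(wt), without the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```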
Graphs need to be informative and carry good labels. We can add labels with the labs() function
labs(title = "Hello Fathi")
# arguments:
# - title: Set the title of the graph
# - subtitle: Add a subtitle below the title
# - caption: Add a caption below the graph
# - x: Rename the x-axis
# - y: Rename the y-axis
# Example: labs(title = "Hello Fathi", subtitle = "My first plot")
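A short sketch that sets all five label arguments on one plot. The caption and axis texts are made up for this example.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title    = "Hello Fathi",
    subtitle = "My first plot",
    caption  = "Source: mtcars dataset",   # hypothetical caption text
    x        = "Weight (1000 lbs)",
    y        = "Miles per gallon"
  )
```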
The ggplot2 library includes eight themes, for example theme_bw(), theme_minimal() and theme_classic()
ggsave("my_fantastic_plot.png")
Store the graph right after plotting it
A box plot helps to visualize the distribution of the data by quartile and to detect the presence of outliers
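A minimal box plot sketch on `mtcars`, with one box per number of cylinders. The grouping variable is an arbitrary choice for illustration.

```r
library(ggplot2)

# factor(cyl) turns the numeric cylinder count into a categorical x-axis;
# points beyond the whiskers are drawn individually as potential outliers
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```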
A bar chart is a great way to display categorical variables on the x-axis. This type of graph can denote two aspects on the y-axis: a count of observations per group, or a value of a variable for each group.
ggplot(data, mapping = aes()) +
geometric object
# arguments:
# data: dataset used to plot the graph
# mapping: Control the x and y-axis
# geometric object: The type of plot you want to show. The most common objects are:
# - Point: `geom_point()`
# - Bar: `geom_bar()`
# - Line: `geom_line()`
# - Histogram: `geom_histogram()`
# - `stat`: Control the statistical transformation. By default, `count` plots the number of rows per group on the y-axis. To plot the values of a y variable, pass `stat = "identity"`
# - `alpha`: Control the transparency of the color
# - `fill`: Change the fill color of the bar
# - `size`: Control the thickness of the bar outline (named `linewidth` in recent ggplot2 versions)
Four arguments can be passed to customize the graph
Represent the group of variables with values in the y-axis
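The two uses of a bar chart can be sketched as follows. The colors, the `alpha` value and the summary table `avg` are assumptions for this example.

```r
library(ggplot2)

# 1. Count per group: the default stat counts the rows in each category
p_count <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue", alpha = 0.7)

# 2. Values on the y-axis: compute a summary first, then plot it as-is
avg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
p_value <- ggplot(avg, aes(x = factor(cyl), y = mpg)) +
  geom_bar(stat = "identity", fill = "coral")
```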
install.packages("leaflet")
# to install the development version from GitHub, run
# devtools::install_github("rstudio/leaflet")
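Once the package is installed, a basic interactive map can be built in a few calls. This is a minimal sketch; the coordinates and popup text are arbitrary values chosen for illustration.

```r
library(leaflet)

# Start an empty map widget, add the default OpenStreetMap tiles,
# then place one marker (longitude/latitude are illustrative values)
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng = 101.6869, lat = 3.1390, popup = "Kuala Lumpur")
```

Printing `m` in RStudio or an R Markdown document renders the map.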
There are no secrets to success. It is the result of preparation, hard work, and learning from failure. - Colin Powell