Abdullah Fathi
R is widely used (by statisticians, scientists, and social scientists) and has the widest statistical functionality of any software
As a scripting language, R is very powerful, flexible, and easy to use
install.packages("package name")
# Install dplyr if it is not already available, then load it
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}
# Declare variables of different types
# Numeric
x <- 28
class(x)
A variable can store a number, an object, a statistical result, a vector, a dataset, a model prediction: basically anything R outputs
We can use that variable later simply by calling the name of the variable
To declare a variable, we need to assign a variable name. The name should not contain spaces. We can use _ to connect two words
OPERATOR | DESCRIPTION | EXAMPLE |
---|---|---|
x + y | y added to x | 2 + 3 = 5 |
x - y | y subtracted from x | 8 - 2 = 6 |
x * y | x multiplied by y | 3 * 2 = 6 |
x / y | x divided by y | 10 / 5 = 2 |
x ^ y (or x ** y) | x raised to the power y | 2 ^ 5 = 32 |
x %% y | remainder of x divided by y (x mod y) | 7 %% 3 = 1 |
x %/% y | x divided by y but rounded down (integer divide) | 7 %/% 3 = 2 |
We can add as many conditional statements as we like, but we need to enclose them in parentheses
# Create a vector from 1 to 10
logical_vector <- c(1:10)
logical_vector > 5
# Print the values strictly greater than 5
logical_vector[logical_vector > 5]
The simplest way to build a vector in R is to use the c() function
# Numerical
vec_num <- c(1, 10, 49)
vec_num
# Character
vec_chr <- c("a", "b", "c")
vec_chr
# Boolean
vec_bool <- c(TRUE, FALSE, TRUE)
vec_bool
# Slice the first five elements of the vector
slice_vector <- c(1,2,3,4,5,6,7,8,9,10)
slice_vector[1:5]
The shortest way to create a range of values is to use ':' between two numbers
# Quick way to create a sequence of adjacent values
c(1:10)
A matrix is a 2-dimensional array that has m rows and n columns. In other words, a matrix is a combination of two or more vectors with the same data type.
matrix(data, nrow, ncol, byrow = FALSE)
# Arguments:
# - data: The collection of elements that R will arrange into the rows and columns of the matrix
# - nrow: Number of rows
# - ncol: Number of columns
# - byrow: If TRUE, the matrix is filled by rows (left to right). With the default `byrow = FALSE`, the matrix is filled by columns, i.e. the values are filled from top to bottom.
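For the examples below we need a matrix; here is a hypothetical one (the 5x2 shape is chosen so that the cbind() call further down, which appends a 5-element column, works):
# Construct a 5x2 matrix, filled row by row (example data)
matrix_a <- matrix(1:10, byrow = TRUE, nrow = 5)
matrix_a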
# Print dimension of the matrix with dim()
dim(matrix_a)
# concatenate c(1:5) to the matrix_a
matrix_a1 <- cbind(matrix_a, c(1:5))
# Check the dimension
dim(matrix_a1)
matrix_c <- matrix(1:12, byrow = FALSE, ncol = 3)
# Create a vector with 3 elements (one per column of the matrix)
add_row <- c(1:3)
# Append to the matrix
matrix_c <- rbind(matrix_c, add_row)
# Check the dimension
dim(matrix_c)
We can select one or many elements from a matrix by using the square brackets [ ].
A factor is a type of variable used to represent qualitative/categorical data.
R stores categorical variables as factors. Character values are not supported by machine learning algorithms; the usual approach is to convert strings to integers (factor levels).
factor(x = character(), levels, labels = levels, ordered = is.ordered(x))
# Arguments:
# - x: A vector of data. Need to be a string or integer, not decimal.
# - Levels: A vector of possible values taken by x. This argument is optional. The default value is the unique list of items of the vector x.
# - Labels: Add a label to the x data. For example, 1 can take the label `male` while 0, the label `female`.
# - ordered: Determine if the levels should be ordered.
A categorical variable has several values but the order does not matter. For instance, a male or female categorical variable has no natural ordering.
Ordinal categorical variables do have a natural ordering. We can specify the order of the levels, from lowest to highest, with the levels argument and set ordered = TRUE to create an ordered factor.
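For instance, a minimal sketch with made-up survey answers:
# Hypothetical ordinal data, from lowest to highest: low < medium < high
answers <- c("low", "high", "medium", "low")
factor_answers <- factor(answers, levels = c("low", "medium", "high"), ordered = TRUE)
levels(factor_answers)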
Continuous class variables are the default value in R. They are stored as numeric or integer.
This is the most commonly used member of the data types family. It is used to store tabular data. It is different from a matrix: in a matrix, every element must have the same class, whereas in a data frame each column can be a vector of a different class. In this sense, every column of a data frame acts like a list element.
We can create a data frame by passing the variables a, b, c, d into the data.frame() function. We can name the columns with names() and simply specify the names of the variables.
data.frame(df, stringsAsFactors = TRUE)
# arguments:
# -df: It can be a matrix to convert as a data frame or a collection of variables to join
# -stringsAsFactors: Convert string to factor by default
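As a hypothetical example (the values are made up, but the four variables match the column names used below):
# Example data for the data frame used in the following snippets
a <- c(1, 2, 3, 4)                                 # ID
b <- c("book", "pen", "textbook", "pencil_case")   # items
c <- c("store_A", "store_B", "store_A", "store_B") # store (reusing the name c works, but shadowing c() is best avoided)
d <- c(2.5, 8, 10, 7)                              # price
df <- data.frame(a, b, c, d)
df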
# Name the data frame
names(df) <- c('ID', 'items', 'store', 'price')
A data frame is composed of rows and columns, df[A, B]. A represents the rows and B the columns. We can slice either by specifying the rows and/or columns.
Below we show how to access different selections of the data frame:
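For example, with the hypothetical df built above:
df[1, 2]     # first row, second column
df[1:2, ]    # first two rows, all columns
df[, 1]      # all rows, first column
df[1:3, 3:4] # first three rows, columns 3 and 4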
You can also append a column to a Data Frame. You need to use the symbol $ to append a new variable.
# Create a new vector
quantity <- c(10, 35, 40, 5)
# Add `quantity` to the `df` data frame
df$quantity <- quantity
df
Sometimes, we need to store a column of a data frame for future use or perform operations on it. We can use the $ sign to select the column from a data frame
# Select the column ID
df$ID
We can also keep only the observations that meet a condition by using the subset() function
subset(x, condition)
# arguments:
# - x: data frame used to perform the subset
# - condition: define the conditional statement
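A small sketch, again with the hypothetical df from above:
# Keep only the rows where price is strictly above 5
subset(df, price > 5)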
We can get a quick look at the bottom of the data frame with tail() function. By analogy, head() displays the top of the data frame
head(df)
tail(df)
A list is a great tool to store many kinds of objects in the order expected. We can include matrices, vectors, data frames or other lists.
list(element_1, ...)
# arguments:
# -element_1: store any type of R object
# -...: pass as many objects as needed; each object must be separated by a comma
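As an illustration, we can build the list used below from a vector, a matrix and the hypothetical df created earlier:
vect <- c(1:5)
mat <- matrix(1:9, ncol = 3)
my_list <- list(vect, mat, df)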
# Print second element of the list
my_list[[2]]
We can sort a vector of continuous or factor values in either ascending or descending order.
order() returns the indices of the vector in sorted order
order(x):
# Argument:
# -x: A vector containing continuous or factor variable
The result of sort() is a vector containing the elements of the original (unsorted) vector in sorted order
sort(x, decreasing = FALSE, na.last = TRUE):
# Argument:
# -x: A vector containing continuous or factor variable
# -decreasing: Control for the order of the sort method. By default, decreasing is set to `FALSE`.
# -na.last: Indicates whether the `NA` 's value should be put last or not
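A quick illustration of the difference between order() and sort():
x <- c(3, 1, 4, 1, 5)
order(x)                   # indices that would sort x: 2 4 1 3 5
sort(x)                    # sorted values: 1 1 3 4 5
sort(x, decreasing = TRUE) # 5 4 3 1 1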
Data analysis can be divided into three parts: data extraction, data cleaning and manipulation, and data visualization.
We may have many sources of input data, and at some point, we need to combine them. A join with dplyr adds variables to the right of the original dataset. The beauty of dplyr is that it handles four types of joins, similar to SQL:
left_join()
right_join()
inner_join()
full_join()
install.packages("dplyr")
left_join(df_primary, df_secondary, by ='ID')
right_join(df_primary, df_secondary, by = 'ID')
inner_join(df_primary, df_secondary, by ='ID')
inner_join() is useful when we want to keep only the rows whose key exists in both datasets, avoiding any unmatched rows
full_join(df_primary, df_secondary, by = 'ID')
The full_join() function keeps all observations and replaces missing values with NA.
We can also have multiple keys in our dataset. Consider a dataset where a customer has purchases in several years; in that case, we join on both keys:
left_join(df_primary, df_secondary, by = c('ID', 'year'))
arrange(.data, ...)
# Argument:
# - .data: data frame variable.
# - ...: Comma separated list of unquoted variable names.
The dplyr function arrange() can be used to reorder (sort) rows by one or more variables.
4 functions to tidy our data using tidyr
gather(): Transform the data from wide to long
spread(): Transform the data from long to wide
separate(): Split one variable into two
unite(): Unite two variables into one
install.packages("tidyr")
The objective of the gather() function is to transform the data from wide to long
gather(Messy, quarter, growth, q1_2017:q4_2017)
The spread() function does the opposite of gather.
spread(data, key, value)
# arguments:
# data: The data frame used to reshape the dataset
# key: Column to reshape long to wide
# value: The column whose values are used to fill the new columns
The separate() function splits a column into two according to a separator. This function is helpful in some situations where the variable is a date.
separate(data, col, into, sep= "", remove = TRUE)
# arguments:
# -data: The data frame used to reshape the dataset
# -col: The column to split
# -into: The name of the new variables
# -sep: Indicates the symbol used that separates the variable, i.e.: "-", "_", "&"
# -remove: Remove the old column. By default sets to TRUE.
The unite() function concatenates two columns into one.
unite(data, col, conc ,sep= "", remove = TRUE)
# arguments:
# -data: The data frame used to reshape the dataset
# -col: Name of the new column
# -conc: Name of the columns to concatenate
# -sep: Indicates the symbol used that unites the variable, i.e: "-", "_", "&"
# -remove: Remove the old columns. By default, sets to TRUE
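A minimal sketch of separate() and unite() on a hypothetical data frame holding dates as single strings:
library(tidyr)
# Made-up data: one date column in "YYYY-MM" format
df_date <- data.frame(date = c("2024-01", "2024-02"))
df_split <- separate(df_date, date, into = c("year", "month"), sep = "-")
df_split
# Reverse the operation
unite(df_split, "date", year, month, sep = "-")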
Normally, we have data from multiple sources. To perform an analysis, we need to merge two data frames together on one or more common key variables.
A full match returns only the values that have a counterpart in the other table; values without a match are not returned in the new data frame. A partial match, however, keeps the unmatched rows and returns NA for the missing values.
We will see a simple inner join. The inner join keyword selects records that have matching values in both tables. To join two datasets, we can use merge() function
merge(x, y, by.x = x, by.y = y)
# Arguments:
# -x: The origin data frame
# -y: The data frame to merge
# -by.x: The column used for merging in x data frame. Column x to merge on
# -by.y: The column used for merging in y data frame. Column y to merge on
It is common for two data frames not to share exactly the same key values. With full matching, the merged data frame contains only the rows found in both the x and y data frames.
With partial merging, it is possible to keep the rows with no matching rows in the other data frame. These rows will have NA in the columns that are usually filled with values from y. We can do that by setting all.x = TRUE
The merge() function allows four ways of combining data:
Natural join: To keep only rows that match from the data frames, specify the argument all=FALSE.
Full outer join: To keep all rows from both data frames, specify all=TRUE.
Left outer join: To include all the rows of your data frame x and only those from y that match, specify all.x=TRUE.
Right outer join: To include all the rows of your data frame y and only those from x that match, specify all.y=TRUE.
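A minimal sketch with two made-up data frames:
producers <- data.frame(surname = c("Spielberg", "Scorsese"), nationality = c("US", "US"))
movies <- data.frame(surname = c("Spielberg", "Scorsese", "Kubrick"), title = c("Jaws", "Taxi Driver", "2001"))
# Natural (inner) join on the common key
merge(producers, movies, by.x = "surname", by.y = "surname")
# Right outer join: keep every row of `movies`; Kubrick gets NA for nationality
merge(producers, movies, by.x = "surname", by.y = "surname", all.y = TRUE)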
A function, in a programming environment, is a set of instructions. A programmer builds a function to avoid repeating the same task, or reduce complexity.
A function:
is written to carry out a specified task
may or may not include arguments
contains a body
may or may not return one or more values.
A general approach to a function is to use the argument part as inputs, feed the body part and finally return an output
function (arglist) {
#Function body
}
There are a lot of built-in functions in R. R matches your input parameters with its function arguments, either by name or by position, then executes the function body. Function arguments can have default values: if you do not specify these arguments, R will take the default value.
It is possible to see the source code of a function by running the name of the function itself in the console.
We will see three groups of function in action
General function
Maths function
Statistical function
We are already familiar with general functions like cbind(), rbind(), range(), sort() and order(). Each of these functions has a specific task and takes arguments to return an output.
FUNCTION | DESCRIPTION |
---|---|
abs(x) | Takes the absolute value of x |
log(x, base = y) | Takes the logarithm of x with base y; if base is not specified, returns the natural logarithm |
exp(x) | Returns the exponential of x |
sqrt(x) | Returns the square root of x |
factorial(x) | Returns the factorial of x (x!) |

FUNCTION | DESCRIPTION |
---|---|
mean(x) | Mean of x |
median(x) | Median of x |
var(x) | Variance of x |
sd(x) | Standard deviation of x |
scale(x) | Standard scores (z-scores) of x |
quantile(x) | The quartiles of x |
summary(x) | Summary of x: mean, min, max, etc. |
On some occasions, we need to write our own function because we have to accomplish a particular task and no ready-made function exists. A user-defined function involves a name, arguments and a body
function.name <- function(arguments)
{
  # computations on the arguments
  # some other code
}
We define a simple square function below. The function accepts a value and returns the square of the value
square_function<- function(n)
{
# compute the square of integer `n`
n^2
}
# calling the function and passing value 4
square_function(4)
Code Explanation:
The function is named square_function; it can be called whatever we want.
It receives an argument "n". We didn't specify the type of variable so that the user can pass an integer, a vector or a matrix
The function takes the input "n" and returns the square of the input.
When we are done using the function, we can remove it with the rm() function.
rm(square_function)
square_function
In R, the environment is a collection of objects like functions, variables, data frame, etc.
The top-level environment available is the global environment, called R_GlobalEnv. We also have local environments, created for example each time a function is called.
# List the content of the current environment
ls(environment())
The function f returns the output 15. This is because y is defined in the global environment. Any variable defined in the global environment can be used locally. The variable y has the value of 10 during all function calls and is accessible at any time
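The behaviour described above can be reproduced with a minimal sketch (the exact function body is an assumption consistent with the text):
# y lives in the global environment
y <- 10
f <- function(x) {
  x + y # the global y is visible inside the function
}
f(5) # returns 15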
Let's see what happens if the variable y is defined inside the function.
We need to drop `y` with rm() prior to running this code.
The output is also 15 when we call f(5) but returns an error when we try to print the value y. The variable y is not in the global environment.
Finally, R uses the most recent variable definition to pass inside the body of a function
R ignores the y values defined outside the function because we explicitly created a y variable inside the body of the function.
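A sketch of this shadowing behaviour:
y <- 100           # defined in the global environment
f <- function(x) {
  y <- 10          # local definition takes precedence inside the body
  x + y
}
f(5)               # returns 15: the local y (10) is used, not the global y (100)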
We can write a function with more than one argument. Consider the function called "times". It is a straightforward function multiplying two variables.
times <- function(x,y) {
x*y
}
times(2,4)
We should write a function when we need to perform the same task many times.
Sometimes, we need to include conditions into a function to allow the code to return different outputs.
# Example:
split_data <- function(df, train = TRUE)
# Arguments:
# -df: Define the dataset
# -train: Specify if the function returns the train set or test set. By default, set to TRUE
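A minimal sketch of such a function (the 80/20 split ratio is an assumption, not something specified above):
split_data <- function(df, train = TRUE) {
  # Assumption: the first 80% of rows form the train set, the remaining 20% the test set
  n_rows <- nrow(df)
  cut_off <- as.integer(n_rows * 0.8)
  if (train) {
    df[1:cut_off, ]
  } else {
    df[(cut_off + 1):n_rows, ]
  }
}
# Example usage with a built-in dataset
train_set <- split_data(mtcars, train = TRUE)
test_set <- split_data(mtcars, train = FALSE)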
sqldf() from the package sqldf allows the use of SQLite queries to select and manipulate data in R
install.packages("sqldf")
A control structure ‘controls’ the flow of code / commands written inside a function
IF, ELSE, ELSEIF Statement
For Loop
While Loop
This structure is used to test a condition
if (<condition>){
##do something
} else {
##do something
}
We can further customize the control level with the else if statement. With else if, we can add as many conditions as we want
if (condition1) {
expr1
} else if (condition2) {
expr2
} else if (condition3) {
expr3
} else {
expr4
}
This structure is used when a loop is to be executed a fixed number of times. It is commonly used for iterating over the elements of an object (list, vector)
for (i in vector){
#do something
}
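For example:
# Print the numbers 1 to 5
for (i in 1:5) {
  print(i)
}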
It begins by testing a condition and executes the body only if the condition is true. After each iteration, the condition is tested again
#initialize a condition
Age <- 12
#check if age is less than 17
while (Age < 17){
print(Age)
Age <- Age + 1 # Increment Age; once Age reaches 17 the condition becomes FALSE and the loop ends
}
The apply() family belongs to the R base package and is populated with functions to manipulate slices of data from matrices, arrays, lists and data frames in a repetitive way. These functions allow crossing the data in a number of ways and avoid explicit use of loop constructs
We use apply() over a matrix
apply(X, MARGIN, FUN)
# Arguments:
# -X: an array or matrix
# -MARGIN: takes a value or range between 1 and 2 to define where to apply the function:
#   - MARGIN = 1: the manipulation is performed on rows
#   - MARGIN = 2: the manipulation is performed on columns
#   - MARGIN = c(1, 2): the manipulation is performed on rows and columns
# -FUN: tells which function to apply. Built-in functions like mean, median, sum, min, max and even user-defined functions can be applied
The code apply(m1, 2, sum) applies the sum function over the columns of the 5x6 matrix m1 and returns the sum of each column
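A sketch, with m1 built as a hypothetical 5x6 matrix:
# Made-up 5x6 matrix
m1 <- matrix(1:30, nrow = 5, ncol = 6)
# Sum of each column
apply(m1, 2, sum)
# Sum of each row
apply(m1, 1, sum)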
The l in lapply() stands for list.
The difference between lapply() and apply() lies in the output returned.
The output of lapply() is a list.
lapply() can be used for other objects like data frames and lists.
lapply() function does not need MARGIN.
lapply(X, FUN)
# Arguments:
# -X: A vector or an object
# -FUN: Function applied to each element of x
sapply() function does the same jobs as lapply() function but returns a vector.
sapply(X, FUN)
# Arguments:
# -X: A vector or an object
# -FUN: Function applied to each element of x
The sapply() function is more efficient than lapply() in the output returned because sapply() stores the values directly into a vector. But this is not always the case.
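A small illustration of the difference in output type (the movie titles are made up):
movies <- c("SPYDERMAN", "BATMAN", "VERTIGO", "CHINATOWN")
lapply(movies, tolower) # returns a list
sapply(movies, tolower) # returns a (named) character vector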
Function | Arguments | Objective | Input | Output |
---|---|---|---|---|
apply | apply(x, MARGIN, FUN) | Apply a function to the rows or columns or both | Data frame or matrix | vector, list, array |
lapply | lapply(X, FUN) | Apply a function to all the elements of the input | List, vector or data frame | list |
sapply | sapply(X, FUN) | Apply a function to all the elements of the input | List, vector or data frame | vector or matrix |
We can use lapply() or sapply() interchangeably to slice a data frame.
The function tapply() computes a measure (mean, median, min, max, etc.) or a function for each level of a factor variable in a vector
tapply(X, INDEX, FUN = NULL)
# Arguments:
# -X: An object, usually a vector
# -INDEX: A list containing factor
# -FUN: Function applied to each element of x
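For example, with the built-in iris dataset:
# Median Sepal.Width for each level of the factor Species
tapply(iris$Sepal.Width, iris$Species, median)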
Mnemonics
Data can exist in various formats. For each format, R has a specific function and arguments.
One of the most widely used data storage formats is the .csv (comma-separated values) file.
R loads an array of libraries during start-up, including the utils package.
This package makes it convenient to open CSV files with the read.csv() function
read.csv(file, header = TRUE, sep = ",")
# argument:
# -file: PATH where the file is stored
# -header: confirm if the file has a header or not, by default, the header is set to TRUE
# -sep: the symbol used to split the variable. By default, `,`.
If your .csv file is stored locally, you can replace the PATH inside the code snippet. Don't forget to wrap it inside ' '. The PATH needs to be a string value.
Excel files are very popular among data analysts. Spreadsheets are easy to work with and flexible. R is equipped with a library readxl to import Excel spreadsheet
read_excel(PATH, sheet = NULL, range= NULL, col_names = TRUE)
# arguments:
# -PATH: Path where the excel is located
# -sheet: Select the sheet to import. By default, the first sheet
# -range: Select the cell range to import. By default, all non-empty cells
# -col_names: Use the first row as column names (TRUE, the default), or supply a character vector of names
We can find out which sheets are available in the workbook by using excel_sheets() function
Use n_max argument to return n rows
Use range argument combined with cell_rows or cell_cols
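The snippets below assume that example holds the path to an Excel workbook; one option is the example file bundled with readxl:
library(readxl)
# Path to a workbook shipped with the readxl package (assumption: we use the bundled datasets.xlsx)
example <- readxl_example("datasets.xlsx")
excel_sheets(example)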
# Read the first five rows: with header
iris <- read_excel(example, n_max = 5, col_names = TRUE)
If we change col_names to FALSE, R creates the headers automatically
# Read the first five rows: without header
iris_no_header <- read_excel(example, n_max = 5, col_names = FALSE)
In the data frame iris_no_header, R created five new variables named
X__1, X__2, X__3, X__4 and X__5
# Read the cell range A1 to B5
example_1 <- read_excel(example, range = "A1:B5", col_names = TRUE)
dim(example_1)
# Read rows 1 to 5
example_2 <-read_excel(example, range =cell_rows(1:5),col_names =TRUE)
dim(example_2)
iris_row_with_header <-read_excel(example, range =cell_rows(2:3), col_names=TRUE)
iris_row_no_header <-read_excel(example, range =cell_rows(2:3),col_names =FALSE)
# Select columns A and B
col <-read_excel(example, range =cell_cols("A:B"))
dim(col)
read_excel() returns NA when a symbol without numerical value appears in a cell. We can count the number of missing values with a combination of two functions: sum() and is.na()
iris_na <-read_excel(example, na ="setosa")
sum(is.na(iris_na))
There are various R packages that can be used to communicate with RDBMS, each with a different level of abstraction.
Some of these packages have the functionality of copying entire data frames to and from databases.
Some of the packages available on CRAN for importing data from Relational Database are:
RMySQL/RMariaDB
RODBC
ROracle
RPostgreSQL
RSQLite (This package is used for the bundled DBMS SQLite)
RJDBC (This package uses Java and can connect to any DBMS with a JDBC driver)
PL/R
RpgSQL
RMongo (This is an R interface for Java Client with MongoDB)
The RMySQL package is an interface to the MySQL DBMS. The current version of this package requires the DBI package to be pre-installed
dbGetQuery sends a query and fetches the result as a data frame.
dbSendQuery only sends the query and returns an object of a class inheriting from "DBIResult"; this object can then be used to fetch the required result.
dbClearResult removes the result from cache memory.
fetch returns some or all of the rows requested by the query, as a data frame.
dbHasCompleted is used to check whether all the rows have been retrieved.
dbReadTable and dbWriteTable are used to read and write database tables from and to an R data frame.
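A minimal connection sketch (host, user, password, database and table names are placeholders to replace with your own):
library(DBI)
library(RMySQL)
con <- dbConnect(MySQL(), host = "localhost", user = "user",
                 password = "password", dbname = "my_db")
df_from_db <- dbGetQuery(con, "SELECT * FROM my_table") # result as a data frame
dbDisconnect(con)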
When we want to import data into R, it is useful to follow the checklist below. It will make it easy to import data correctly into R:
The typical format for a spreadsheet is to use the first row as the header (usually the variable names).
Avoid names with blank spaces; they can be interpreted as separate variables. Prefer '_' or '-' instead.
Short names are preferred
Do not include symbols in names, e.g. exchange_rate_$_€ is not correct. Prefer to name it exchange_rate_dollar_euro
Use NA for missing values; otherwise, we will need to clean the format later.
Library | Objective | Function | Default Arguments |
---|---|---|---|
utils | Read CSV file | read.csv() | file, header = TRUE, sep = "," |
readxl | Read Excel file | read_excel() | path, range = NULL, col_names = TRUE |
haven | Read SAS file | read_sas() | path |
haven | Read STATA file | read_stata() | path |
haven | Read SPSS file | read_sav() | path |
Function | Objectives | Arguments |
---|---|---|
read_excel() | Read n number of rows | n_max = 10 |
read_excel() | Select rows and columns like in Excel | range = "A1:D10" |
read_excel() | Select rows with indexes | range = cell_rows(1:3) |
read_excel() | Select columns with letters | range = cell_cols("A:C") |
Missing values arise when an observation is missing in a column of a data frame or contains a character value instead of a numeric value. Missing values must be dropped or replaced in order to draw correct conclusions from the data
The fourth verb in the dplyr library, mutate(), is helpful to create a new variable or change the values of an existing variable.
We will proceed in two parts: we will learn how to exclude missing values from a data frame and how to impute them with the mean or the median.
mutate(df, name_variable_1 = condition, ...)
# arguments:
# -df: Data frame used to create a new variable
# -name_variable_1: Name and the formula to create the new variable
# -...: No limit constraint. Possibility to create more than one variable inside mutate()
The na.omit() function is a simple way to exclude missing observations. Dropping all the NA values from the data is easy, but it does not mean it is the most elegant solution. During analysis, it is wise to use a variety of methods to deal with missing values
We could also impute (populate) missing values with the median or the mean. A good practice is to create two separate variables for the mean and the median. Once created, we can replace the missing values with the newly formed variables
We therefore have three methods to deal with missing values: drop them, impute them with the mean, or impute them with the median
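A sketch of the three methods, using a made-up data frame with missing prices:
library(dplyr)
df_na <- data.frame(price = c(10, NA, 8, NA, 12))
# Method 1: drop the rows containing NA
df_drop <- na.omit(df_na)
# Methods 2 and 3: impute with the mean or the median
df_impute <- df_na %>%
  mutate(price_mean = ifelse(is.na(price), mean(price, na.rm = TRUE), price),
         price_median = ifelse(is.na(price), median(price, na.rm = TRUE), price))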
To export data to the hard drive, we need the file path and an extension. If we do not specify a full path, R saves the data directly into the working directory:
directory <-getwd()
directory
write.csv(df, path)
# arguments
# -df: Dataset to save. Need to be the same name of the data frame in the environment.
# -path: A string. Set the destination path. Path + filename + extension i.e. "/Users/USERNAME/Downloads/mydata.csv" or the filename + extension if the folder is the same as the working directory
library(xlsx)
write.xlsx(df, "file-name.xlsx")
The library xlsx uses Java to create the file. Java needs to be installed if it is not already present on your machine
save(x, file = "file-name.RData")
# arguments:
# x: Variable (object) to save
# file: File name, including the ".RData" extension
We can save a data frame or any other R object using the save() function
install.packages("googledrive")
Install the googledrive library to access the functions that allow us to interact with Google Drive
drive_upload(file, path = NULL, name = NULL)
# arguments:
# - file: Full name of the file to upload (i.e., including the extension)
# - path: Location of the file
# - name: Name to give the file on Drive. By default, it is the local name.
In the RStudio console, you can see a summary of the steps performed. Google successfully uploads the file stored locally to the Drive and assigns an ID to each file in the drive.
drive_browse("table_car")
drive_download(file, path = NULL, overwrite = FALSE)
# arguments:
# - file: Name or id of the file to download
# -path: Location to download the file. By default, it is downloaded to the working directory and the name as in Google Drive
# -overwrite = FALSE: If the file already exists, don't overwrite it. If set to TRUE, the old file is erased and replaced by the new one.
Downloading a file from Google Drive by its ID is convenient, and we can get the ID from the file name
Library | Objective | Function |
---|---|---|
base | Export CSV | write.csv() |
xlsx | Export Excel | write.xlsx() |
haven | Export SPSS | write_sav() |
haven | Export SAS | write_sas() |
haven | Export Stata | write_dta() |
base | Export R object | save() |
googledrive | Upload to Google Drive | drive_upload() |
googledrive | Open in Google Drive | drive_browse() |
googledrive | Retrieve file ID | drive_get(as_id()) |
googledrive | Download from Google Drive | drive_download() |
googledrive | Remove file from Google Drive | drive_rm() |
rdrop2 | Authentication | drop_auth() |
rdrop2 | Create a folder | drop_create() |
rdrop2 | Upload to Dropbox | drop_upload() |
rdrop2 | Read csv from Dropbox | drop_read_csv() |
rdrop2 | Delete file from Dropbox | drop_delete() |
A summary of a variable is important to get an idea about the data
Before we compute the summary, we will do the following steps to prepare the data:
Use the glimpse() function to have an idea about the structure of the dataset
The syntax of summarise() is basic and consistent with the other verbs included in the dplyr library
summarise(df, variable_name=condition)
# arguments:
# - `df`: Dataset used to construct the summary statistics
# - `variable_name=condition`: Formula to create the new variable
group_by works perfectly with all the other verbs (i.e. mutate(), filter(), arrange(), ...)
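For example, with the built-in mtcars dataset:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())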
Objective | Function | Description |
---|---|---|
Basic | mean() | Average of vector x |
Basic | median() | Median of vector x |
Basic | sum() | Sum of vector x |
Variation | sd() | Standard deviation of vector x |
Variation | IQR() | Interquartile range of vector x |
Range | min() | Minimum of vector x |
Range | max() | Maximum of vector x |
Range | quantile() | Quantile of vector x |
Position | first() | Use with group_by(). First observation of the group |
Position | last() | Use with group_by(). Last observation of the group |
Position | nth() | Use with group_by(). nth observation of the group |
Count | n() | Use with group_by(). Count the number of rows |
Count | n_distinct() | Use with group_by(). Count the number of distinct observations |
The function summarise() is compatible with subsetting.
Another useful function to aggregate the variable is sum().
Spread in the data is computed with the standard deviation or sd() in R
Access the minimum and the maximum of a vector with the function min() and max().
Counting observations by group is always a good idea. With R, we can aggregate the number of occurrences with n()
Select the first, last or nth position of a group
The function nth() is complementary to first() and last(). We can access the nth observation within a group with the index to return
The function n() returns the number of observations in the current group. A function closely related to n() is n_distinct(), which counts the number of distinct values
A summary statistic can be realized among multiple groups
Before we intend to do an operation, we can filter the dataset
We need to remove the grouping before we want to change the level of the computation
We don't necessarily need all the variables, and a good practice is to select only the variables you find relevant
#- `select(df, A, B ,C)`: Select the variables A, B and C from df dataset.
#- `select(df, A:C)`: Select all variables from A to C from df dataset.
#- `select(df, -C)`: Exclude C from the dataset from df dataset.
The filter() verb helps to keep the observations following a criteria
filter(df, condition)
# arguments:
# - df: dataset used to filter the data
# - condition: Condition used to filter the data
The creation of a dataset requires a lot of operations, such as importing, merging, selecting, filtering, and so on.
The dplyr library comes with a practical operator, %>%, called the pipeline. The pipeline feature makes the manipulation clean, fast and less error-prone, as sketched below.
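A sketch of a pipeline chaining several verbs on the built-in mtcars dataset:
library(dplyr)
mtcars %>%
  select(mpg, cyl, wt) %>%
  filter(cyl > 4) %>%
  arrange(desc(mpg))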
The arrange() verb can reorder rows by one or many variables, either ascending (default) or descending
# - `arrange(A)`: Ascending sort of variable A
# - `arrange(A, B)`: Ascending sort of variable A and B
# - `arrange(desc(A), B)`: Descending sort of variable A and ascending sort of B
Graphs are an incredible tool to simplify complex analysis
Graphs are the third part of the process of data analysis. The first part is about data extraction, the second part deals with cleaning and manipulating the data. At last, we need to visualize our results graphically.
ggplot2 is very flexible, incorporating many themes and plot specifications at a high level of abstraction. However, with ggplot2 we cannot plot 3-dimensional graphics or create interactive graphics
ggplot(data, mapping=aes()) +
geometric object
# arguments:
# data: Dataset used to plot the graph
# mapping: Control the x and y-axis
# geometric object: The type of plot you want to show. The most common object are:
# - Point: `geom_point()`
# - Bar: `geom_bar()`
# - Line: `geom_line()`
# - Histogram: `geom_histogram()`
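A minimal example with the built-in mtcars dataset:
library(ggplot2)
# Scatter plot of miles per gallon against weight
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()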
One solution to make our data less sensitive to outliers is to rescale them
We can add another level of information to the graph. We can plot the fitted value of a linear regression.
Graphs need to be informative and have good labels. We can add labels with the labs() function
labs(title = "Hello Fathi")
# argument:
# - title: Control the title. It is possible to change or add title with:
# - subtitle: Add subtitle below title
# - caption: Add caption below the graph
# - x: rename x-axis
# - y: rename y-axis
# Example: labs(title = "Hello Fathi", subtitle = "My first plot")
The library ggplot2 includes eight themes:
ggsave("my_fantastic_plot.png")
ggsave() stores the graph right after we plot it
Box plot helps to visualize the distribution of the data by quartile and detect the presence of outliers
A bar chart is a great way to display categorical variables on the x-axis. This type of graph denotes two aspects on the y-axis.
ggplot(data, mapping = aes()) +
geometric object
# arguments:
# data: dataset used to plot the graph
# mapping: Control the x and y-axis
# geometric object: The type of plot you want to show. The most common objects are:
# - Point: `geom_point()`
# - Bar: `geom_bar()`
# - Line: `geom_line()`
# - Histogram: `geom_histogram()`
# - `stat`: Control the type of formatting. By default, `bin` to plot a count in the y-axis. For continuous values, pass `stat = "identity"`
# - `alpha`: Control the transparency of the color
# - `fill`: Change the color of the bar
# - `size`: Control the size of the bar
Four arguments can be passed to customize the graph
Represent the group of variables with values in the y-axis
install.packages("leaflet")
# to install the development version from Github, run
# devtools::install_github("rstudio/leaflet")
R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents
For more details on using R Markdown see http://rmarkdown.rstudio.com.
Some advantages of using R Markdown:
R code can be embedded in the report, so it is not necessary to keep the report and R script separately. Including the R code directly in a report provides structure to analyses.
The report text is written as normal text, so no knowledge of HTML coding is required.
The output is an HTML file that includes pictures, code blocks, R output and text. No additional files are needed; everything is incorporated in the HTML file, so it is easy to send the report via email or publish it on a website.
These HTML reports enhance collaboration: it is much easier to comment on an analysis when the R code, the R output and the plots are available in the report.
Open R Studio, then go to
File → New File → R Markdown.
*italic*
**bold**
_italic_
__bold__
# Header 1
## Header 2
### Header 3
#### Header 4
##### Header 5
###### Header 6
1. Item 1
2. Item 2
3. Item 3
+ Item 3a
+ Item 3b
* Item 1
* Item 2
+ Item 2a
+ Item 2b
```{r}
summary(cars$dist)
summary(cars$speed)
```
There were `r nrow(cars)` cars studied
> Put some quote here
[title](http://www.google.com)
```
Some text here
```
knitr::kable(dataset)
![Fotia Logo](img/logo kuda-01.png)
```{r fig.width=1, fig.height=10,echo=FALSE}
library(png)
library(grid)
img <- readPNG("img/logo kuda-01.png")
grid.raster(img)
```
The argument echo specifies whether the R commands are included (default is TRUE). Adding echo=FALSE in the opening line of the R code block will not include the command:
```{r, echo=FALSE}
```{r plot2, fig.width = 8, fig.height = 4, echo=FALSE}
par(mfrow = c(1, 3))
plot(cars)
image(volcano, col = terrain.colors(50))
```
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot
Although no HTML coding is required, HTML could optionally be used to format the report. The HTML tags are interpreted by the browser after the .Rmd file is converted into an HTML file. Some examples:
Shiny is a means of creating web applications entirely in R.
The client-server communication, HTML layout and JavaScript programming are entirely handled by Shiny.
This makes creating web applications feasible for those who are not experienced web developers
install.packages("shinydashboard")
library(shinydashboard)
## ui.R ##
library(shinydashboard)
dashboardPage(
dashboardHeader(),
dashboardSidebar(),
dashboardBody()
)
A dashboard has three parts:
a header, a sidebar, and a body.
## app.R ##
library(shiny)
library(shinydashboard)
ui <- dashboardPage(
dashboardHeader(),
dashboardSidebar(),
dashboardBody()
)
server <- function(input, output) { }
shinyApp(ui, server)
dashboardHeader(title = "My Dashboard")
Setting the title is simple;
just use the title argument
Links in the sidebar can be used like tabPanels from Shiny. That is, when we click on a link, it will display different content in the body of the dashboard.
## ui.R ##
sidebar <- dashboardSidebar(
sidebarMenu(
menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
menuItem("Widgets", icon = icon("th"), tabName = "widgets",
badgeLabel = "new", badgeColor = "green")
)
)
body <- dashboardBody(
tabItems(
tabItem(tabName = "dashboard",
h2("Dashboard tab content")
),
tabItem(tabName = "widgets",
h2("Widgets tab content")
)
)
)
# Put them together into a dashboardPage
dashboardPage(
dashboardHeader(title = "Simple tabs"),
sidebar,
body
)
There are no secrets to success. It is the result of preparation, hard work, and learning from failure. - Colin Powell