R Script
Introduction and Practice -1
Shawn Chen 2016.8.13
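Introduction to R
Facing the oncoming wave of data, international companies including Google, Facebook, Intel, Pfizer, and Bank of America have adopted R for data analysis, and many leading universities such as Stanford, Johns Hopkins, and UCLA treat R as a prerequisite for their data analysis courses. R is free, cross-platform, widely used, and highly extensible, and a wide variety of R communities are thriving. In the polls of the well-known KDnuggets forum, R has ranked first for three consecutive years as the data analysis language most often used by data scientists.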

IEEE Spectrum Rank

Let's try it!
If you have not finished downloading RStudio, use this link.

RStudio Interface

Try it!

> x <- 10
> y <- "IBM"
> z <- rnorm(1000, mean = 100, sd = 5)
>
rnorm

> z <- rnorm(1000, mean = 100, sd = 5)
> ?rnorm
> z
> hist(z)
>
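Because rnorm() draws random values, the histogram differs on every run. The following is a minimal sketch, assuming a reproducible result is wanted: set.seed() is not used in the slides, and the seed value 42 and the breaks setting are arbitrary choices for illustration.

```r
# Fix the random seed so the same 1000 values are drawn on every run.
# The seed value 42 is an arbitrary choice, not part of the original slides.
set.seed(42)
z <- rnorm(1000, mean = 100, sd = 5)

# The sample statistics should be close to, but not exactly, 100 and 5.
mean(z)
sd(z)

# breaks is only a suggestion; R may adjust it to round bin boundaries.
hist(z, breaks = 30, main = "rnorm(1000, mean = 100, sd = 5)", xlab = "z")
```

With the seed fixed, repeated runs produce the same vector z and therefore the same plot.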
Exercise 1
Display a histogram of 1000 values drawn from a normal distribution with mean 10 and standard deviation 10.
Exercise 1

> hist(rnorm(1000, mean = 10, sd = 10))
>
R Data Structure
R's Basic Data Types
- Integer
- Numeric
- Complex
- Character
- Logical
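The five basic types listed above can be checked with class(). This is a small sketch; the variable names are hypothetical and chosen only for illustration.

```r
# One value of each basic type; the names are illustrative only.
int_val <- 10L       # integer: the L suffix marks an integer literal
num_val <- 10.5      # numeric (double precision)
cpx_val <- 3 + 2i    # complex
chr_val <- "IBM"     # character
lgl_val <- TRUE      # logical

class(int_val)  # "integer"
class(num_val)  # "numeric"
class(cpx_val)  # "complex"
class(chr_val)  # "character"
class(lgl_val)  # "logical"
```

Note that a plain 10 without the L suffix is stored as numeric, which is why class(x) reports "numeric" in the slides that follow.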
Data Types
> class(x)
[1] "numeric"
> class(y)
[1] "character"
>

Data Types
> class(x)
[1] "numeric"
> class(z)
[1] "numeric"
>

Data Types
> x <- 10
> class(x)
[1] "numeric"
> x <- as.integer(x)
> class(x)
[1] "integer"
> y <- as.integer(y)
Warning message:
NAs introduced by coercion
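The warning above appears because "IBM" cannot be read as a number, so the result is NA. A minimal sketch of how this might be checked after conversion, assuming a hypothetical input vector that mixes numeric and non-numeric strings:

```r
# Hypothetical input: one string that looks like a number, one that does not.
raw <- c("42", "IBM")

# suppressWarnings() hides the coercion warning; the NA is still produced.
converted <- suppressWarnings(as.integer(raw))
converted         # 42 NA

# is.na() shows which elements failed to convert.
is.na(converted)  # FALSE TRUE

# Converting a numeric value drops the fractional part instead of rounding.
as.integer(3.9)   # 3
```

Checking is.na() after a conversion is one way to find bad values without relying on the warning message.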

General Data Structures
- Vector
- Matrix
- Array
- List
- Data Frame
Vector
The basic data object in R, consisting of one or more values of a single data type.
Matrix
A two-dimensional object of a single data type.
Array
A multi-dimensional object of a single data type.
Data Frame
A special kind of named list in which all elements have the same length.
List
A list can contain objects of any data type, including multi-dimensional ones.
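The definitions above can be illustrated with one small object of each kind. This is a sketch only; the object names and values are hypothetical.

```r
# One small object per structure; names and values are illustrative.
v  <- c(1.5, 2, 3)                                    # vector: one data type
m  <- matrix(1:6, nrow = 2, ncol = 3)                 # matrix: two dimensions
a  <- array(1:24, dim = c(3, 4, 2))                   # array: three dimensions here
df <- data.frame(id = 1:3, name = c("a", "b", "c"))   # data frame: equal-length columns
l  <- list(v, m, df)                                  # list: mixed contents

dim(m)                 # 2 3
dim(a)                 # 3 4 2
is.list(df)            # TRUE: a data frame is a list of columns
sapply(df, length)     # every column has length 3
length(l)              # 3
```

The is.list(df) check reflects the definition of a data frame as a named list whose elements all have the same length.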
Practice
Create Vector
> V <- c(10, 5, 3, 1, 0)
> class(V)
[1] "numeric"
> V <- as.integer(V)
> class(V)
[1] "integer"
>

Create Vector
> V2 <- c(1, 2, NA, NA, 5)
> V2[1]
[1] 1
> V2[4]
[1] NA
>
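NA marks a missing value, and most summary functions return NA when one is present. A short sketch of the usual ways to detect and skip missing values, using the same V2 vector:

```r
V2 <- c(1, 2, NA, NA, 5)

# is.na() flags the missing positions.
is.na(V2)               # FALSE FALSE TRUE TRUE FALSE
sum(is.na(V2))          # 2 missing values

# A single NA makes the mean NA unless na.rm = TRUE is given.
mean(V2)                # NA
mean(V2, na.rm = TRUE)  # (1 + 2 + 5) / 3

# Keep only the non-missing values.
V2[!is.na(V2)]          # 1 2 5
```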

Create Array
> A <- 1:24
> dim(A) <- c(3, 4, 2)
> A
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

>

Create Array
> A <- array(1:24, c(3, 4, 2))
> A
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

>
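Elements of the array are addressed as [row, column, layer], and leaving an index empty selects everything along that dimension. A brief sketch using the same array:

```r
A <- array(1:24, c(3, 4, 2))

# Single element: row 2, column 3, layer 1.
A[2, 3, 1]          # 8

# Leaving the first two indices empty returns the whole second layer,
# which is a 3 x 4 matrix.
layer2 <- A[, , 2]
dim(layer2)         # 3 4

# Last element of the array.
A[3, 4, 2]          # 24
```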

Create DataFrame
> id <- 1:5
> age <- c(27, 18, 25, 40, 25)
> sex <- c("Male", "Female", "Female", "Male", "Female")
> name <- c("Shawn", "Luna", "Asu", "Alex", "Claire")
> X <- data.frame(id, age, sex, name)
> X
  id age    sex   name
1  1  27   Male  Shawn
2  2  18 Female   Luna
3  3  25 Female    Asu
4  4  40   Male   Alex
5  5  25 Female Claire
>

Edit DataFrame
> X <- edit(X)
>


Edit DataFrame
> X$age[4] <- 39
> X
  id age    sex   name
1  1  27   Male  Shawn
2  2  18 Female   Luna
3  3  25 Female    Asu
4  4  39   Male   Alex
5  5  25 Female Claire
>
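Editing by row number works only while the row order is known. A minimal sketch of the same change made with a logical condition instead, plus a filter on rows; the adults name is hypothetical.

```r
X <- data.frame(
  id   = 1:5,
  age  = c(27, 18, 25, 40, 25),
  sex  = c("Male", "Female", "Female", "Male", "Female"),
  name = c("Shawn", "Luna", "Asu", "Alex", "Claire")
)

# Update Alex's age by matching on the name rather than on row number 4.
X$age[X$name == "Alex"] <- 39

# Keep only the rows where age is at least 25 (illustrative filter).
adults <- X[X$age >= 25, ]
nrow(adults)   # 4
```

The condition inside the brackets is a logical vector with one entry per row, so the same pattern also works when several rows match.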


Exercise 2
> install.packages("swirl")
> library(swirl)
> install_course_github("shiyoubun","CDL-TW_BDA-R")
> swirl()
>


R Data Import/Export
Working Directory
> setwd("/Users/shawn/r language/workspace/demo/")
> getwd()
[1] "/Users/shawn/r language/workspace/demo"
> setwd("folder")
> getwd()
[1] "/Users/shawn/r language/workspace/demo/folder"
>
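setwd() returns the previous working directory invisibly, which makes it easy to switch back. A small sketch, assuming the session's temporary directory stands in for a real project folder:

```r
# Remember where we started.
start_wd <- getwd()

# setwd() returns the old working directory invisibly, so it can be saved.
# tempdir() is used here only as a directory that is guaranteed to exist.
old_wd <- setwd(tempdir())
getwd()

# file.path() builds a relative path without hard-coding separators by hand.
file.path("folder", "output.csv")   # "folder/output.csv"

# Switch back to the original working directory.
setwd(old_wd)
```

A relative path such as "folder" in setwd("folder") is always resolved against the current working directory, as in the slide above.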

Import CSV files
Download the CSV file into the working directory
> getwd()
[1] "/Users/shawn/r language/workspace/demo/folder"
>

read.csv
> Y <- read.csv("city-of-chicago-salaries.csv")
> View(Y)
>


write.csv
> write.csv(Y,"output.csv")
> View(Y)
>

write.csv
> write.csv(Y,"output2.csv", row.names=FALSE)
> View(Y)
>
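The effect of row.names = FALSE shows up when the file is read back. A minimal sketch with a hypothetical two-row data frame and temporary files, so nothing is written to the working directory:

```r
# Hypothetical data and temporary file paths for illustration.
people <- data.frame(name = c("Shawn", "Luna"), age = c(27, 18))
with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(people, with_rows)                       # row names written as first column
write.csv(people, without_rows, row.names = FALSE) # data columns only

# The unnamed row-name column comes back as a column called "X".
names(read.csv(with_rows))      # "X" "name" "age"
names(read.csv(without_rows))   # "name" "age"
```

Writing without row names keeps the file identical in shape to the data frame, which avoids an extra column on every round trip.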

Let's do some data processing

aggregate
> ?aggregate
>

Remove $ and commas from the salary column
> Z <- Y
> Z$Employee.Annual.Salary <- as.numeric(gsub("[\\$,]", "", Z$Employee.Annual.Salary))
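The conversion can be tried on a small vector before applying it to the whole column. The salary strings below are made-up sample values, not taken from the Chicago data set.

```r
# Made-up salary strings in the same "$12,345.00" style.
salary_text <- c("$84,450.00", "$103,590.00")

# gsub() replaces every match of the pattern; the character class
# matches dollar signs and commas, which are replaced by nothing.
salary_clean <- gsub("[\\$,]", "", salary_text)
salary_clean                       # "84450.00" "103590.00"

# Only after the symbols are gone can the text be converted to numbers.
salary_num <- as.numeric(salary_clean)
salary_num                         # 84450 103590
```

Calling as.numeric() directly on the original strings would give NA with a coercion warning, for the same reason as as.integer("IBM") earlier.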
New data frame with column names
> AGGR <- aggregate(Z$Employee.Annual.Salary,
+     by = list(Z$Position.Title), FUN = mean)
> View(AGGR)
> AGGR <- setNames(AGGR, c("Position.Title",
+     "Annual.Salary"))
> View(AGGR)
>
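aggregate() also accepts a formula, which keeps the original column names and makes the setNames() step optional. This is a sketch of that alternative on a tiny made-up data frame; the staff object and its values are hypothetical.

```r
# Made-up data with the same column names as the salary file.
staff <- data.frame(
  Position.Title         = c("CLERK", "CLERK", "ENGINEER"),
  Employee.Annual.Salary = c(40000, 50000, 90000)
)

# Formula interface: "salary grouped by position title".
AGGR_demo <- aggregate(Employee.Annual.Salary ~ Position.Title,
                       data = staff, FUN = mean)

# The result keeps the column names from the formula.
names(AGGR_demo)   # "Position.Title" "Employee.Annual.Salary"
AGGR_demo
```

One difference to keep in mind: the formula interface drops rows with missing values by default, while the by = list(...) form passes them on to FUN.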

Draw histogram
> hist(AGGR$Annual.Salary)
>

Draw histogram
> hist(AGGR$Annual.Salary, main = "Histogram",
+     xlab = "Salary")
>

Write to csv file
> write.csv(AGGR,"survey.csv",row.names = FALSE)
>


Try it!
- plot()
- boxplot()
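As a starting point for the two functions above, here is a minimal sketch on simulated data. The groups, values, and seed are made up for illustration and are not part of the salary data set.

```r
# Simulated data: two hypothetical groups of 50 values each.
set.seed(1)
group <- rep(c("A", "B"), each = 50)
value <- c(rnorm(50, mean = 100, sd = 5), rnorm(50, mean = 110, sd = 5))

# plot() on a single numeric vector draws the values against their index.
plot(value, main = "Scatter plot", xlab = "Index", ylab = "Value")

# boxplot() with a formula draws one box per group.
# It also returns its summary statistics invisibly.
box_stats <- boxplot(value ~ group, main = "Boxplot",
                     xlab = "Group", ylab = "Value")
box_stats$n   # number of observations in each box
```

The same calls should work on AGGR$Annual.Salary in place of the simulated value vector.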


Exercise 3
> install.packages("swirl")
> library(swirl)
> install_course_github("shiyoubun","CDL-TW_BDA-R")
> swirl()
>



By Chen Hsiang-wen