Data Cleaning 資料清理

陳達泓 Denny

denny20700@gmail.com

2019.12.02

使用tidyr整理資料

資料分析流程:

Let's Go!

什麼是整齊的資料 tidy data ?

1. 每個變數(variable)都必須要有自己的資料欄

2. 每個觀察(observation)都必須要有自己的資料列

3. 每個值(value)都必須要有自己的資料格

這次我們首先會用到 gather 以及 spread

#呼叫
library(tidyr) ; # require(tidyr)
#他也同時是tidyverse這個套件的一員，我們呼叫tidyverse就好
library(tidyverse)

在未呼叫套件時使用函式

#指定套件呼叫函式
tidyr::gather() ; tidyr::spread()

1. 找出變數與觀察值在哪裡

2. 解決兩個常見的問題之一:

一個變數有可能分散到多個資料欄

一個觀察有可能散布在多個資料列

猜猜哪一個要用到gather and spread

Answer click me

大家可以先在 RStudio 上面輸入

🔻

cat("table1  \n") ; table1  
cat("table2  \n") ; table2
cat("table3  \n") ; table3
cat("table4a \n") ; table4a
cat("table4b \n") ; table4b 
cat("table5  \n") ; table5

> cat("table1  \n") ; table1
table1
# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
> cat("table2  \n") ; table2
table2
# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
1 Afghanistan  1999 cases             745
2 Afghanistan  1999 population   19987071
3 Afghanistan  2000 cases            2666
4 Afghanistan  2000 population   20595360
5 Brazil       1999 cases           37737
6 Brazil       1999 population  172006362
7 Brazil       2000 cases           80488
8 Brazil       2000 population  174504898
9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
> cat("table3  \n") ; table3
table3
# A tibble: 6 x 3
  country      year rate            
* <chr>       <int> <chr>          
1 Afghanistan  1999 745/19987071    
2 Afghanistan  2000 2666/20595360  
3 Brazil       1999 37737/172006362
4 Brazil       2000 80488/174504898
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
> cat("table4a \n"); table4a
table4a
# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
> cat("table4b \n"); table4b
table4b
# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583
> cat("table5  \n") ; table5
table5
# A tibble: 6 x 4
  country     century year  rate            
* <chr>       <chr>   <chr> <chr>          
1 Afghanistan 19      99    745/19987071    
2 Afghanistan 20      00    2666/20595360  
3 Brazil      19      99    37737/172006362
4 Brazil      20      00    80488/174504898
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

Gathering 聚集

寬資料轉長資料

tidyr::gather()

Usage
gather(data, key = "key", value = "value", ..., na.rm = FALSE,
  convert = FALSE, factor_key = FALSE)

Argument
key. value : Names of new key and value columns, as strings or symbols.

key:就是你想把原本是colnames的變成values的那一欄名稱

value:原本column對應的值所產生的新欄位名稱

key

value

table4a
# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

tidyr::gather()

Usage
gather(data, key = "key", value = "value", ..., na.rm = FALSE,
  convert = FALSE, factor_key = FALSE)

gather(table4a, `1999`, `2000`, key = "Year", value = "cases")
gather(table4a, "key", "value", -country)

# A tibble: 6 x 3
  country     Year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766

key:就是你想把原本是colnames的變成values的那一欄名稱

value:原本column對應的值所產生的新欄位名稱

table4a
# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Spreading 分散

長資料轉寬資料

tidyr::spread()

Usage
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

Argument
key. value : Column names or positions. This is passed to tidyselect::vars_pull().

key: 含有應該要是變數名稱的資料欄

value: 來自多個變數的值的資料欄

> table2
# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

key

value

tidyr::spread()

Usage
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

key: 含有應該要是變數名稱的資料欄

value: 來自多個變數的值的資料欄

> table2
# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

spread(table2, key = type, value = count)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Exercises

people <- tribble(
  ~name,              ~key,      ~value,
  "Phillip Woods",    "age",     45,
  "Phillip Woods",    "height",  185,
  "Phillip Woods",    "age",     50,
  "Jessica Cordero",  "age",     37,
  "Jessica Cordero",  "height",  156,
)

這個tibble應該要gather 還是 spread?

有沒有辦法執行? why? 怎麼解決?

Solutions

people %>%　group_by(name) %>%
 distinct(key, .keep_all = T) %>%
  spread(key, value)

# A tibble: 2 x 3
# Groups:   name [2]
  name              age height
  <chr>           <dbl>  <dbl>
1 Jessica Cordero    37    156
2 Phillip Woods      45    185

drop duplicated or create a new column

Tips

people %>% group_by(name, key) %>%
  mutate(obs = row_number()) %>%
  spread(key, value)

# A tibble: 3 x 4
# Groups:   name [2]
  name              obs   age height
  <chr>           <int> <dbl>  <dbl>
1 Jessica Cordero     1    37    156
2 Phillip Woods       1    45    185
3 Phillip Woods       2    50     NA

Exercises

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes", NA, 10,
  "no", 20, 12
)

這個tibble應該要gather 還是 spread?

Solutions

gather(preg,male, female, key = "sex", value = "count")

# A tibble: 4 x 3
  pregnant sex    count
  <chr>    <chr>  <dbl>
1 yes      male      NA
2 no       male      20
3 yes      female    10
4 no       female    12

我們完成處理 table2 & table4 了!
但還有 table3。
table3 有什麼問題呢?
: 其中一個資料欄 (rate) 含有兩個變數(case & population)

table3
# A tibble: 6 x 3
  country      year rate            
* <chr>       <int> <chr>          
1 Afghanistan  1999 745/19987071    
2 Afghanistan  2000 2666/20595360  
3 Brazil       1999 37737/172006362
4 Brazil       2000 80488/174504898
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

case / population

這次我們首先會用到 separate 以及 unite

Separate 分離

一個資料欄拆成多個資料欄

Usage
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)

table3 %>% separate(col = rate, into = c("cases", "population"))
table3 %>% separate(rate, into = c("cases", "population"), sep = "/")

# A tibble: 6 x 4
  country      year cases  population
  <chr>       <int> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

table3
# A tibble: 6 x 3
  country      year rate            
* <chr>       <int> <chr>          
1 Afghanistan  1999 745/19987071    
2 Afghanistan  2000 2666/20595360  
3 Brazil       1999 37737/172006362
4 Brazil       2000 80488/174504898
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

tidyr::separate()

tidyr::separate()

Usage
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)

table3 %>% separate(year, into = c("century", "year"), sep = 2)

table3
# A tibble: 6 x 3
  country      year rate            
* <chr>       <int> <chr>          
1 Afghanistan  1999 745/19987071    
2 Afghanistan  2000 2666/20595360  
3 Brazil       1999 37737/172006362
4 Brazil       2000 80488/174504898
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

# A tibble: 6 x 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

Unite 合併

多個資料欄合成一個資料欄

tidyr::unite()

Usage
unite(data, col, ..., sep = "_", remove = TRUE)
Argument

col  The name of the new column, as a string or symbol.

...   A selection of columns. If empty, all variables are selected.

table5 %>% unite(new, century, year)

# A tibble: 6 x 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

# A tibble: 6 x 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 19_99 745/19987071     
2 Afghanistan 20_00 2666/20595360    
3 Brazil      19_99 37737/172006362  
4 Brazil      20_00 80488/174504898  
5 China       19_99 212258/1272915272
6 China       20_00 213766/1280428583

tidyr::unite()

Usage
unite(data, col, ..., sep = "_", remove = TRUE)
Argument

col  The name of the new column, as a string or symbol.

...   A selection of columns. If empty, all variables are selected.

table5 %>% unite(new, century, year, sep = "")

# A tibble: 6 x 4
  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

# A tibble: 6 x 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

Missing Value 遺失值

Missing Value

一個值有兩種缺失方式 :

1. 明確的: 以NA標示

2. 隱含的: 沒有出現在資資料中

stocks <- tibble(
  year  = c(rep(2015,4), rep(2016, 3)),
  qtr = c(1:4, 1:3),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

stocks %>% spread(year, return)
stocks %>% spread(year, return) %>% 
  gather(`2015`, `2016`, key = year, value = return)
stocks %>% spread(year, return) %>% 
  gather(`2015`, `2016`, key = year, value = return, na.rm = T)
stocks %>% complete(year, qtr)

treatment <- tribble(
  ~ person,           ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10, 
  NA,                 3,          9, 
  "Katherine Burke",  1,          4
)
treatment %>% fill(person)

na.rm : NA remove

complete() : 接受一組資料欄，並會找出所有獨特的組合，然後它會確保原本的資料集中含有所有的那些值，並在必要時填入NA。

fill()：他接受一組資料欄，其中妳想要以最近的非缺失值來取代遺失值（上一個觀測值取代）

補充:
spread 裡面的 fill 引數是指想要將NA填補為什麼

使用dplyr的關聯式資料

關連式資料 relational data ?

資料分析很少會涉及單一個資料表，通常你會有很多資料表，而我們必須挑選需要的欄位進行分析。

整體而言，多個資料表被稱為關聯式資料，因為重要的是在資料表之間的關聯。

Mutating joins(變動結合)　&　Filtering joins(過濾結合)

•inner_join

•left_join

•right_join

•full_join

•semi_join

•anti_join

Key 鍵值 : 連接每個資料表的變數

install.packages("nycflights13")
library(nycflights13)

cat("airlines \n") ; airlines
cat("airports \n") ; airports
cat("planes \n") ; planes
cat("weather \n") ; weather
cat("flights \n") ; flights

> cat("airlines \n") ; airlines
airlines 
# A tibble: 16 x 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         
> cat("airports \n") ; airports
airports 
# A tibble: 1,458 x 8
   faa   name                             lat    lon   alt    tz dst   tzone              
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>              
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/New_York   
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/Chicago    
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/Chicago    
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/New_York   
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/New_York   
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/New_York   
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/New_York   
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/New_York   
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/New_York   
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/Los_Angeles
# ... with 1,448 more rows
> cat("planes \n") ; planes
planes 
# A tibble: 3,322 x 9
   tailnum  year type                    manufacturer     model     engines seats speed engine   
   <chr>   <int> <chr>                   <chr>            <chr>       <int> <int> <int> <chr>    
 1 N10156   2004 Fixed wing multi engine EMBRAER          EMB-145XR       2    55    NA Turbo-fan
 2 N102UW   1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 3 N103US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 4 N104UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 5 N10575   2002 Fixed wing multi engine EMBRAER          EMB-145LR       2    55    NA Turbo-fan
 6 N105UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 7 N107US   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 8 N108UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
 9 N109UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
10 N110UW   1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214        2   182    NA Turbo-fan
# ... with 3,312 more rows
> cat("weather \n") ; weather
weather 
# A tibble: 26,115 x 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour          
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>  <dbl>    <dbl> <dbl> <dttm>             
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA      0    1012     10 2013-01-01 01:00:00
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA      0    1012.    10 2013-01-01 02:00:00
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA      0    1012.    10 2013-01-01 03:00:00
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA      0    1012.    10 2013-01-01 04:00:00
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA      0    1012.    10 2013-01-01 05:00:00
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA      0    1012.    10 2013-01-01 06:00:00
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0         NA      0    1012.    10 2013-01-01 07:00:00
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4         NA      0    1012.    10 2013-01-01 08:00:00
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0         NA      0    1013.    10 2013-01-01 09:00:00
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8         NA      0    1012.    10 2013-01-01 10:00:00
# ... with 26,105 more rows
> flights
flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest 
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH  
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH  
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA  
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN  
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL  
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD  
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL  
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD  
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO  
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD  
# ... with 336,766 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

鍵值分兩種:

1. 主鍵: 識別自己資料表中的一個觀察。
            planes$tailnum 唯一識別了planes資料表中的抹一架飛機。

2. 外鍵: 識別其他資料表中的觀察。
            flights$tailnum 出現在flights資料表中，將每一個航班與單一飛機配對。

Let's Go!

Relational Data

變動結合&過濾結合

Usage:
inner_join(x, y, by = NULL, ...)
full_join(x, y, by = NULL, ...)
left_join(x, y, by = NULL, ...)
right_join(x, y, by = NULL, ...)
anti_join(x, y, by = NULL, ...)
semi_join(x, y, by = NULL, ...)

x : lhs (left hand side)
y : rhs (right hand side)
by = .key

Mutating Joins

Filtering Joins

mutating joins

變動結合: 結合兩個資料表的變數

inner join: 只結合lhs rhs都有的
dplyr::inner_join()

left join: 只結合lhs有的
dplyr::left_join()

right join: 只結合rhs有的
dplyr::right_join()

full join: 結合rhs&lhs有出現的
dplyr::full_join()

Duplicated keys

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")

1. One table has duplicate keys.

2. Both tables have duplicate keys.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key")

filtering joins

過濾結合: 匹配結合兩個資料表的觀察

semi join: 保留lhs在rhs匹配的所有觀察
dplyr::semi_join()

anti join: 捨棄lhs在rhs匹配的所有觀察
dplyr::anti_join()

Set Operations

Usage :
intersect(x, y, ...)
union(x, y, ...)
union_all(x, y, ...)
setdiff(x, y, ...)
setequal(x, y, ...)

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)

intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1

# Note that we get 3 rows, not 4
union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     1
#> 3     1     1

setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     1

setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2

集合運算

本周作業~ Homework

請"一次"將iris資料集，以面向(facet)將品種分別作圖，並將其餘數值型變數以密度圖表示，並且以顏色區分四種數值型變數，並且印出標題為各自的系級姓名。

Example

Tidyr Dplyr 12.02 教材

By Chen Ta Hung