Tidy data

male female
2015 18 21
2016 22 19
avg weight at the age of 6 male female
2015 18 21
2016 22 19
male female
2015 18 21
2016 22 19
avg weight at the age of 6

1. clean
2. tidy
3. non-tidy data

data cleaning

blank values

  • unknown

  • not applicable

  • non existent

  • 0, "", []

open time

missing: contains uncertainty!

 

non existent: ok

 

2019-11-01: impossible

#legs

0: ok for snake

 

for tree: not applicable

 

-3: impossible

  • 0.5 vs 50%

  • 132 vs "132"

  • 105% open rate

  • "gmail.com" vs "gmail.com   "

data tidying

clean & flexible

Formal rules

  • column           variable

  • observation           row

  • table            variable type

  • tables are linked

? variable, ? observation

height

mobile phone number

compare observations

combine variables

send_day num_open num_click
ápr. 15. 1000 300
ápr. 16. 15000 500
send_day event_type number
ápr. 15. click 300
ápr. 15. open 1000
ápr. 16. click 500
ápr. 16. open 15000
ggplot(dt, aes(x = date, y = value, col = variable)) + 
    geom_point() + 
    geom_line() + 
    labs(x  = NULL, y = NULL) + 
    theme(legend.title = element_blank())
ggplot(dt, aes(x = date)) + 
    geom_point(aes(y = num_send), col = ems_colors[['green1']]) + 
    geom_line(aes(y = num_send), col = ems_colors[['green1']]) + 
    geom_point(aes(y = num_open), col = ems_colors[['blue1']]) + 
    geom_line(aes(y = num_open), col = ems_colors[['blue1']]) + 
    labs(x = NULL, y = NULL) + 
    geom_point(data = data.table('v' = c('num_send', 'num_open'), 
                                 'date' = as.Date('2017-03-01'),
                                 'y' = 2500),
               mapping = aes(col = v, y = y, shape = NA)) + 
    geom_line(data = data.table('v' = c('num_send', 'num_open'), 
                                 'date' = as.Date('2017-03-01'),
                                 'y' = 2500),
               mapping = aes(col = v, y = y, linetype = NA)) + 
    theme(legend.title = element_blank())
ggplot(dt, aes(x = date, y = num_open / num_send)) + 
    geom_point() + 
    geom_line() + 
    scale_y_continuous(labels = scales::percent_format()) + 
    labs(x = NULL, y = 'open rate')

gather
separate
spread

unite

send_day num_open num_click
ápr. 15. 1000 300
ápr. 16. 15000 500
send_day event_type number
ápr. 15. click 300
ápr. 15. open 1000
ápr. 16. click 500
ápr. 16. open 15000

gather

spread

user birth year spend
Catherine 1995 300
Jácint 1997 500
user century year spend
Catherine 19 95 300
Jácint 19 97 500

separate

unite

user demographic spend
Catherine us_1995 300
Jácint hu_1997 500
user language birth spend
Catherine us 1995 300
Jácint hu 1997 500

separate

unite

Non-tidy data

  • efficiency

  • history

graph
corpus
matrix

        data  

 

tools & usage

Made with Slides.com