VKLÁDÁNÍ A ČIŠTĚNÍ DAT, ZJIŠŤOVÁNÍ ZÁKLADNÍCH INFORMACÍ O DATOVÉM SOUBORU

Vít Gabrhel

vit.gabrhel@mail.muni.cz

FSS MU,

2. 10. 2017

Harmonogram

0. Rekapitulace předchozí hodiny

1. Importování dat do R

2. Čištění dat

3. Popisné statistiky

Rekapitulace

Balíčky (dle Quick-R, n.d.)

Packages are collections of R functions, data, and compiled code in a well-defined format.

The directory where packages are stored is called the library.
R comes with a standard set of packages.
Others are available for download and installation.
- Once installed, they have to be loaded into the session to be used.

# get library location

.libPaths()

# see all packages installed
library()

# see packages currently loaded
search()

# nainstaluje konkrétní balíček
install.packages("psych")

# načte konkrétní balíček
library("psych")

Import dat

Obecně

# Zjištění pracovní složky (get working directory)

getwd()

# Nastavení pracovní složky (set working directory)

setwd(".../Data")

nebo

setwd("C:...\\Data")

Import dat

Flat Files (= Prostý dabázový soubor)

= Jednoduchá databáze (většinou tabulka) uložená v textovém souboru ve formě prostého textu (Prostý databázový soubor, n.d.)

.csv (comma-separated values)
.txt

Existence celé řady balíčků odlišených podle preferovaného formátu (.csv, .txt) a míry automatizace (resp. počtu argumentů, které je třeba specifikovat).

Součástí R je balíček "utils":

read.table(sep = "")
read.csv(sep = )
read.csv2(sep = ";")
read.delim(sep = "\t")

?read.table

Import dat

Flat Files - Utils - .csv

# Import Manpower.csv:
Manpower = read.csv("Manpower.csv")

# Print the structure of Manpower

str(Manpower)

# Import Manpower.csv correctly: Manpower2

Manpower2 = read.csv("Manpower.csv", stringsAsFactors = FALSE)

# Check the structure of pools

str(Manpower2)

Import dat

Flat Files - Utils - .txt

Naval_1 = read.delim("Naval_1.txt", header = TRUE)

Naval_2 = read.delim("Naval_2.txt", header = FALSE, col.names = c("Country", "ISO3", "Rank", "Total Naval Assets", "Aircraft Carriers", "Frigates", "Destroyers", "Corvettes", "Submarines", "Patrol Craft", "Mine Warfare Vessels"))

summary(Naval_1)

str(Naval_1)

# Select the country with the least naval assets: Total_Naval_Assets
Total_Naval_Assets <- Naval_1[which.min(Naval_1$Total.Naval.Assets), ]

# Select the country with the most submarines: Submarines
Submarines = Naval_1[which.max(Naval_1$Submarines), 9]

Import dat

Excel - readxl

# Instalace a nahrání balíčku

install.packages("readxl")

library("readxl")

# Dva základní příkazy:

excel_sheets() # Výčet listů v daném excelovském (.xls, .xlsx) souboru

read_excel() # Načtení souboru excelovského formátu

excel_sheets("Resources.xlsx")

Import dat

Excel - readxl

# Read the first sheet of Resources.xlsx:
Population = read_excel("Resources.xlsx", sheet = "Population")

View(Population)

# Read the second sheet of Resources.xlsx:
Airports = read_excel("Resources.xlsx", sheet = 2)

View(Airports)

# Put Population and Airports in a list:
Resources_List = list(Population, Airports)

View(Resources_List)

Import dat

Excel - readxl - col_names

Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names.

# Import the the third Excel sheet of Resources.xlsx (R gives names):
Population2 = read_excel("Resources.xlsx", sheet = 3, col_names = FALSE)
View(Population2)

# Import the the third Excel sheet of Resources.xlsx (specify col_names):
Population3= read_excel("Resources.xlsx", sheet = 3, col_names = c("Country", "ISO3", "Rank", "Population"))
View(Population3)

# Print the summary of Population2

summary(Population2)

# Print the summary of Population3

summary(Population3)

Import dat

Excel - readxl - skip

Another argument that can be very useful when reading in Excel files that are less tidy, is skip.

With skip, you can tell R to ignore a specified number of rows inside the Excel sheets you're trying to pull data from.

Have a look at this example:

Airports2 = read_excel("Resources.xlsx", sheet = 4, skip = 15)

In this case, the first 15 rows in the first sheet of "data.xlsx" are ignored.

Pozor na posunutí matice!

Airports3 = read_excel("Resources.xlsx", sheet = 4, skip = 15, col_names = FALSE)

Import dat

Excel - readxl - slučování listů do jedné matice a chybějící hodnoty

Resources_all <- cbind(Population, Airports[-1:-3])

View(Resources_all)

# Argument [-1:-3] se týká prvních tří sloupců v rámci dané matice

# Remove all rows with NAs from latitude_all

Population_clean = na.omit(Population)

# Print out a summary of Population

summary(Population_clean)

Import dat

SPSS - foreign

# Balíček foreign (základní součást R)

library("foreign")

# K načtení dat z SPSS (.sav, .por) slouží příkaz read.spss()

Aby měla nahraná data povahu data frame, je nutné uvnitř příkazu read.spss() jako argument zadat "to.data.frame = TRUE"

# Načtení dat

demo_1 = read.spss("international.sav", to.data.frame = TRUE)

# Načtení několika prvních řádků

head(demo_1)

Import dat

SPSS - foreign

# Načtení dat

demo_2 = read.spss("international.sav", to.data.frame = TRUE, use.value.labels = FALSE)

# Načtení několika prvních řádků

head(demo_2)

Jak nastavit "value labels" z SPSS jako "factors" v R?

Skrze argument "se.value.labels" v rámci příkazu "read.spss()". Tento argument upřesňuje, zda mají být "value labels" konvertovány do R jako "factors".

Argument je "TRUE by default", výchozím stavem je tedy provedení výše uvedené konverze

Import dat

SPSS - foreign

# Summary demo_2$contint

summary(demo_2$contint)

class(demo_2$contint)

# Konverze demo_2$contint na faktor

demo_2$contint = as.factor(demo_2$contint)

# Summary demo_2$contint znovu

summary(demo_2$contint)

class(demo_2$contint)

Jak nastavit "value labels" z SPSS jako "factors" u dílčích proměnných v R?

Jak nastavit "value labels" z SPSS u "factors" v R u dílčích proměnných?

continents = c("Africa", "Americas", "Asia", "Europe")
demo_2$contint = factor(demo_2$contint, levels = c(1, 2, 3, 4), labels = continents)

summary(demo_2$contint)

Čištění dat

Explorace hrubých dat - base

# Matice

bmi_1 = read_excel("bmi.xlsx", sheet = 2)

# Check the class of bmi

class(bmi_1)

# Check the dimensions of bmi

dim(bmi_1)

# View the column names of bmi

colnames(bmi_1)

# Struktura dat

str(bmi_1)

# Sumarizace

summary(bmi_1)

# Prvních 10 a posledních 10 řádků

head(bmi_1, n = 10)

tail(bmi_1, n = 10)

Čištění dat

Explorace hrubých dat - psych

# Load psych

install.packages("psych")

library("psych")

# Check the structure of bmi, the psych way

describe(bmi_1)

Čištění dat

Explorace hrubých dat - grafy

# Matice

bmi_2 = read_excel("bmi.xlsx", sheet = 3)

bmi_all = cbind(bmi_1, bmi_2[-1])

# Histogram

hist(bmi_1$BMI_1980)

# Scatterplot

plot(bmi_all$BMI_1980, bmi_all$BMI_2000)

Čištění dat

Příprava dat pro analýzu

# Preview Infrastructure with str()

str(Infrastructure)

# Coerce Country to character

Infrastructure$Country <- as.character(Infrastructure$Country)

# Coerce Rank to factor

Infrastructure$Rank <- as.character(Infrastructure$Rank)

# Look at Infrastructure once more with str()

str(Infrastructure)

Infrastructure= read.csv2("Infrastructure.csv")

Čištění dat

Příprava dat pro analýzu - dílčí manipulace se strings

# Load the stringr package

install.packages("stringr")

library("stringr")

# Trim all leading and trailing whitespace

name = c(" Filip ", "Nick ", " Jonathan")

str_trim(name)

# Pad these strings with leading zeros

pad = c("23485W", "8823453Q", "994Z")

str_pad(pad, width = 9, side = "left", pad = "0")

# Print state abbreviations

Population$country

# Make states all uppercase and save result to states_upper

states_upper = toupper(Population$Country)
states_upper

# Make states_upper all lowercase again

states_lower = tolower(Population$Country)
states_lower

Čištění dat

Příprava dat pro analýzu - dílčí manipulace se strings

# Look at the head of Infrastructure

head(Infrastructure)

# Detect all "Republic" in Country

str_detect(Infrastructure$Country, "Republic")

# In the Country column, replace "Republic" with "R"...

Infrastructure$Country <- str_replace(Infrastructure$Country, "Republic", "R")

Čištění dat

Příprava dat pro analýzu - missing values

# Call is.na() on the full social_df to spot all NAs

is.na(social_df)

# Use the any() function to ask whether there are any NAs in the data

any(is.na(social_df))

# View a summary() of the dataset

summary(social_df)

# Call table() on the status column

table(social_df$status)

# Replace all empty strings in status with NA

social_df$status[social_df$status == ""] <- NA

# Print social_df to the console
social_df

# Use complete.cases() to see which rows have no missing values

complete.cases(social_df)

# Use na.omit() to remove all rows with any missing values

na.omit(social_df)

name = c("Jerry", "Beth", "Rick", "Morty")
n_friends = c(0, NA, NA, 2)
status = c("Listening to human music", "Happy Family", "Garage", "")

social_df = data.frame(cbind(name, n_friends, status))

Zdroje

Packages (n.d.) Packages. In Quick-R. Staženo dne 2. 10. 2016 z http://www.statmethods.net/interface/packages.html

Prostý databázový soubor. (n.d.). In Wikipedia. Staženo dne 2. 10. 2016 z https://cs.wikipedia.org/wiki/Prost%C3%BD_datab%C3%A1zov%C3%BD_soubor

PSY532, PSY232 - 3. VKLÁDÁNÍ A ČIŠTĚNÍ DAT, ZJIŠŤOVÁNÍ ZÁKLADNÍCH INFORMACÍ O DATOVÉM SOUBORU

By Vít Gabrhel

PSY532, PSY232 - 3. VKLÁDÁNÍ A ČIŠTĚNÍ DAT, ZJIŠŤOVÁNÍ ZÁKLADNÍCH INFORMACÍ O DATOVÉM SOUBORU

1,310

VKLÁDÁNÍ A ČIŠTĚNÍ DAT, ZJIŠŤOVÁNÍ ZÁKLADNÍCH INFORMACÍ O DATOVÉM SOUBORU

Harmonogram

Rekapitulace

Import dat

Import dat

Import dat

Import dat

Import dat

Import dat

Import dat

Import dat

Import dat

Import dat

Import dat

Import dat

Čištění dat

Čištění dat

Čištění dat

Čištění dat

Čištění dat

Čištění dat

Čištění dat

Zdroje

PSY532, PSY232 - 3. VKLÁDÁNÍ A ČIŠTĚNÍ DAT, ZJIŠŤOVÁNÍ ZÁKLADNÍCH INFORMACÍ O DATOVÉM SOUBORU

More from Vít Gabrhel