COMPARING TWO MEANS AND SIMPLE ANALYSIS OF ASSOCIATION

Vít Gabrhel

vit.gabrhel@mail.muni.cz

FSS MU

9 October 2017

Schedule

0. Recap of the previous lesson

1. Descriptive statistics - additions

2. Comparing two means

3. Chi-squared

4. Correlation

Recap

Script

# What class do the two variables belong to?

class(alco_1$Country)
class(alco_1$Litry)

lapply(Alco, class)

# Change this value to NA

alco_1$Litry[alco_1$Litry == "-99"] <- NA

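# Alternative with stringr: the pattern must be a string, and NA_character_
# replaces the whole matching string with NA (the result is a character vector)
library("stringr")
Alco$Litry <- str_replace(Alco$Litry, "^-99(\\.0+)?$", NA_character_)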

Alco[46,2] = NA

# One of the values is evidently recorded incorrectly. Which value is it?

chyby = subset(alco, subset = (Litry < 0))

# In this new matrix, write all countries in uppercase.

Alco_2[, "Stát"] = toupper(Alco_2[, "Stát"])

Descriptive statistics

Additional options

setwd() # supply the path to the folder that contains the data files

library("readxl")

talent_scores_sheets = excel_sheets("talent_scores.xlsx")
talent_scores = read_excel("talent_scores.xlsx", sheet = 1)

# Compute the mean of the scores for each student individually
rowMeans(talent_scores[, 2:6])

# Compute the mean of the scores for each course individually
colMeans(talent_scores[, 2:6])

# Compute the total score each student gained across all courses
rowSums(talent_scores[, 2:6])

# Compute the total score that is gained by the students on each course
colSums(talent_scores[, 2:6])

Descriptive statistics

Additional options

wm = read.csv2("wm.csv", header = TRUE)

mean(wm$gain) # function: computes the arithmetic mean
mean(wm$gain, na.rm = TRUE) # function: computes the arithmetic mean, ignoring missing values
median(wm$gain) # function: computes the median
var(wm$gain) # function: computes the variance
sd(wm$gain) # function: computes the standard deviation
min(wm$gain) # function: returns the minimum
max(wm$gain) # function: returns the maximum

# Summary statistics for all variables - 5 digits
summary(wm, digits = 5)

# Summary statistics for all variables - 10 digits
summary(wm, digits = 10)

Descriptive statistics

Additional options

library("dplyr")

# Calculate summary statistics for variables whose names contain "ai", reported to 4 significant digits
summary(select(wm, contains("ai")), digits = 4)

# Alternatively, the numSummary() function can be used to obtain summary statistics. It reports:

mean = the mean
sd = the standard deviation
IQR = the interquartile range
0% = the minimum
25% = the first (lower) quartile
50% = the median
75% = the third (upper) quartile
100% = the maximum
n = the number of observations

library("Rcmdr")
numSummary(wm$gain)

library("Hmisc")
describe(wm)
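
The statistics that numSummary() reports can also be obtained with base R alone. The following is a minimal sketch, assuming the built-in mtcars$mpg vector as a stand-in for wm$gain (the wm data are not reproduced here); base_summary is an illustrative name only.

```r
# Illustrative data: the built-in mtcars$mpg vector stands in for wm$gain
x <- mtcars$mpg

# Collect the same statistics that the numSummary() slide lists
base_summary <- c(
  mean = mean(x, na.rm = TRUE),
  sd   = sd(x, na.rm = TRUE),
  IQR  = IQR(x, na.rm = TRUE),
  quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
  n    = sum(!is.na(x))
)

round(base_summary, 3)
```

The quantile() call keeps its own names (0%, 25%, ...), so the result can be read the same way as the numSummary() output.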

Correlation

Introduction (after Pearson product-moment correlation coefficient, n.d.)

Pearson product-moment correlation coefficient

Assumptions:

Variables measured at least at the interval level
Normally distributed data
Homoscedasticity
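
To make the coefficient concrete, the sketch below computes Pearson's r by hand as the covariance divided by the product of the standard deviations and compares it with cor(). It assumes two numeric columns of the built-in mtcars data set as illustrative data; the names r_manual and r_builtin are not from the lecture. The Shapiro-Wilk test is shown only as one possible rough check of the normality assumption.

```r
# Illustrative data: two numeric columns from the built-in mtcars data set
x <- mtcars$wt
y <- mtcars$mpg

# Pearson's r: covariance of x and y divided by the product of their standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same coefficient from the built-in function
r_builtin <- cor(x, y, method = "pearson")

# One possible rough check of normality for each variable
shapiro.test(x)
shapiro.test(y)
```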

Correlation

base

# List the variable names
names(talent_scores)

# Create a subset of the data frame talent_scores, called talent_selected, containing reading, english and math (in that order)
talent_selected <- subset(talent_scores, select = c(reading, english, math))

# Checking the assumptions

hist(talent_selected$english, main = "Histogram for English scores", xlab = "English score", border = "blue", col = "green", xlim = c(0, 120), breaks = 20)

plot(talent_selected$english, talent_selected$math, main = "Scatterplot of Grades", xlab = "English", ylab = "Math", pch = 19)

qqnorm(talent_selected$math)

Correlation

base

# Compute the correlations among reading, english and math
cor(talent_selected)

# The cor() function does not calculate p-values to test for significance, but the cor.test() function does.
# cor.test() has no `use` argument; it drops incomplete pairs on its own.
cor.test(talent_selected$english, talent_selected$reading)
cor.test(talent_selected$reading, talent_selected$math)
cor.test(talent_selected$english, talent_selected$math)

Correlation

Rcmdr

# The rcorr.adjust() function of the Rcmdr package computes the correlations along with their pairwise p-values.

library("Rcmdr")

# Two types of p-values are computed: the ordinary p-values and the adjusted p-values.

?rcorr.adjust

rcorr.adjust(talent_selected)

# Test the significance of the correlation between `english` and `math`
cor.test(talent_selected$english, talent_selected$math)

Comparing two means (after Conway, n.d.)

Dependent t-test - introduction

Assumptions:

The sampling distribution is normally distributed. In the dependent t-test this means that the sampling distribution of the differences between scores should be normal, not the scores themselves.
Data are measured at least at the interval level.
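
The point about differences can be illustrated directly: a dependent t-test is the same as a one-sample t-test on the pairwise differences, so normality is checked on those differences. The sketch below assumes the built-in sleep data set (10 subjects measured under two drugs) as illustrative data; it is not the wm data used in the lecture, and the object names are made up for the example.

```r
# Illustrative data: the built-in sleep data set (same 10 subjects, two drugs)
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# Pairwise differences; rows are ordered by subject ID within each group
diffs <- g2 - g1

# Normality is checked on the differences, not on the raw scores
shapiro.test(diffs)

# The paired test and the one-sample test on the differences agree
paired_res  <- t.test(g2, g1, paired = TRUE)
onesamp_res <- t.test(diffs, mu = 0)
paired_res
```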

Comparing two means

Dependent t-test - base - arguments

# Data

wm_t <- subset(wm, wm$train == "1")

# In the case of our dependent t-test, we need to specify these arguments to t.test():

?t.test

# x: Column of wm_t containing post-training intelligence scores
# y: Column of wm_t containing pre-training intelligence scores
# paired: Whether we're doing a dependent (i.e. paired) t-test or an independent t-test. In this example, it's TRUE
# Note that t.test() carries out a two-sided t-test by default

Comparing two means

Dependent t-test - base - code

# Conduct a paired t-test using the t.test function
t.test(wm_t$post, wm_t$pre, paired = TRUE)

Output:

Paired t-test

data: wm_t$post and wm_t$pre
t = 14.492, df = 79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.008511 3.966489
sample estimates:
mean of the differences
3.4875

Comparing two means (after Conway, n.d.)

Dependent t-test - Cohen's d

Comparing two means

Dependent t-test - Cohen's d - lsr - arguments

library("lsr")

# For cohensD(), we'll need to specify three arguments:

# x: Column of wm_t containing post-training intelligence scores
# y: Column of wm_t containing pre-training intelligence scores
# method: Version of Cohen's d to compute, which should be "paired" in this case

?cohensD()

Comparing two means

Dependent t-test - Cohen's d - lsr - output

# Calculate Cohen's d
cohensD(wm_t$post, wm_t$pre, method = "paired")

[1] 1.620297
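
The value above is consistent with the paired t-test output shown earlier: 14.492 / sqrt(80) is approximately 1.62, which is what one gets when the mean difference is standardized by the standard deviation of the differences. The sketch below shows that calculation in base R; it assumes the built-in sleep data set as illustrative data, and d_paired and d_from_t are hypothetical names.

```r
# Illustrative data: the built-in sleep data set (same 10 subjects, two drugs)
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]
diffs <- g2 - g1

# Cohen's d for paired data: mean difference in units of the SD of the differences
d_paired <- mean(diffs) / sd(diffs)

# Equivalent route through the paired t statistic: d = t / sqrt(n)
t_stat <- unname(t.test(g2, g1, paired = TRUE)$statistic)
d_from_t <- t_stat / sqrt(length(diffs))

d_paired
```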

Comparing two means

Dependent t-test - Cohen's d - effsize - arguments

library("effsize")

cohen.d(x, y, pooled=TRUE, paired=TRUE,
na.rm=FALSE, hedges.correction=FALSE,
conf.level=0.95, noncentral=FALSE)

?cohen.d()

Comparing two means

Dependent t-test - Cohen's d - effsize - example

library("effsize")

cohen.d(wm_t$post, wm_t$pre, pooled = TRUE, paired = TRUE,
na.rm = FALSE, hedges.correction = FALSE,
conf.level = 0.95, noncentral = FALSE)

Comparing two means (after Conway, n.d.)

Independent t-test - introduction

Assumptions:

The sampling distribution is normally distributed.
Data are measured at least at the interval level.
Homogeneity of variance.
Scores are independent (because they come from different people).
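
When homogeneity of variance is doubtful, t.test() with var.equal = FALSE applies the Welch correction. The sketch below recomputes the Welch statistic and its degrees of freedom by hand and compares them with t.test(). It assumes the built-in ToothGrowth data set (two supplement groups) as illustrative data, and var.test() is used here only as one base R option for comparing two variances.

```r
# Illustrative data: tooth length by supplement type from the built-in ToothGrowth data set
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# One base R option for comparing the two group variances (F test)
var.test(oj, vc)

# Welch statistic by hand: squared standard errors of the two means
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)
t_manual <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite degrees of freedom
df_manual <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Built-in versions: Welch (unequal variances) and Student (equal variances)
welch   <- t.test(oj, vc, var.equal = FALSE)
student <- t.test(oj, vc, var.equal = TRUE)
```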

Comparing two means

Independent t-test - data

# View the wm_t dataset
wm_t

# Create subsets for each training time
wm_t08 <- subset(wm_t, subset = (wm_t$cond == "t08"))
wm_t12 <- subset(wm_t, subset = (wm_t$cond == "t12"))
wm_t17 <- subset(wm_t, subset = (wm_t$cond == "t17"))
wm_t19 <- subset(wm_t, subset = (wm_t$cond == "t19"))

# Summary statistics for the change in training scores before and after training
describe(wm_t08)
describe(wm_t12)
describe(wm_t17)
describe(wm_t19)

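# Create a boxplot of the different training times
library("ggplot2")
ggplot(wm_t, aes(x = cond, y = gain, fill = cond)) + geom_boxplot()

# Levene's test of homogeneity of variance (leveneTest() comes from the car package)
library("car")
leveneTest(wm_t$gain ~ wm_t$cond)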

Comparing two means

Independent t-test - base

# Conduct an independent t-test
t.test(wm_t19$gain, wm_t08$gain, var.equal = FALSE)

Welch Two Sample t-test

data: wm_t19$gain and wm_t08$gain
t = 8.9677, df = 34.248, p-value = 1.647e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.287125 5.212875
sample estimates:
mean of x mean of y
5.60 1.35

Comparing two means (after Conway, n.d.)

Independent t-test - Cohen's d

Comparing two means

Independent t-test - effsize

# Calculate Cohen's d

cohen.d(wm_t19$gain, wm_t08$gain,pooled=TRUE,paired=FALSE,
na.rm=FALSE, hedges.correction=FALSE,
conf.level=0.95,noncentral=FALSE)

Cohen's d

d estimate: 2.835822 (large)
95 percent confidence interval:
inf sup
1.893561 3.778083
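
For two independent groups, Cohen's d with a pooled standard deviation can be written out in a few lines of base R. The estimate above is consistent with this definition: with 20 observations per group, 8.9677 * sqrt(1/20 + 1/20) is approximately 2.836. The sketch assumes the built-in ToothGrowth data set as illustrative data, and sd_pooled, d_pooled and d_from_t are hypothetical names.

```r
# Illustrative data: tooth length by supplement type from the built-in ToothGrowth data set
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
n1 <- length(oj)
n2 <- length(vc)

# Pooled standard deviation of the two groups
sd_pooled <- sqrt(((n1 - 1) * var(oj) + (n2 - 1) * var(vc)) / (n1 + n2 - 2))

# Cohen's d: difference in means in units of the pooled SD
d_pooled <- (mean(oj) - mean(vc)) / sd_pooled

# Cross-check through the equal-variance t statistic: d = t * sqrt(1/n1 + 1/n2)
t_student <- unname(t.test(oj, vc, var.equal = TRUE)$statistic)
d_from_t <- t_student * sqrt(1 / n1 + 1 / n2)

d_pooled
```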

Chi-squared (after Pearson's chi-squared test, n.d.)

Introduction

Assumptions:

No more than 20% of the cells in the contingency table have an expected count below 5
A nonzero count in every cell of the contingency table
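
Both conditions can be checked before running the test by looking at the observed table and at the expected counts that chisq.test() returns. The following is a minimal sketch, assuming a cylinders by transmission table from the built-in mtcars data set as illustrative data; share_below_5 and any_zero_cell are hypothetical names.

```r
# Illustrative data: cylinders by transmission type from the built-in mtcars data set
tab <- table(mtcars$cyl, mtcars$am)

# suppressWarnings() silences the small-expected-count warning chisq.test() may raise
test <- suppressWarnings(chisq.test(tab))

# Expected counts under independence: row total * column total / grand total
expected <- test$expected

# Share of cells with an expected count below 5, and whether any observed cell is empty
share_below_5 <- mean(expected < 5)
any_zero_cell <- any(tab == 0)

expected
share_below_5
any_zero_cell
```

If share_below_5 exceeds 0.2, merging sparse categories (as the Edu2 variable does for education) or using fisher.test() are two common ways to proceed.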

Chi-squared

Data and gmodels

# Data

gedu_sheets = excel_sheets("gedu.xlsx")
gedu = read_excel("gedu.xlsx", sheet = 1)
gedu$Gender = as.factor(gedu$Gender)
gedu$Edu = as.factor(gedu$Edu)
gedu$Edu2 = as.factor(gedu$Edu2)
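# Level labels are kept in Czech: Muž = male, Žena = female
levels(gedu$Gender) = c("Muž", "Žena")
# ZŠ = primary school, SŠ bez maturity = secondary school without the school-leaving exam, SŠ s maturitou = secondary school with the school-leaving exam, VŠ = university
levels(gedu$Edu) = c("ZŠ", "SŠ bez maturity", "SŠ s maturitou", "VŠ")
# Nižší než VŠ = below university, VŠ = university
levels(gedu$Edu2) = c("Nižší než VŠ", "VŠ")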

# gmodels

library("gmodels")

?CrossTable()

Chi-squared

Contingency tables

# Generate a cross table of gender and education
Gedu_CT_01 <- CrossTable(gedu$Edu, gedu$Gender)

# Generate a crosstable for gender and education in which only the results for the chi-square test are included, and the row proportions.
Gedu_CT_02 = CrossTable(gedu$Edu, gedu$Gender, prop.c = FALSE, prop.t = FALSE, chisq = TRUE, prop.chisq = FALSE)

# Generate a cross table of gender and education in SPSS format
Gedu_CT_03 = CrossTable(gedu$Edu, gedu$Gender, format = "SPSS")

Chi-squared

Effect size - phi (after Phi coefficient, n.d.)

library("psych")

Gen = gedu$Gender
Edu2 = gedu$Edu2

table_phi = table(Gen, Edu2)

phi(table_phi, digits = 2)
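
For a 2 x 2 table the phi coefficient can also be computed by hand, either from the four cell counts or from the chi-squared statistic without continuity correction. The sketch below assumes a 2 x 2 table from the built-in mtcars data set as illustrative data (not the gedu data); the cell names a, b, c_, d follow the usual textbook layout.

```r
# Illustrative data: engine shape by transmission type from the built-in mtcars data set
tab <- table(mtcars$vs, mtcars$am)

# Cell counts of the 2 x 2 table, converted to numeric to avoid integer overflow
a  <- as.numeric(tab[1, 1])
b  <- as.numeric(tab[1, 2])
c_ <- as.numeric(tab[2, 1])
d  <- as.numeric(tab[2, 2])

# Phi from the cell counts (signed)
phi_manual <- (a * d - b * c_) / sqrt((a + b) * (c_ + d) * (a + c_) * (b + d))

# Phi from the uncorrected chi-squared statistic (unsigned): sqrt(chi2 / n)
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
phi_from_chi2 <- sqrt(chi2 / sum(tab))

phi_manual
```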

Chi-squared

Effect size - Cramér's V (after Cramér's V, n.d.)

library("lsr") # cramersV() comes from the lsr package

Gen = gedu$Gender
Edu = gedu$Edu

table_CV = table(Gen, Edu)

cramersV(table_CV)
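
Cramér's V generalizes phi to tables larger than 2 x 2 by rescaling the chi-squared statistic. The following is a minimal base R sketch of that formula; cramers_v_manual() is a hypothetical helper written for this example (not a function from lsr or psych), and the mtcars tables are illustrative data only.

```r
# Hypothetical helper: Cramér's V from a contingency table
# V = sqrt(chi2 / (n * (min(rows, cols) - 1))), using the uncorrected chi-squared statistic
cramers_v_manual <- function(tab) {
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  sqrt(chi2 / (n * (k - 1)))
}

# Illustrative data: a 3 x 2 table and a 2 x 2 table from the built-in mtcars data set
tab_3x2 <- table(mtcars$cyl, mtcars$am)
tab_2x2 <- table(mtcars$vs, mtcars$am)

v_3x2 <- cramers_v_manual(tab_3x2)
v_2x2 <- cramers_v_manual(tab_2x2)

# For a 2 x 2 table, V equals the absolute value of phi
phi_2x2 <- sqrt(unname(chisq.test(tab_2x2, correct = FALSE)$statistic) / sum(tab_2x2))
```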

References

Conway, A. (n.d.). Intro to Statistics with R: Student's T-test. Available online at: https://www.datacamp.com/courses/intro-to-statistics-with-r-students-t-test

Cramér's V (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V

Effect size (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Effect_size

Pearson's chi-squared test (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

Pearson product-moment correlation coefficient (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Phi coefficient (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Phi_coefficient

Sampling distribution (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Sampling_distribution

Standard error (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Standard_error

Student's t-test (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Student%27s_t-test