COMPARING TWO MEANS AND SIMPLE ANALYSIS OF ASSOCIATION

Vít Gabrhel

vit.gabrhel@mail.muni.cz

FSS MU,

9. 10. 2017

Schedule

 0. Recap of the previous session

 1. Descriptive statistics - additions

 2. Comparing two means

 3. Chi-squared

 4. Correlation

Recap

Script

# What class do the two variables belong to?

class(alco_1$Country)
class(alco_1$Litry)

# Alternatively, check every column at once
lapply(Alco, class)

# Change this value to NA

alco_1$Litry[alco_1$Litry == -99] <- NA

# Alternatives: replace() yields a true missing value (not the string "NA"),
# or address the cell directly by row and column
Alco$Litry <- replace(Alco$Litry, Alco$Litry == -99, NA)
Alco[46, 2] <- NA

# One of the values is evidently recorded incorrectly. Which value is it?

chyby <- subset(alco, subset = (Litry < 0))

# In this new matrix, write all countries in capital letters.

Alco_2[, "Stát"] <- toupper(Alco_2[, "Stát"])
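The recap steps can be tried without the course data. The following is a minimal sketch, assuming a toy data frame `alco_demo` (a hypothetical name with made-up values) that uses -99 as its missing-value code.

```r
# Toy data standing in for the course data set (hypothetical values)
alco_demo <- data.frame(
  Country = c("Czechia", "Austria", "Poland"),
  Litry   = c(14.1, -99, 10.7),
  stringsAsFactors = FALSE
)

# Inspect the class of every column
lapply(alco_demo, class)

# Recode the sentinel value -99 to a true missing value
alco_demo$Litry[alco_demo$Litry == -99] <- NA

# Write all country names in capital letters
alco_demo$Country <- toupper(alco_demo$Country)

alco_demo
```

Assigning `NA` (rather than the string "NA") keeps `Litry` numeric, so later calls such as `mean(..., na.rm = TRUE)` still work.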

Descriptive statistics

Extended options

# Set the working directory first; setwd() needs the path to the data folder as its argument

library("readxl")

talent_scores_sheets = excel_sheets("talent_scores.xlsx")
talent_scores = read_excel("talent_scores.xlsx", sheet = 1)

 

# Compute the mean of the scores for each student individually
rowMeans(talent_scores[, 2:6])

 

# Compute the mean of the scores for each course individually
colMeans(talent_scores[, 2:6])

 

# Compute the score each student has gained for all his courses
rowSums(talent_scores[, 2:6])

 

# Compute the total score that is gained by the students on each course
colSums(talent_scores[, 2:6])

Descriptive statistics

Extended options

wm = read.csv2("wm.csv", header = TRUE)

 

mean(wm$gain) # function: computes the arithmetic mean
mean(wm$gain, na.rm = TRUE) # function: computes the arithmetic mean, ignoring missing values
median(wm$gain) # function: computes the median
var(wm$gain) # function: computes the variance
sd(wm$gain) # function: computes the standard deviation
min(wm$gain) # function: return the minimum
max(wm$gain) # function: return the maximum

 

# Summary statistics for all variables - 5 digits
summary(wm, digits = 5)

 

# Summary statistics for all variables - 10 digits
summary(wm, digits = 10)
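The effect of `na.rm` is easiest to see on a small vector. This sketch uses a toy vector `x` (hypothetical, not taken from `wm`) with one missing value.

```r
x <- c(2, 4, NA, 6)     # toy vector with one missing value

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 4: mean of 2, 4 and 6
sd(x, na.rm = TRUE)     # 2: standard deviation of 2, 4 and 6
```

The same `na.rm` argument is accepted by `median()`, `var()`, `sd()`, `min()` and `max()`.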

Descriptive statistics

Extended options

library("dplyr")

# Calculate summary statistics for variables containing "ai", to 4 significant digits
summary(select(wm, contains("ai")), digits = 4)

 

# Alternatively, the numSummary() function can be used to obtain summary statistics. The function computes:

  • mean = the mean
  • sd = the standard deviation
  • IQR = the interquartile range
  • 0% = the minimum
  • 25% = the 1st quartile (lower quartile)
  • 50% = the median
  • 75% = the 3rd quartile (upper quartile)
  • 100% = the maximum
  • n = the number of observations

library("Rcmdr")
numSummary(wm$gain)

 

library("Hmisc")
describe(wm)

Correlation

Introduction (after Pearson product-moment correlation coefficient, n.d.)

Pearson product-moment correlation coefficient

Assumptions:

  • Variables measured at least at the interval level
  • Normally distributed data
  • Homoscedasticity
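Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this against `cor()` on simulated data; the vectors `x` and `y` are illustrative assumptions, not course data.

```r
set.seed(1)                      # reproducible simulated data
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# Pearson's r: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

all.equal(r_manual, cor(x, y))   # TRUE
```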

Correlation

base

# Read the variables names
names(talent_scores)

 

# Create a subset of the dataframe talent, talent_selected, containing reading, english and math (in that order)
talent_selected <- subset(talent_scores, select = c(reading, english, math))

 

# Check the assumptions

hist(talent_selected$english, main = "Histogram for English scores", xlab = "English score", border = "blue", col = "green", xlim = c(0, 120), breaks = 20)

plot(talent_selected$english, talent_selected$math, main = "Scatterplot of Grades", xlab = "English", ylab = "Math", pch = 19)

 

qqnorm(talent_selected$math)

 

Correlation

base

# Compute the correlations among reading, english and math
cor(talent_selected)

# The cor() function does not calculate p-values to test for significance, but the cor.test() function does.
# cor.test() has no `use` argument; it drops incomplete pairs on its own.
cor.test(talent_selected$english, talent_selected$reading)
cor.test(talent_selected$reading, talent_selected$math)
cor.test(talent_selected$english, talent_selected$math)
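The object returned by `cor.test()` can be stored and queried. A minimal sketch on simulated scores follows; `english` and `math` here are made-up vectors, and one value is set to `NA` to show that the incomplete pair is dropped.

```r
set.seed(2)                            # reproducible simulated scores
english <- rnorm(30, mean = 70, sd = 10)
math    <- 0.6 * english + rnorm(30, sd = 8)
math[3] <- NA                          # one incomplete pair

res <- cor.test(english, math)         # Pearson correlation by default

res$estimate    # sample correlation r
res$p.value     # two-sided p-value
res$parameter   # degrees of freedom = complete pairs - 2, here 27
res$conf.int    # 95 % confidence interval for r
```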

Correlation

Rcmdr

# The rcorr.adjust() function of the Rcmdr package computes the correlations with the pairwise p-values among the correlations.

library("Rcmdr")

# Two types of p-values are computed: the ordinary p-values and the adjusted p-values.

?rcorr.adjust

rcorr.adjust(talent_selected)

# Test the significance of the correlation between `english` and `math`
cor.test(talent_selected$english, talent_selected$math)

Comparing two means (after Conway, n.d.)

Dependent t-test - introduction

Assumptions:

  • The sampling distribution is normally distributed. In the dependent t-test this means that the sampling distribution of the differences between scores should be normal, not the scores themselves.
  • Data are measured at least at the interval level.
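One common way to probe the first assumption is to look at the paired differences themselves. The sketch below does so on simulated data; `pre` and `post` are hypothetical vectors, and the Shapiro-Wilk test is used only as one possible check, not necessarily the one used in the course.

```r
set.seed(3)                                 # reproducible simulated data
pre   <- rnorm(40, mean = 10, sd = 2)
post  <- pre + rnorm(40, mean = 1, sd = 1)

diffs <- post - pre                         # the assumption concerns these differences

shapiro.test(diffs)                         # formal test of normality
qqnorm(diffs); qqline(diffs)                # visual check against the normal distribution
```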

Comparing two means

Dependent t-test - base - arguments

# Data

wm_t <- subset(wm, train == 1)

# In the case of our dependent t-test, we need to specify these arguments to t.test():

?t.test

# x: Column of wm_t containing post-training intelligence scores
# y: Column of wm_t containing pre-training intelligence scores
# paired: Whether we're doing a dependent (i.e. paired) t-test or an
#         independent t-test. In this example, it's TRUE
# Note that t.test() carries out a two-sided t-test by default

Comparing two means

Dependent t-test - base - code

# Conduct a paired t-test using the t.test function
t.test(wm_t$post, wm_t$pre, paired = TRUE)

Output:

Paired t-test

data:  wm_t$post and wm_t$pre
t = 14.492, df = 79, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.008511 3.966489
sample estimates:
mean of the differences
                 3.4875
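A paired t-test is a one-sample t-test on the differences between the two measurements. This sketch confirms the equivalence on simulated data; `pre` and `post` are hypothetical vectors.

```r
set.seed(4)                                # reproducible simulated data
pre  <- rnorm(25, mean = 10, sd = 2)
post <- pre + rnorm(25, mean = 1.5, sd = 1)

paired  <- t.test(post, pre, paired = TRUE)   # dependent t-test
one_smp <- t.test(post - pre, mu = 0)         # one-sample test on the differences

all.equal(unname(paired$statistic), unname(one_smp$statistic))  # TRUE
all.equal(paired$p.value, one_smp$p.value)                      # TRUE
```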

Comparing two means (after Conway, n.d.)

Dependent t-test - Cohen's d

Comparing two means

Dependent t-test - Cohen's d - lsr - arguments

library("lsr")

# For cohensD(), we'll need to specify three arguments:

# x: Column of wm_t containing post-training intelligence scores
# y: Column of wm_t containing pre-training intelligence scores
# method: Version of Cohen's d to compute, which should be "paired" in this case

?cohensD

Comparing two means

Dependent t-test - Cohen's d - lsr - output

# Calculate Cohen's d
cohensD(wm_t$post, wm_t$pre, method = "paired")

[1] 1.620297
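For paired data, Cohen's d is the mean difference divided by the standard deviation of the differences, so it can be recovered from the t statistic as d = t / sqrt(n). The sketch below plugs in t = 14.492 and df = 79 from the paired t-test output; the variable names are illustrative.

```r
t_value <- 14.492       # t statistic from the paired t-test output
n_pairs <- 79 + 1       # number of pairs = df + 1

d_paired <- t_value / sqrt(n_pairs)
round(d_paired, 2)      # 1.62
```

The small discrepancy from 1.620297 in later decimals comes from using the rounded t statistic.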

Comparing two means

Dependent t-test - Cohen's d - effsize - arguments

library("effsize")

# Function signature; x and y stand for the two score vectors
cohen.d(x, y, pooled = TRUE, paired = TRUE,
        na.rm = FALSE, hedges.correction = FALSE,
        conf.level = 0.95, noncentral = FALSE)

?cohen.d

Comparing two means

Dependent t-test - Cohen's d - effsize - example

library("effsize")

cohen.d(wm_t$post, wm_t$pre, pooled = TRUE, paired = TRUE,
        na.rm = FALSE, hedges.correction = FALSE,
        conf.level = 0.95, noncentral = FALSE)

Comparing two means (after Conway, n.d.)

Independent t-test - introduction

Assumptions:

  • The sampling distribution is normally distributed.
  • Data are measured at least at the interval level.
  • Homogeneity of variance.
  • Scores are independent (because they come from different people).

Comparing two means

Independent t-test - data

library("ggplot2") # provides ggplot()
library("car")     # provides leveneTest()

# View the wm_t dataset
wm_t

# Create subsets for each training time
wm_t08 <- subset(wm_t, cond == "t08")
wm_t12 <- subset(wm_t, cond == "t12")
wm_t17 <- subset(wm_t, cond == "t17")
wm_t19 <- subset(wm_t, cond == "t19")

# Summary statistics for each training condition
describe(wm_t08)
describe(wm_t12)
describe(wm_t17)
describe(wm_t19)

# Create a boxplot of the gain for the different training times
ggplot(wm_t, aes(x = cond, y = gain, fill = cond)) + geom_boxplot()

# Levene's test of homogeneity of variance
leveneTest(gain ~ factor(cond), data = wm_t)

Comparing two means

Independent t-test - base

# Conduct an independent t-test (var.equal = FALSE requests Welch's test, which does not assume equal variances)
t.test(wm_t19$gain, wm_t08$gain, var.equal = FALSE)

 

Welch Two Sample t-test

data:  wm_t19$gain and wm_t08$gain
t = 8.9677, df = 34.248, p-value = 1.647e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.287125 5.212875
sample estimates:
mean of x mean of y
     5.60      1.35 
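The fractional degrees of freedom in the Welch output come from the Welch-Satterthwaite approximation. The sketch below reproduces them by hand on simulated groups; `g1` and `g2` are hypothetical vectors, not the course data.

```r
set.seed(5)                              # reproducible simulated groups
g1 <- rnorm(20, mean = 5,   sd = 2)
g2 <- rnorm(20, mean = 1.5, sd = 1)

# Squared standard errors of the two group means
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

res <- t.test(g1, g2, var.equal = FALSE)
all.equal(df_welch, unname(res$parameter))   # TRUE
```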

Comparing two means (after Conway, n.d.)

Independent t-test - Cohen's d

Comparing two means

Independent t-test - effsize

# Calculate Cohen's d

cohen.d(wm_t19$gain, wm_t08$gain, pooled = TRUE, paired = FALSE,
        na.rm = FALSE, hedges.correction = FALSE,
        conf.level = 0.95, noncentral = FALSE)

 

Cohen's d

d estimate: 2.835822 (large)
95 percent confidence interval:
     inf      sup
1.893561 3.778083 
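For two independent groups, Cohen's d with a pooled standard deviation is the difference in means divided by the pooled SD. A minimal sketch, assuming a hypothetical helper `cohens_d_pooled()` and two small toy vectors:

```r
# Hypothetical helper: Cohen's d with a pooled standard deviation
cohens_d_pooled <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled SD: variances weighted by their degrees of freedom
  s_pooled <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / s_pooled
}

a <- c(4, 5, 6, 7, 8)   # toy group 1: mean 6, variance 2.5
b <- c(1, 2, 3, 4, 5)   # toy group 2: mean 3, variance 2.5

cohens_d_pooled(a, b)   # 3 / sqrt(2.5), about 1.897
```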

Chi-squared (after Pearson's chi-squared test, n.d.)

Introduction

Assumptions:

  • No more than 20 % of the cells in the contingency table have an expected count below 5

  • No cell in the contingency table has an expected count of zero
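Both conditions can be checked from the expected counts that `chisq.test()` returns. The sketch below uses a toy 2 x 3 table with made-up counts and hypothetical labels.

```r
# Toy contingency table (made-up counts), filled column by column
tab <- matrix(c(12, 8, 15, 5, 9, 11), nrow = 2,
              dimnames = list(Gender = c("M", "F"),
                              Edu    = c("A", "B", "C")))

expected <- chisq.test(tab)$expected   # expected counts under independence

mean(expected < 5)   # share of cells with an expected count below 5; should not exceed 0.2
all(expected > 0)    # TRUE when no expected count is zero
```

In this toy table every expected count is 12 or 8, so both conditions hold.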

Chi-squared

Data and gmodels

# Data

library("readxl")

gedu_sheets = excel_sheets("gedu.xlsx")
gedu = read_excel("gedu.xlsx", sheet = 1)
gedu$Gender = as.factor(gedu$Gender)
gedu$Edu = as.factor(gedu$Edu)
gedu$Edu2 = as.factor(gedu$Edu2)
levels(gedu$Gender) = c("Muž", "Žena") # man, woman
levels(gedu$Edu) = c("ZŠ", "SŠ bez maturity", "SŠ s maturitou", "VŠ") # primary; secondary without / with school-leaving exam; university
levels(gedu$Edu2) = c("Nižší než VŠ", "VŠ") # below university, university

# gmodels

library("gmodels")

?CrossTable

Chi-squared

Contingency tables

# Generate a cross table of gender and education
Gedu_CT_01 <- CrossTable(gedu$Edu, gedu$Gender)

# Generate a cross table of gender and education that includes only the row proportions and the results of the chi-square test.
Gedu_CT_02 <- CrossTable(gedu$Edu, gedu$Gender, prop.c = FALSE, prop.t = FALSE, chisq = TRUE, prop.chisq = FALSE)

# Generate a cross table of gender and education in SPSS format
Gedu_CT_03 <- CrossTable(gedu$Edu, gedu$Gender, format = "SPSS")

Chi-squared

Effect size - phi (after Phi coefficient, n.d.)

library("psych")

Gen = gedu$Gender
Edu2 = gedu$Edu2

# phi is defined for a 2 x 2 table
table_phi = table(Gen, Edu2)

phi(table_phi, digits = 2)
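For a 2 x 2 table, the magnitude of phi equals the square root of the uncorrected chi-squared statistic divided by the sample size. A minimal sketch on a toy table with made-up counts:

```r
# Toy 2 x 2 table (made-up counts), filled column by column
tab <- matrix(c(30, 20, 10, 40), nrow = 2)

# Chi-squared statistic without Yates' continuity correction
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

phi_manual <- sqrt(chi2 / sum(tab))
round(phi_manual, 2)   # 0.41
```

This route gives only the magnitude; a signed phi coefficient additionally depends on the order of the rows and columns.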

Chi-squared

Effect size - Cramér's V (after Cramér's V, n.d.)

library("lsr") # cramersV() comes from lsr, not psych

Gen = gedu$Gender
Edu = gedu$Edu

table_CV = table(Gen, Edu)

cramersV(table_CV)
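Cramér's V rescales the chi-squared statistic by the sample size and by the smaller table dimension minus one. The sketch below computes it by hand for a toy 2 x 3 table with made-up counts.

```r
# Toy 2 x 3 table (made-up counts), filled column by column
tab <- matrix(c(12, 8, 15, 5, 9, 11), nrow = 2)

chi2 <- unname(chisq.test(tab)$statistic)   # chi-squared statistic, here 3.75
n    <- sum(tab)                            # sample size, here 60
k    <- min(dim(tab))                       # smaller of the number of rows and columns

v_manual <- sqrt(chi2 / (n * (k - 1)))
v_manual                                    # 0.25
```

For a 2 x 2 table computed without the continuity correction, V coincides with the magnitude of phi.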

References

Conway, A. (n.d.). Intro to Statistics with R: Student's T-test. Available online at: https://www.datacamp.com/courses/intro-to-statistics-with-r-students-t-test

Cramér's V (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V

Effect size (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Effect_size

Pearson's chi-squared test (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

Pearson product-moment correlation coefficient (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Phi coefficient (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Phi_coefficient

Sampling distribution (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Sampling_distribution

Standard error (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Standard_error

Student's t-test (n.d.). In Wikipedia. Retrieved 10 October 2016 from https://en.wikipedia.org/wiki/Student%27s_t-test
