foundations of data science for everyone

Dr. Federica Bianco | fbb.space | fedhere | fedhere
III: probability and statistics part 2

 statistics

1

statistics

takes us from observing a limited number of samples to making inferences about the population

TAXONOMY

Population: all of the elements of a "family" or class

Sample: a finite subset of the population that you observe

distributions: a collection of numbers with a specific shape

observational approach: a distribution represents the frequency with which we obtain a value ~x when measuring a phenomenon

HISTOGRAMS OF SAMPLES

number of values measured between x and x+dx

analyst approach: a distribution represents the probability with which a phenomenon generates a value that we measure to be ~x

frequency

probability
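The two views can be seen in a minimal numpy sketch (the sample size and bin count are arbitrary choices): raw counts give the frequency view, and normalizing the histogram so it integrates to 1 gives the probability-density view.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 10_000)  # 10,000 measurements of a phenomenon

# observational view: counts of values measured between x and x+dx
counts, edges = np.histogram(x, bins=50)

# analyst view: same histogram normalized to integrate to 1 (a probability density)
density, _ = np.histogram(x, bins=edges, density=True)

dx = np.diff(edges)
print(counts.sum())          # total number of measurements
print((density * dx).sum())  # integrates to ~1
```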


most common distribution:

 well behaved mathematically and symmetric; when we can, we will assume our uncertainties or samples are Gaussian distributed

 

Gaussian (or normal) distribution

support: the x values for which the distribution is defined


Central tendency

TAXONOMY

central tendency: mean, median, mode

spread         : variance, interquartile range

m_n = \int_{-\infty}^{\infty} (x-c)^n f(x)\, dx

Moments and quantiles

a distribution’s moments summarize its properties:

 

 

central tendency: mean (n=1)

 

 

Gaussian (or Normal) distribution

Quantiles

measure what fraction of a distribution is within some x values

 

central tendency: median (50%)

 

 


Moments 

a distribution’s moments summarize its properties:

 

 

 

 

 

symmetric distribution: mean=median

skewed distribution: mean≠median

Quantiles


Spread


Moments

a distribution’s moments summarize its properties:

 

 

spread: variance (n=2)

              standard deviation

 

 

m_n = \int_{-\infty}^{\infty} (x-c)^n f(x)\, dx

Moments

Quantiles

measure what fraction of a distribution is within some x values

 

spread: interquartile range (25%-75%)

 

 

\sigma = \sqrt{\mathrm{variance}}

25%-75%

Normal


Quantiles

measure what fraction of a distribution is within some x values

 

spread: quantiles (5%-95%, 1%-99%...)

 

 

\sigma = \sqrt{\mathrm{variance}}

5%-95%

Coin toss:

 

Parameters: p, n

fair coin:  p=0.5, n=1

Vegas coin: p≠0.5, n=1

Binomial

I bet heads:

head = success

"given n tosses, each with probability p=0.5 of getting heads"

Support: non-negative integers (number of successes k, 0 ≤ k ≤ n)

Mean: np

Variance: np(1-p)
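The binomial mean and variance above can be checked with a minimal simulation sketch; the choices n=10, p=0.5, and 100,000 repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 0.5                              # 10 tosses of a fair coin
heads = rng.binomial(n, p, size=100_000)    # number of heads per experiment

print(heads.mean())  # ~ n*p = 5.0
print(heads.var())   # ~ n*p*(1-p) = 2.5
```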


 

Gaussian (or normal)

distribution

 

 

Parameters: μ, σ

 

Support: real numbers

Mean: μ

Variance: σ²

Shot noise / count noise

 

The innate noise in natural steady-state processes (star flux, rain drops...)

Poisson

Support: non-negative integers

Mean: λ

Variance: λ

 

 

Parameters: λ
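The defining Poisson property quoted above (mean and variance both equal λ) can be sketched numerically; λ=4.2 and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.2                                 # e.g. mean photon counts per exposure
counts = rng.poisson(lam, size=200_000)

# for a Poisson process, mean and variance are both λ
print(counts.mean(), counts.var())
```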

 

Moments and frequentist probability

m_n = \int_{-\infty}^{\infty} (x-c)^n f(x)\, dx

a distribution’s moments summarize its properties:

 

 

 

 

central tendency: mean (n=1), median, mode (peak)

spread: standard deviation (variance n=2), quantiles

symmetry: skewness (n=3)

cuspiness: kurtosis (n=4)


coding time!

are they the same? 

questions we need statistics to answer

2

Preamble: kinds of descriptive questions

  • What is the highest/lowest value?
    • what is the most viewed video?
    • what is the average number of views?
videos["views"].mean()    # --> 2360784.6382573447
videos["views"].median()  # --> 681861.0
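A mean so much larger than the median signals a right-skewed distribution. A sketch with simulated view counts (the lognormal shape and its parameters are an assumption for illustration, not the actual data):

```python
import numpy as np

rng = np.random.default_rng(1)
# view counts are typically heavy-tailed: simulate with a lognormal
views = rng.lognormal(mean=13, sigma=1.5, size=10_000)

# in a right-skewed distribution the mean is pulled above the median,
# as with the video views quoted above
print(views.mean(), np.median(views))
```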

Preamble: kinds of analytical questions

  • Are two measurements the same?
    • is the amount of rain in Wilmington this year the same as last year?

 

 

 

  • Are two distributions the same?
    • is the age distribution of CityBike users the same among registered Male customers and registered Female customers?

measuring differences between distributions

are these distributions the same?

if the measured means agree within 1 (or n) standard deviation(s), the distributions should be considered "the same"

 

[figure: two distributions with means 1 and 2 and standard deviations 1 and 1; annotations mark the mean, standard deviation, and interquartile range (25%-75%)]

measurement uncertainty

and measurement samples

2.1

  • Are two measurements the same?
    • is the amount of rain in Wilmington this year the same as last year?
    • are two pencils the same length

 

 

uncertainty because of the limitations of the measuring tool


Take N measurements, they will all be a bit different

 

number of values measured between x and x+dx


  • Are two measurements the same?
    • are two countries just as happy?
    • are two pencils the same length?

 

 

intrinsic variance in the phenomenon

Take N measurements, they will all be a bit different

 

number of values measured between x and x+dx


KEY CONCEPT:

 

the larger the number of samples from the distribution the more similar the distribution of our sample is to the actual "generative process": i.e. the histogram will look more and more like the actual distribution curve


THEREFORE

It is easier to tell if two distributions are the same when the samples are large


THEREFORE

I need statistical tests that acknowledge the size of the sample when I compare distributions

Law of Large Numbers

As the size of a _____________ tends to infinity the mean of the sample tends to the mean of the _______________

Laplace (1700s) but also: Poisson, Bessel, Dirichlet, Cauchy, Ellis

Let X1...XN be an N-element sample from a population whose distribution has
mean μ and standard deviation σ.
In the limit of N -> infinity,

the sample mean x̄ approaches a Normal (Gaussian) distribution with mean μ and standard deviation σ/√N

regardless of the distribution of X

Central Limit Theorem

\bar{x} ~\sim~ N\left(\mu, \sigma/\sqrt{N}\right)
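The CLT statement above can be checked numerically; a minimal sketch, assuming numpy and an arbitrary non-Gaussian parent (exponential with mean 1 and standard deviation 1):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100          # sample size
reps = 20_000    # number of independent samples

# parent distribution: exponential (decidedly non-Gaussian), mean=1, std=1
samples = rng.exponential(scale=1.0, size=(reps, N))
means = samples.mean(axis=1)

# CLT: sample means ~ N(mu, sigma/sqrt(N)) regardless of the parent shape
print(means.mean())  # ~ 1.0
print(means.std())   # ~ 1/sqrt(100) = 0.1
```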

An easy way to assess if two numbers are the same:

are the means farther apart than their standard deviations?


the principle of Falsifiability

3

3 General principles of "good" science

Falsifiability

Parsimony

Reproducibility

My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.

the demarcation problem:

scientific hypotheses need to be falsifiable


I need to see only one black swan to tell that the statement "all swans are white" is not true. But even if I don't see a black one, it does not mean all swans are white

But what happens when I have distributions of measurements?


Beyond any reasonable doubt

same concept guides prosecutorial justice

guilty beyond reasonable doubt

 

in a probabilistic sense, all hypotheses we make are possible

 

We will reject a hypothesis if its probability is lower than a predefined threshold

hypothesis testing

 

4


We do not "prove our hypothesis" 

 

we falsify the opposite of our hypothesis

Null Hypothesis Rejection Testing

formulate your prediction

Null Hypothesis

 

1

the pencils are NOT the same length (though I think they are)

the earth is NOT round (spoiler alert... it is!!)

The NULL hypothesis is typically what I want to reject: it's the way I think the world does NOT work


Alternative Hypothesis

 

2

if all alternatives to our model are ruled out, then our model must hold 

 

 

P(A) + P(\bar{A}) = 1

identify all alternative outcomes

 

The ALTERNATIVE hypothesis is the complete opposite of the NULL


confidence level

 

p-value

 

     threshold

3

set confidence threshold

 
\alpha = 0.05 \quad (95\%,~\approx 2\sigma)


4

find a measurable quantity which under the Null has a known distribution

 

pivotal quantities


if the Null hypothesis holds


For example, it follows a Gaussian Distribution with mean 0 and standard deviation 1

 


if a quantity follows a known distribution, once I measure its value I can compute the probability of getting that value: was it a likely or an unlikely draw?

if my quantity is ~20 it's very likely

if my quantity is ~35 it's very unlikely

pivotal quantities

quantities that under the Null Hypothesis follow a known distribution

p(\mathrm{pivotal~quantity} | NH)~\sim~p(NH|D)


also called "statistics"

e.g.:  Z statistic: difference between means, ~N(0,1)

         χ2 statistic: squared difference between prediction and observation, ~χ2

         K-S statistic: maximum distance between cumulative distributions


Z-test

The distribution of sample means for (independent) samples extracted from a population

with mean μ and standard deviation σ is

Normally distributed

 

 

 

\bar{X} \sim N(\mu,~\sigma/\sqrt{N})

Z-test

If I measure the mean of two samples 

(the samples of pencil measurements)

I expect the standardized difference to be a number drawn from a standard normal: a Gaussian with mean 0 and variance 1

The highest probability is at 0

The probability that the number is within 1σ of the mean is 68%

Z \sim N(\mu=0, \sigma=1)

Z-test

Z \sim N(\mu=0, \sigma=1)

If the means are within 1-sigma of each other, I cannot rule out that the distributions are the same at the 1-sigma level (p-value 0.32)

If the means are more than 1-sigma apart, I can rule out that the distributions are the same at the 1-sigma level (p-value 0.32)

reject the null

Z-test

Is the mean of a sample with known variance the same as that of a  known population?

pivotal quantity

Z = \frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} ~\sim~ N(\mu=0,~\sigma=1)

where \bar{X} is the sample mean, μ_0 the population mean, and σ_0/√n the standard error of the mean


The Z test (meaning the statistic is ~N(0,1))

provides a trivial interpretation of the measured quantity:

the Z value is exactly the distance from the mean of the distribution of possible outcomes, in units of standard deviation

so a result of 0.13 means we are 0.13 standard deviations from the mean (p>0.05)



why do we need a test? why not just measure the means and see if they are the same?



5

calculate it!

 

pivotal quantities


6

test data against alternative outcomes

 
\mathrm{what~is~}\alpha?

95%

α is the chosen probability threshold; the corresponding x value marks the boundary of the rejection region

if it's a Z-test, the number I get is the distance from the mean in units of standard deviation


prediction is unlikely

Null rejected

Alternative holds

p(NH|D)~<~\alpha


this corresponds to measuring a value of the statistic in the tail of the distribution

p(NH|D)~\geq~\alpha

prediction is likely

Null holds for now

 



[diagram: the model makes a prediction; the data either falsify the alternative or do not. If the data falsify everything but the model, the model holds: under the Null, a low-probability event happened]

"Under the Null Hypothesis" = if the proposed model is false

formulate the Null as the comprehensive opposite of your theory

Key Slide

1. formulate your prediction (NH)

2. identify all alternative outcomes (AH)

3. set confidence threshold (p-value)

4. find a measurable quantity which under the Null has a known distribution (pivotal quantity)

5. calculate the pivotal quantity

6. calculate the probability of the value obtained for the pivotal quantity under the Null; if probability < p-value: reject the Null

p-value hypothesis testing

5

Moments and frequentist probability

Imagine that I take a measurement of a quantity that is expected to be normally distributed with mean 0 and stdev 1

 

what is the probability that I would measure 1.5?


The probability of measuring any one value is mathematically 0... however I can say that

the probability of measuring something between -1σ and 1σ  (within 1-sigma) is 68%.

So the probability of measuring something outside is 100-68 = 32%. 

So if I measure something outside of [-1σ:1σ], that had a probability <32% of being measured.


the probability of measuring something between -2σ and 2σ  (within 2-sigma) is 95%.

So the probability of measuring something outside is 100-95 = 5%. 

So if I measure something outside of [-2σ:2σ] that had a probability <5% of being measured.


the probability of measuring something between -3σ and 3σ  (within 3-sigma) is 99.7%.

So the probability of measuring something outside is 100-99.7 = 0.3%. 

So if I measure something outside of [-3σ:3σ] that had a probability <0.3% of being measured.
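The 68/95/99.7 numbers quoted above can be recomputed from the error function: for a standard normal, P(-kσ < X < kσ) = erf(k/√2). A minimal sketch using only the standard library:

```python
from math import erf, sqrt

# probability of falling within k standard deviations of the mean
for k in (1, 2, 3):
    inside = erf(k / sqrt(2))
    print(f"{k} sigma: {inside:.1%} inside, {1 - inside:.1%} outside")
```

This prints approximately 68.3%, 95.4%, and 99.7% inside, matching the slide.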


it might be easier to think about it as cumulative distributions if you are comfortable with integrals


Moments and frequentist probability

  1. Set a threshold you believe corresponds to "reasonable doubt" 95% => α=0.05 
  2. Identify what you expect your measurement's distribution to be if the Null hypothesis holds
  3. Measure your outcome from the data x
  4. If x is outside of the area of "reasonable doubt" under the Null hypothesis => the null hypothesis is rejected at p-value = α, otherwise the Null cannot be rejected.

in the falsification framework: p-value

Distribution of measurements under the Null hypothesis


Null hypothesis cannot be rejected at a

p-value 0.05

p-value hypothesis testing

step by step

6

NHRT: p-value

  1. Set a threshold you believe corresponds to "reasonable doubt" 95% => α=0.05 

Null Hypothesis Rejection Testing

set up threshold α

It's important to do this first: if we do not, we may be tempted to choose a threshold that fits our result, thus always reporting a rejection of the null hypothesis

NHRT: p-value

  1. Set a threshold you believe corresponds to "reasonable doubt" 95% => α=0.05 
  2. Identify what you expect your measurement's distribution to be if the Null hypothesis holds
  3. Measure your outcome from the data x; extract the appropriate statistics from a set of data (e.g. mean, median...)
  4. If x is outside of the area of "reasonable doubt" under the Null hypothesis the null hypothesis is rejected at p-value = α, otherwise the Null cannot be rejected.

Null Hypothesis Rejection Testing

set up threshold α

identify how you expect your measurement to be distributed under the Null hypothesis

measure outcome from data: x0 (this quantity is called a "statistic")

p(|x|>|x_0|)<\alpha \implies falsify H0

p(|x|>|x_0|)\geq \alpha \implies H0 cannot be falsified

p-value hypothesis testing

common tests

6

Statistical way to measure differences:

 

In NHRT a statistic is a quantity derived from the data that has a known distribution under the Null Hypothesis

e.g.: the Z statistic is Normally distributed, Z~N(0,1)

In absence of effect (i.e. under the Null)

​== the sample mean is the same as the population mean

Z is distributed according to a Gaussian N(μ=0, σ=1)

Z = \frac{\mu - \bar{x}}{\sigma/\sqrt{N}}

Does a sample come from a known population?

Example: new bus route implementation.

https://github.com/fedhere/PUS2022_FBianco/blob/master/classdemo/ZtestBustime.ipynb

You know the mean and standard deviation of a bus route's travel time: that is the population

You measure the new travel time between two stops 10 times: that is your sample. 

Has travel time changed?
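A sketch of this Z-test with hypothetical numbers (the travel times and population parameters below are made up for illustration, not taken from the linked notebook):

```python
import numpy as np

# hypothetical measurements: 10 timed trips (minutes) on the new route
sample = np.array([31.6, 34.2, 30.9, 35.5, 33.1, 29.8, 36.0, 32.4, 33.7, 30.5])
mu0, sigma0 = 36.0, 6.0   # known population mean and std of the old route

# Z = (sample mean - population mean) / standard error
z = (sample.mean() - mu0) / (sigma0 / np.sqrt(len(sample)))
print(z)  # |z| < 1.96: cannot reject the null at alpha=0.05
```

With these made-up numbers z is about -1.7, inside the 95% region, so the change in travel time is not significant at α=0.05.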

95%

Statistics that follow a Standard Normal distribution

Z -test


The expectation is that Z will be distributed following a standard normal: a Gaussian with mean 0 std 1.

 

Values away from 0 are increasingly less probable. 

68% probability to get a number between -1 and +1

95% probability to get a number between -2 and +2

How to interpret the number you get?


IF OUR p-value THRESHOLD is 1-sigma, the 68% region is between -1 and +1

=> there is only a 32% probability of getting a number < -1 or > 1


In absence of effect (i.e. under the Null)

== the proportions of men and women are the same

Z is distributed according to a Gaussian N(μ=0, σ=1)

Are 2 proportions (fractions) the same? 

Example: citibike women usage patterns

https://github.com/fedhere/PUS2020_FBianco/blob/master/classdemo/citibikes_gender.ipynb

You want to know if women are less likely than men to use citibike to commute.

You know the fraction of rides women (men) take during the week

95%

Z = \frac{(p_0 - p_1)}{SE} \\

Statistics that follow a Standard Normal distribution


p = \frac{p_0 n_0 + p_1 n_1}{n_0+n_1}, \quad SE = \sqrt{p(1-p)\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}
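The two-proportion Z statistic above can be sketched as follows; the ride counts and fractions are hypothetical placeholders, not the citibike values:

```python
import numpy as np

def two_proportion_z(p0, n0, p1, n1):
    """Z statistic for H0: the two proportions are equal."""
    p = (p0 * n0 + p1 * n1) / (n0 + n1)            # pooled proportion
    se = np.sqrt(p * (1 - p) * (1 / n0 + 1 / n1))  # standard error under H0
    return (p0 - p1) / se

# hypothetical counts: fraction of weekday rides for two rider groups
z = two_proportion_z(p0=0.78, n0=5000, p1=0.82, n1=20000)
print(z)  # |z| > 1.96: reject the null at alpha=0.05
```

With these placeholder numbers z is about -6.5, far in the tail, so equal proportions would be rejected.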

K-S test

Kolmogorov-Smirnov:

do two samples come from the same parent distribution?

pivotal quantity

d_{12} ≡ max_{x}\left|C_1(x)~−~C_2(x)\right|

Cumulative distribution 1

Cumulative distribution 2

P(d > \mathrm{observed}) = 2\sum_{j=1}^\infty (-1)^{j-1}e^{-2j^2\lambda^2}, \quad \lambda = \sqrt{\frac{N_1 N_2}{N_1+N_2}}\, d_{12}
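A minimal sketch of the K-S statistic d12 itself (not the p-value sum), computing the maximum distance between two empirical cumulative distributions with numpy; the sample sizes and the 0.5 shift are arbitrary choices:

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum distance between the two empirical cumulative distributions."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    c1 = np.searchsorted(a, grid, side="right") / len(a)  # ECDF of sample 1
    c2 = np.searchsorted(b, grid, side="right") / len(b)  # ECDF of sample 2
    return np.abs(c1 - c2).max()

rng = np.random.default_rng(4)
same = ks_statistic(rng.normal(0, 1, 2000), rng.normal(0, 1, 2000))
diff = ks_statistic(rng.normal(0, 1, 2000), rng.normal(0.5, 1, 2000))
print(same, diff)  # the shifted sample gives a larger distance
```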

\chi^{2} \equiv \sum_i{\frac{({f(x_i) - y_i})^2}{\sigma_i^2}} ~\sim~ \chi^2(df=n-1)

where f(x_i) is the model, y_i the observation, σ_i the uncertainty, and n the number of observations (the subtracted term should actually be the number of parameters in the model)

How to use statistical tables

χ2 test

are the data what is expected from the model? (if the likelihood is Gaussian... we'll see this later) - there are a few χ2 tests; the one here is Pearson's χ2 test
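A sketch of the χ² statistic for data with uncertainties compared to a model; the observations, uncertainties, and constant model below are hypothetical:

```python
import numpy as np

# hypothetical data: observations y_i with uncertainties sigma_i,
# compared to a constant model f(x_i) = 5
y = np.array([4.8, 5.3, 4.9, 5.1, 5.6, 4.7, 5.2, 4.9])
sigma = np.full_like(y, 0.3)
model = np.full_like(y, 5.0)

chi2 = np.sum((model - y) ** 2 / sigma ** 2)
dof = len(y) - 1                 # n minus number of model parameters
print(chi2, chi2 / dof)          # chi2/dof ~ 1 if the model fits
```

Here χ²/dof comes out close to 1, consistent with the constant model describing the (made-up) data.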


Statistics that do not follow a Standard Normal distribution

In absence of effect (i.e. under the Null)

== the samples are drawn from the same population

The K-S statistic follows the Kolmogorov distribution

Are 2 samples the same? 

KS-test


Statistics and tests

t = \frac{|\bar{x}_A - \bar{x}_B|}{S_{AB}\sqrt{\frac{1}{N_A} + \frac{1}{N_B}}}
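The t statistic above, with S_AB the pooled standard deviation of the two samples, can be sketched as follows; the two samples are hypothetical:

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t statistic with pooled standard deviation S_AB."""
    na, nb = len(a), len(b)
    s_ab = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                   / (na + nb - 2))                 # pooled standard deviation
    return abs(a.mean() - b.mean()) / (s_ab * np.sqrt(1 / na + 1 / nb))

# hypothetical measurements of two pencils (cm)
a = np.array([5.1, 4.9, 5.3, 5.0, 4.8])
b = np.array([5.6, 5.4, 5.8, 5.5, 5.7])
t = t_statistic(a, b)
print(t)
```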


Data Definitions

Data:             observations that have been collected
Population: the complete body of subjects we want to infer about

Sample:        the subset of the population about which data is collected/available

 

Census:        collection of data from the entire population

 

Parameter:  numerical value describing an attribute of the population

Statistic:     numerical value describing an attribute of the sample

Data Definitions

The analysis of our ______ showed that for our 10 _________ the mean income is $60k. The standard deviation of the ______ means is $12k. From this _______ we infer for the _____________ a mean income _________ $60k +/- $12k

data

sample

statistics

population

parameter

data kinds and nomenclature

7

At the root is the fact that  a sample drawn from a parent distribution will look increasingly more like the parent distribution as the size of the sample increases.

 

More formally: The distribution of the means of N samples generated from the same parent distribution will

 

I. be normally distributed (i.e. will be a Gaussian)

 

II. have mean equal to the mean of the parent distribution, and

 

III. have standard deviation equal to the parent population standard deviation divided by the square root of the sample size

 

 

 


Types of Data:

Qualitative variables
  No ordering. e.g. precinct, state, gender. Also called Nominal or Categorical.

Quantitative variables
  Ordering is meaningful. e.g. Time, Distance, Age, Length, Intensity, Satisfaction, Number of...

  discrete
    Counts: number of people in a county
    Ordinal: survey response Good/Fair/Poor

  continuous
    Ordinal: Earthquakes (non-linear scale)
    Interval: F temperature (interval size preserved)
    Ratio: Car speed (0 is naturally defined)

  Censored: age > 90
  Missing: "Prefer not to answer" (NA / NaN)

key concepts

descriptive statistics

 

null hypothesis rejection testing setup

 

pivotal quantities 

 

Z,  K-S tests

 

 

READING

Foundations of DS for everyone - III

By federica bianco

Foundations of Data Science for Everyone - Probability and Statistics