Foundations of Data Science for Everyone

Dr. Federica Bianco | fbb.space | fedhere

II: Probability and Statistics

Recap

0

Improving Decision Making through Data

It is estimated that the amount of data collected by humans from the beginning of history through 2003 is 5 exabytes.

Since 2013, humans generate and store ~5 exabytes of data every day.

how do I maximize information from few datapoints?

how do I extract the critical information and throw away unnecessary content from big data?

Improving Decision Making through Data

Data Collection → Pattern Extraction → Interpretation

Data Science - Kelleher & Tierney

[diagram: the skills of data science: Machine Learning, Stats and Probability, Data Visualization, Computer Science / HPC, Data Wrangling & Databases, Data Ethics & Regulation, Domain Expertise, Communication; grouped into tech skills, comm skills, and additional skills]

Improving Decision Making through Data

Pattern Extraction

Data Collection

Interpretation

[timeline: data collection from ~1750 BC, through punch cards, to 1970, E. F. Codd: Relational Databases (SQL)]

with relational databases, information could be accessed without knowing how the information was structured or where it resided in the database.

Improving Decision Making through Data

Pattern Extraction

Data Collection

Interpretation

a natural human tendency!

 

likely an evolutionary advantage

Improving Decision Making through Data

Data Collection → Pattern Extraction → Interpretation

the steps of data analysis and inference: descriptive and exploratory analysis

import pandas as pd
df = pd.read_csv(file_name)
df.describe()  # summary statistics for each numerical column

- how is the data organized?

- is the data complete?

- what are the statistical properties of the data?

we will look at the statistical properties: mean, standard deviation, median, quantiles... (sketched in code below)

- searching for anomalies, trends

- searching for relationships between the measurements (correlation)
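A minimal sketch of these checks with pandas (assuming, as in the snippet above, that file_name points to a CSV; the 'temperature' column is a hypothetical example):

import pandas as pd

df = pd.read_csv(file_name)

# statistical properties of one column ('temperature' is a made-up name)
print(df["temperature"].mean())    # central tendency
print(df["temperature"].std())     # spread
print(df["temperature"].median())  # robust central tendency
print(df["temperature"].quantile([0.25, 0.5, 0.75]))  # quartiles

# completeness: missing entries per column
print(df.isna().sum())

# relationships between the measurements
print(df.corr())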

Inferential: will the patterns hold?

Predictive: what will the temperature be in 2030?

Causal: what is the cause of the heating of the Earth? CO2? Solar cycles?

 

Data Science Practices

1

3 general principles of "good" science:

Falsifiability

Parsimony

Reproducibility

Reproducible research means:

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

GitHub allows reproducibility through code distribution.

Reproducibility

Reproducible research means:

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.

why?

1. assures a result is grounded in evidence

2. facilitates scientific progress by avoiding the need to duplicate unoriginal research

3. facilitates collaboration and teamwork

#openscience #opendata

Reproducible research in practice:

  • provide the raw data and the code to reduce it at all stages needed to get the outputs
  • provide code to reproduce all figures
  • provide code to reproduce all numerical outcomes

what is Data Science

1.2

EXPLORATORY DATA ANALYSIS

correlation

Correlation does not imply causation!!

2 things may be related because they share a cause, but not cause each other:

ice cream sales and deaths by drowning both correlate with temperature

In the era of big data you may encounter truly spurious correlations:

divorce rate in Maine | consumption of margarine

Pearson's correlation

r_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)

\bar{x} : \mathrm{mean~value~of~}x\\ \bar{y} : \mathrm{mean~value~of~}y\\ n: \mathrm{number~of~datapoints}\\ s_x ~=~\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}

Pearson's correlation measures linear correlation.

"positively" correlated: r_{xy} = 1~\mathrm{iff}~y = ax,~a>0 (maximally correlated)

"negatively" correlated (anticorrelated): r_{xy} = -1~\mathrm{iff}~y = -ax,~a>0 (maximally anticorrelated)

not linearly correlated: a Pearson's coefficient of 0 does not mean that x and y are independent!

Spearman's test (Pearson's for ranked values):

\rho_{xy} = 1-\frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \quad d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)

In pandas:

import pandas as pd
df = pd.read_csv(file_name)
df.corr()  # Pearson correlation matrix of the numeric columns
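A from-scratch check of the formula against scipy (a sketch; the simulated x, y data and noise level are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)  # linearly related, with noise

# Pearson's r computed directly from the formula above
r = np.sum((x - x.mean()) / x.std(ddof=1) *
           (y - y.mean()) / y.std(ddof=1)) / (len(x) - 1)

print(r)                        # close to 1: strong positive linear correlation
print(stats.pearsonr(x, y)[0])  # the same value from scipy
print(stats.spearmanr(x, y)[0]) # Spearman: Pearson's r of the ranks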

correlation

import pandas as pd
import pylab as pl

df = pd.read_csv(file_name)
corr = df.corr()
pl.imshow(corr, clim=(-1, 1), cmap='RdBu')  # <- anticorrelated | correlated ->
pl.xticks(range(len(corr)), corr.columns, rotation=45)
pl.yticks(range(len(corr)), corr.columns)
pl.colorbar();

probability

2.1

Crash Course in Statistics

free statistics book: http://onlinestatbook.com/

Introduction to Statistics: An Interactive e-Book

David M. Lane

what are probability and statistics?

Basic Probability: Frequentist interpretation

fraction of times something happens <=> probability of it happening

P(E) = frequency of E

P(coin = head) = 6/11 = 0.55 (after 11 tosses)

P(coin = head) = 51/101 ≈ 0.505 (after 101 tosses)

Basic Probability: Bayesian interpretation

probability represents a level of certainty relating to a potential outcome or idea:

if I believe the coin is unfair (tricked), then even if I get one head and one tail I will still believe I am more likely to get heads than tails
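A small simulation of the frequentist picture (a sketch; the fair coin and the sample sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)

# simulate tosses of a fair coin; the frequency of heads approaches P(head) = 0.5
for n in [11, 101, 10_001]:
    tosses = rng.integers(0, 2, size=n)  # 1 = head, 0 = tail
    print(n, tosses.mean())              # frequency of heads after n tosses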

Basic probability arithmetic

Rules:

0 \leq P(A) \leq 1

\mathrm{if~} \bar{A} \mathrm{~is~the~complement~of~} A \mathrm{~(everything~but~} A \mathrm{)}:\\ P(A) + P(\bar{A}) = 1

Basic probability arithmetic

disjoint events:

\mathrm{if~} P(A\cap B) = 0~\mathrm{then}:\\ \quad P(A~\mathrm{or}~B) = P(A) + P(B)

independent events:

P(A~\mathrm{and}~B) = P(A)\,P(B)\\ P(A|B) = P(A)

Basic probability arithmetic

related events: dependent probabilities

\mathrm{if~} P(A\cap B) > 0~\mathrm{then}:\\ \quad P(A | B) = \frac{P(A \cap B)}{P(B)}\\ \quad P(A | B) \neq P(A)~\mathrm{in~general}\\ \quad P(A\cap B) = P(A)\, P(B|A)

[Venn diagram: overlapping sets A and B; the overlap is P(A\cap B), the combined area is P(A\cup B)]
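A quick numerical check of the conditional-probability rule (a sketch; the two die events are illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)  # a fair six-sided die

A = (rolls % 2 == 0)  # event A: the roll is even
B = (rolls > 3)       # event B: the roll is greater than 3

p_B = B.mean()
p_A_and_B = (A & B).mean()

print(p_A_and_B / p_B)  # P(A|B) = P(A and B) / P(B), close to 2/3
print(A.mean())         # P(A) = 1/2, so P(A|B) != P(A): dependent events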

 statistics

2.2

statistics

takes us from observing a limited number of samples to inferring the properties of the population

TAXONOMY

Distribution: a formula (a model describing outcomes of measurements)

Population: all of the elements of a "family" or class

Sample: a finite subset of the population that you observe

coding time!

distributions

A distribution is a collection of datapoints whose frequency in the sample corresponds to a known formula.

P (k | \lambda) = \frac{\lambda^k ~e^{-\lambda}}{k!}

parameters: λ=1 | support: the values k can take

A number k (e.g. 1) has some probability of being drawn.

The probability depends on the parameters of the distribution, λ.

If I draw N numbers and plot a histogram of them, the histogram will have a specific shape.

distributions

normal or Gaussian: continuous support

N (r | \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(r - \mu)^2}{2\sigma^2}}, \quad \mathrm{support:~} (-\infty, +\infty)

parameters: μ=-0.1, σ=0.9

Poisson: discrete support (only non-negative ints)

P (k | \lambda) = \frac{\lambda^k ~e^{-\lambda}}{k!}, \quad \mathrm{support:~} k = 0, 1, 2, \ldots
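A sketch of drawing samples from these two distributions and histogramming them (parameter values follow the slides):

import numpy as np
import pylab as pl

rng = np.random.default_rng(2)

k = rng.poisson(lam=1, size=10_000)               # Poisson, λ=1: discrete support
r = rng.normal(loc=-0.1, scale=0.9, size=10_000)  # Gaussian, μ=-0.1, σ=0.9: continuous support

pl.hist(k, bins=np.arange(-0.5, 10), density=True, alpha=0.5, label='Poisson(1)')
pl.hist(r, bins=50, density=True, alpha=0.5, label='N(-0.1, 0.9)')
pl.legend();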

Moments and frequentist probability

m_n = \int_{-\infty}^{+\infty} (x-c)^n f(x)\, dx

a distribution’s moments summarize its properties:

central tendency: mean (n=1), median, mode

spread: standard deviation/variance (n=2), quartile range

symmetry: skewness (n=3)

cuspiness: kurtosis (n=4)
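These summaries are one call away with numpy/scipy (a sketch on a skewed sample; the lognormal choice is just to make the higher moments interesting):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(size=10_000)  # a skewed distribution

print(np.mean(x))         # central tendency (first moment)
print(np.var(x, ddof=1))  # spread (second central moment)
print(stats.skew(x))      # symmetry (third standardized moment): positive, right tail
print(stats.kurtosis(x))  # cuspiness (fourth standardized moment, excess kurtosis)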

Law of Large Numbers

As the size of a _____________ tends to infinity the mean of the sample tends to the mean of the _______________

Laplace (1700s) but also: Poisson, Bessel, Dirichlet, Cauchy, Ellis

Let X1...XN be an N-element sample from a population whose distribution has mean μ and standard deviation σ.

In the limit of N -> infinity, the sample mean x̄ approaches a Normal (Gaussian) distribution with mean μ and standard deviation σ/√N,

regardless of the distribution of X.

Central Limit Theorem

\bar{x} ~\sim~ N\left(\mu, \sigma/\sqrt{N}\right)
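A numerical illustration of the theorem (a sketch; the exponential population and the sample sizes are arbitrary choices):

import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 1.0  # an Exponential(1) population: decidedly non-Gaussian

# draw 10,000 samples of size N and take the mean of each
for N in [5, 50, 500]:
    means = rng.exponential(scale=1.0, size=(10_000, N)).mean(axis=1)
    # the spread of the sample means shrinks like sigma / sqrt(N)
    print(N, means.mean(), means.std(), sigma / np.sqrt(N))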

HW

extra

credits

Probability distributions

Binomial

Coin toss: I bet heads: head = success

"given n tosses, each with a probability p of getting head"

fair coin:  p = 0.5, n = 1

Vegas coin: p ≠ 0.5, n = 1

central tendency: np = mean

spread: np(1-p) = variance
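A sketch with scipy.stats (the tricked-coin value p=0.7 is a made-up example):

import numpy as np
from scipy import stats

n, p = 100, 0.7  # 100 tosses of a tricked coin (p is an illustrative value)
heads = stats.binom.rvs(n, p, size=10_000, random_state=5)

print(heads.mean(), n * p)           # central tendency: mean = np
print(heads.var(), n * p * (1 - p))  # spread: variance = np(1-p)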

Probability distributions

Poisson

Shot noise / count noise:

the innate noise in natural steady-state processes (star flux, rain drops...)

λ = mean, λ = variance
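Checking the mean = variance = λ property numerically (a sketch; λ=4 is arbitrary):

import numpy as np

rng = np.random.default_rng(6)
counts = rng.poisson(lam=4, size=100_000)  # e.g. photon counts in fixed time bins

print(counts.mean(), counts.var())  # both close to λ = 4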

Probability distributions

Gaussian

the most common noise model:

well behaved mathematically, symmetric; when we can, we will assume our uncertainties are Gaussian distributed

Probability distributions

Chi-square (χ²)

turns out to be extremely common:

many pivotal quantities follow this distribution and thus many tests are based on it
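Why it shows up so often: the sum of k squared standard-normal variables follows a χ² distribution with k degrees of freedom (a sketch; k=3 is arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
k = 3
z = rng.normal(size=(100_000, k))  # k independent standard normals
chi2_samples = (z**2).sum(axis=1)  # their squared sum is chi-square distributed

print(chi2_samples.mean(), k)                 # mean of chi2(k) is k
print(stats.chi2.mean(k), stats.chi2.var(k))  # scipy agrees: mean k, variance 2k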

 

coding time!
