principles of Urban Science 3

dr. federica bianco | fbb.space | fedhere | fedhere

NHRT

this slide deck:

https://slides.com/federicabianco/pus2022_3

Reading in data
Descriptive statistics (central tendency, spread...)
Extracting descriptive statistics from data

1. overfitting

2. p-value inference

3. mapping in python (intro to geopandas)

quiz: https://forms.gle/3yEUJ23vGr9GXiKx9

descriptive statistics

1

Preamble: kinds of analytical questions

Are two measurements the same?
- is the amount of nitrates in Lums pond the same as it was 2 years ago?
Are two distributions the same?
- is the weight of Medicare members signed up for health newsletters the same as that of members who are not signed up?
Can I trust that a number comes from a certain distribution? -> p-value

measuring differences

are these 2 numbers the same?

clearly 1.2 ≠ 1.8

are these 2 numbers the same?

two numbers are never actually the same, but we understand that there are limitations in how well numbers represent reality

1.2 +/- 1 = 1.8 +/- 1

because the [0.2, 2.2] interval overlaps the [0.8, 2.8] interval
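
The overlap argument can be written as a small check. This is only a sketch: the function name `intervals_overlap` is hypothetical, and treating "the same" as "the +/- intervals overlap" is the simple criterion used on this slide, not a formal test.

```python
def intervals_overlap(value_a, err_a, value_b, err_b):
    """Return True if [a - err_a, a + err_a] overlaps [b - err_b, b + err_b]."""
    low_a, high_a = value_a - err_a, value_a + err_a
    low_b, high_b = value_b - err_b, value_b + err_b
    return low_a <= high_b and low_b <= high_a

# the two numbers from the slide, with and without uncertainties
same_with_errors = intervals_overlap(1.2, 1.0, 1.8, 1.0)
same_without_errors = intervals_overlap(1.2, 0.0, 1.8, 0.0)
print(same_with_errors, same_without_errors)
```

With an uncertainty of 1 on each number the intervals overlap; with no uncertainty they do not.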

measuring differences

distribution

All data has some element of randomness either because:

there is randomness in the way it is generated
there is uncertainty in the way it is measured
both (in most cases it's both)

we think of data points as a number extracted from a distribution. sometimes we have expectations for that distribution, sometimes we do not.

distributions

observational approach: a distribution represents the frequency with which we obtain a value ~x when measuring a phenomenon

frequency: number of values measured between x and x+dx

analyst approach: a distribution represents the probability with which a phenomenon generates a value that we measure to be ~x

[figure axes: x (horizontal); frequency / probability (vertical)]
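
A minimal sketch of how the two views relate, assuming a standard normal parent distribution and a hypothetical bin [x, x+dx); neither choice comes from the slides. The fraction of draws that land in the bin (frequency) should approach the probability the distribution assigns to that bin as the number of draws grows.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is reproducible

parent = NormalDist(mu=0.0, sigma=1.0)  # assumed parent distribution
n_draws = 100_000
draws = [random.gauss(parent.mean, parent.stdev) for _ in range(n_draws)]

x, dx = 0.5, 0.25  # hypothetical bin [x, x + dx)

# observational view: fraction of measured values that fall in the bin
frequency = sum(x <= d < x + dx for d in draws) / n_draws

# analyst view: probability that the phenomenon generates a value in the bin
probability = parent.cdf(x + dx) - parent.cdf(x)

print(f"frequency = {frequency:.4f}, probability = {probability:.4f}")
```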

distributions

Poisson

discrete support: k = 0, 1, 2, ...

P (k | \lambda) = \frac{\lambda^k ~e^{-\lambda}}{k!}

parameters: \lambda (plotted examples: lambda=10, lambda=1)

normal or Gaussian

continuous support: (-\infty, +\infty)

N (r | \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(r - \mu)^2}{2\sigma^2}}

parameters: \mu, \sigma (plotted example: (-0.1, 0.9))

[figure axes: support (horizontal); fraction of draws (vertical)]
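
The two formulas can be written out directly. This is a sketch using only the Python standard library; the function names `poisson_pmf` and `gaussian_pdf` are hypothetical. It also checks two things numerically: that the Poisson probabilities sum to 1 over the discrete support starting at k=0, and that the explicit Gaussian formula agrees with `statistics.NormalDist.pdf`.

```python
import math
from statistics import NormalDist

def poisson_pmf(k, lam):
    """P(k | lambda) = lambda^k e^(-lambda) / k!  for k = 0, 1, 2, ..."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def gaussian_pdf(r, mu, sigma):
    """N(r | mu, sigma) written out term by term."""
    norm = sigma * math.sqrt(2 * math.pi)
    return math.exp(-((r - mu) ** 2) / (2 * sigma**2)) / norm

# the discrete probabilities sum to 1 (sum truncated at k = 100)
poisson_total = sum(poisson_pmf(k, 10) for k in range(101))

# the explicit formula matches the standard-library implementation
explicit = gaussian_pdf(0.3, mu=0.0, sigma=1.0)
library = NormalDist(mu=0.0, sigma=1.0).pdf(0.3)

print(poisson_total, explicit, library)
```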

measuring differences between distributions

are these distributions the same?

means: 1, 2

standard deviation: 1, 1

m_n = \int_{-\infty}^{\infty} (x-c)^n f(x) dx

Moments and frequentist probability

a distribution’s moments summarize its properties:

central tendency: mean (n=1), median, mode (peak)

spread: standard deviation/variance (n=2), quartiles

symmetry: skewness (n=3)

cuspiness: kurtosis (n=4)

Normal

Chi-squared
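
A sketch of how the moments can be estimated from draws, using only the standard library. Assumptions not in the slides: the sample sizes, the 3 degrees of freedom for the chi-squared draws, and the convention of reporting skewness and kurtosis as the 3rd and 4th central moments divided by the matching power of the standard deviation.

```python
import random

random.seed(1)

def central_moment(values, n, c):
    """Sample estimate of m_n = integral of (x - c)^n f(x) dx."""
    return sum((v - c) ** n for v in values) / len(values)

def summarize(values):
    mean = central_moment(values, 1, 0.0)        # n=1, c=0
    variance = central_moment(values, 2, mean)   # n=2, c=mean
    std = variance ** 0.5
    skewness = central_moment(values, 3, mean) / std**3
    kurtosis = central_moment(values, 4, mean) / std**4
    return mean, std, skewness, kurtosis

n_draws = 50_000
normal_draws = [random.gauss(0, 1) for _ in range(n_draws)]
# chi-squared with 3 degrees of freedom: sum of 3 squared standard normals
chi2_draws = [sum(random.gauss(0, 1) ** 2 for _ in range(3)) for _ in range(n_draws)]

normal_summary = summarize(normal_draws)
chi2_summary = summarize(chi2_draws)
print("normal:      mean, std, skewness, kurtosis =", normal_summary)
print("chi-squared: mean, std, skewness, kurtosis =", chi2_summary)
```

The normal draws should come out symmetric (skewness near 0), while the chi-squared draws should show a clearly positive skewness.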

measuring differences between distributions

are these distributions the same?

if two distributions have measured means that agree within 1 (or n) standard deviations, they can be considered "the same"

means: 1,2

standard deviation: 1,1

[box plot labels: mean; IQR interquartile range (25%-75%); 1.5*IQR]

m_n = \int_{-\infty}^{\infty} (x-c)^n f(x) dx

https://github.com/fedhere/PUS2022_FBianco/blob/master/classdemo/ascombesqtet.ipynb

p-value hypothesis testing

2

Preamble: kinds of analytical questions

Are two measurements the same?
- is the amount of nitrates in Lums pond the same as it was 2 years ago?
Are two distributions the same?
- is the weight of Medicare members signed up for health newsletters the same as that of members who are not signed up?
Can I trust that a number comes from a certain distribution? -> p-value

Moments and frequentist probability

Imagine that I take a measurement of a quantity that is expected to be normally distributed with mean 0 and stdev 1

what is the probability that I would measure 1.5?

The probability of measuring any one value is mathematically 0... however I can say that

the probability of measuring something between -1σ and 1σ (within 1-sigma) is 68%.

So the probability of measuring something outside is 100-68 = 32% (16% in each tail).

So if I measure something outside of [-1σ, 1σ], I obtained a result that had a probability <32% of being measured.

the probability of measuring something between -2σ and 2σ (within 2-sigma) is 95%.

So the probability of measuring something outside is 100-95 = 5%.

So if I measure something outside of [-2σ, 2σ], I obtained a result that had a probability <5% of being measured.

the probability of measuring something between -3σ and 3σ (within 3-sigma) is 99.7%.

So the probability of measuring something outside is 100-99.7 = 0.3%.

So if I measure something outside of [-3σ, 3σ], I obtained a result that had a probability <0.3% of being measured.

it might be easier to think about it as cumulative distributions if you are comfortable with integrals
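
The 1, 2 and 3 sigma probabilities are differences of the cumulative distribution function (CDF). A minimal sketch with the standard library's `statistics.NormalDist`; the helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(n_sigma):
    """Probability of measuring a value in [-n_sigma, +n_sigma]."""
    return standard_normal.cdf(n_sigma) - standard_normal.cdf(-n_sigma)

for n_sigma in (1, 2, 3):
    inside = prob_within(n_sigma)
    print(f"within {n_sigma} sigma: {inside:.4f}   outside: {1 - inside:.4f}")

# probability of measuring something at least as far from 0 as 1.5
p_beyond_1p5 = 1 - prob_within(1.5)
print(f"P(|x| > 1.5) = {p_beyond_1p5:.4f}")
```

So the useful answer to "what is the probability that I would measure 1.5?" is the probability of measuring something at least that far from the mean, which is about 13%.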

Moments and frequentist probability

in the falsification framework: p-value

Set a threshold you believe corresponds to "reasonable doubt": 95% => α=0.05
Identify what you expect your measurement's distribution to be if the Null hypothesis holds
Measure your outcome x from the data
If x is outside of the area of "reasonable doubt" under the Null hypothesis => the Null hypothesis is rejected at the α level (p-value < α), otherwise the Null cannot be rejected.

[figure: distribution of measurements under the Null hypothesis, with the 2σ region marked]

x outside the 2σ region: Null hypothesis rejected (p-value 0.05)

x inside the 2σ region: Null hypothesis cannot be rejected at a p-value of 0.05


NHRT: p-value

Null Hypothesis Rejection Testing

set up threshold α: set a threshold you believe corresponds to "reasonable doubt", e.g. 95% => α=0.05

it's important to do this first. If we do not, we may be tempted to choose a threshold that fits our result, thus always reporting rejection of the Null hypothesis

identify how you expect your measurement to be distributed under the Null hypothesis

measure outcome from data: x0; extract the appropriate statistic from a set of data (e.g. mean, median...)

this quantity is called a "statistic"

p(|x|>|x_0|)<\alpha \implies falsify H0: the Null hypothesis is rejected at the α level

p(|x|>|x_0|)\geq \alpha \implies H0 cannot be falsified
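
The decision rule can be sketched as a function. Assumptions for illustration only: the Null distribution is a Gaussian (so "more extreme" means farther from its mean on either side), and the name `nhrt_two_sided` and the example values of x0 are hypothetical. Note that α is fixed as an argument before any outcome is looked at.

```python
from statistics import NormalDist

def nhrt_two_sided(x0, null_distribution, alpha=0.05):
    """Return (p_value, reject) for outcome x0 under a Gaussian Null."""
    center = null_distribution.mean
    distance = abs(x0 - center)
    # probability, under the Null, of an outcome at least as extreme as x0
    p_value = 2 * (1 - null_distribution.cdf(center + distance))
    return p_value, p_value < alpha

null = NormalDist(mu=0.0, sigma=1.0)  # assumed distribution under H0

p_far, reject_far = nhrt_two_sided(2.5, null)    # beyond 2 sigma
p_near, reject_near = nhrt_two_sided(1.5, null)  # within 2 sigma

print(f"x0 = 2.5: p = {p_far:.4f}, falsify H0: {reject_far}")
print(f"x0 = 1.5: p = {p_near:.4f}, falsify H0: {reject_near}")
```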

statistics

2.1

In NHRT a statistic is a quantity that relates to the data and has a known distribution under the Null Hypothesis

e.g.: the Z statistic is Normally distributed, Z~N(0,1)

Does a sample come from a known population? Z-test

In absence of effect (i.e. under the Null)

== the sample mean is the same as the population mean

Z is distributed according to a Gaussian N(μ=0, σ=1)

Z = \frac{\mu - \bar{x}}{\sigma/\sqrt{N}}

where μ and σ are the population mean and standard deviation, \bar{x} is the sample mean, and N is the sample size

Example: new bus route implementation.

https://github.com/fedhere/PUS2022_FBianco/blob/master/classdemo/ZtestBustime.ipynb

You know the mean and standard deviation of the travel time on a bus route: that is the population

You measure the new travel time between two stops 10 times: that is your sample.

Has travel time changed?

2σ

95%
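
A sketch of the Z-test for the bus example. All numbers are made up for illustration and are not taken from the linked notebook: a population mean of 36 minutes, a population standard deviation of 6 minutes, and 10 hypothetical new travel times.

```python
import math
from statistics import NormalDist, fmean

alpha = 0.05  # threshold chosen before looking at the data

# hypothetical population parameters for the old route (minutes)
population_mean = 36.0
population_std = 6.0

# hypothetical sample: 10 travel times measured on the new route (minutes)
new_times = [31, 33, 35, 30, 34, 32, 36, 29, 33, 32]

sample_mean = fmean(new_times)
n = len(new_times)

# Z = (mu - xbar) / (sigma / sqrt(N)), distributed as N(0, 1) under the Null
z = (population_mean - sample_mean) / (population_std / math.sqrt(n))

# two-sided p-value: probability of a |Z| at least this large under the Null
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value < alpha

print(f"sample mean = {sample_mean}, Z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

With these made-up numbers |Z| is just below 2, so the Null hypothesis (travel time has not changed) cannot be rejected at α=0.05, even though the sample mean is 3.5 minutes shorter.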

Are 2 proportions (fractions) the same? Z-test

Example: citibike women usage patterns

Citibike is the bike share system in place in NYC

They have pioneered not only bikeshare but also open data on bike usage

e.g. https://www.kaggle.com/datasets/sujan97/citibike-system-data

https://github.com/fedhere/PUS2022_FBianco/blob/master/classdemo/citibikes_gender.ipynb

You want to know if customers identifying as women are less likely than customers identifying as men to use citibike to commute (commute as opposed to recreational use)

Commuting is more likely to happen during weekdays than over the weekend, as most people have weekday jobs, so

Assumption: weekday trips are commuting trips, weekend trips are recreational trips

You know the fraction of rides women (men) take during the week

In absence of effect (i.e. under the Null)

== the proportions for men and women are the same

Z is distributed according to a Gaussian N(μ=0, σ=1)

p =\frac{p_0 n_0 + p_1 n_1}{n_0+n_1}\\ SE = \sqrt{ p ( 1 - p ) (\frac{1}{n_0} + \frac{1}{n_1}) }\\ Z = \frac{(p_0 - p_1)}{SE}

where p_0, p_1 are the two measured proportions and n_0, n_1 are the sizes of the two samples

[figure labels: 2σ, 95%]
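
A sketch of the two-proportion Z-test. The ride counts and weekday fractions are invented for illustration and do not come from the citibike data or the notebook; the function name `two_proportion_z` is hypothetical. Because the question is directional (are women less likely to commute?), the sketch uses a one-sided p-value, which is an assumption about how the test is set up.

```python
import math

def two_proportion_z(p0, n0, p1, n1):
    """Z statistic for the difference between two proportions."""
    p = (p0 * n0 + p1 * n1) / (n0 + n1)           # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n0 + 1 / n1))
    return (p0 - p1) / se

alpha = 0.05  # threshold chosen before looking at the data

# hypothetical numbers: fraction of each group's rides taken on weekdays
women_weekday_fraction, women_rides = 0.70, 10_000
men_weekday_fraction, men_rides = 0.75, 30_000

z = two_proportion_z(women_weekday_fraction, women_rides,
                     men_weekday_fraction, men_rides)

# one-sided p-value P(Z <= z) under N(0, 1), computed with erfc so that
# very small tail probabilities are not rounded to zero
p_value = 0.5 * math.erfc(-z / math.sqrt(2))
reject_null = p_value < alpha

print(f"Z = {z:.2f}, p = {p_value:.2e}, reject H0: {reject_null}")
```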

Statistics and tests

data kinds and nomenclature

3

Data Definitions

Data: observations that have been collected

Population: the complete body of subjects we want to infer about

Sample: the subset of the population about which data is collected/available

Census: collection of data from the entire population

Parameter: numerical value describing an attribute of the population

Statistic: numerical value describing an attribute of the sample

Data Definitions

The analysis of our ______

showed that for our 10 _________ the mean income is $60k. The standard deviation of the ______ means is $12k. From this _______ we infer for the _____________ a mean income _________ $60k +/- $12k

data

sample

statistics

population

parameter

At the root is the fact that a sample drawn from a parent distribution will look increasingly more like the parent distribution as the size of the sample increases.

More formally: The distribution of the means of N samples generated from the same parent distribution will

I. be normally distributed (i.e. will be a Gaussian)

II. have mean equal to the mean of the parent distribution, and

III. have standard deviation equal to the parent population standard deviation divided by the square root of the sample size
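
The three statements can be checked with a small simulation. This is a sketch under assumptions that are not in the slides: the parent distribution is an exponential with mean 1 and standard deviation 1 (deliberately not a Gaussian), each sample has 50 values, and 5000 samples are drawn.

```python
import random
from statistics import fmean, pstdev

random.seed(3)

sample_size = 50      # values per sample
n_samples = 5_000     # number of samples
parent_mean, parent_std = 1.0, 1.0  # exponential with rate 1

# mean of each sample drawn from the same parent distribution
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
]

mean_of_means = fmean(sample_means)
std_of_means = pstdev(sample_means)
expected_std = parent_std / sample_size ** 0.5

# rough normality check: about 68% of the means should fall within 1 std
fraction_within_1std = sum(
    abs(m - mean_of_means) < std_of_means for m in sample_means
) / n_samples

print(f"mean of means = {mean_of_means:.3f} (parent mean {parent_mean})")
print(f"std of means  = {std_of_means:.3f} (expected {expected_std:.3f})")
print(f"fraction within 1 std = {fraction_within_1std:.3f}")
```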

Types of Data:

Qualitative variables

No ordering

Urban Science e.g. precinct, state, gender. Also called Nominal, Categorical

Quantitative variables

Ordering is meaningful

Time, Distance, Age, Length, Intensity, Satisfaction, Number of

discrete

Counts: number of people in a county

Ordinal: survey response Good/Fair/Poor

continuous

Ordinal: Earthquakes (non-linear scale)

Interval: F temperature (interval size preserved)

Ratio: Car speed (0 is naturally defined)

Censored: age>90

Missing: “Prefer not to answer” (NA / NaN)

Types of Data:

which is the right test for me?

4

epistemological roots of overfitting

4

Ockham's razor

Ockham’s razor: Pluralitas non est ponenda sine necessitate

or “the law of parsimony”

William of Ockham (logician and Franciscan friar) ca. 1300

but probably to be attributed to John Duns Scotus (1265–1308)

“Complexity need not be postulated without a need for it”

“Between 2 theories choose the simpler one”

“Between 2 theories choose the one with fewer parameters”

Heliocentric model from Nicolaus Copernicus'

"De revolutionibus orbium coelestium".

Author Dr Long's copy of Cassini, 1777

Peter Apian, Cosmographia, Antwerp, 1524

Ockham's razor

Two theories may explain a phenomenon just as well as each other. In that case you should prefer the simpler one

[figure: data and model fit to data]

1 variable: x

y = ax + b

y = ax^2 + bx + c

parameters: a, b for the line; a, b, c for the parabola

the complexity of a model can be measured by the number of variables and the number of parameters

mathematically: given N data points there exists a model with N parameters that goes exactly through each data point. But is it useful?

Overfitting: fitting data with a model that is too complex and that does not extend to new data (low predictive power on test data)

https://github.com/fedhere/PUS2020_FBianco/blob/master/classdemo/overfit_animation.ipynb
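
A minimal overfitting sketch using only the standard library; it is not the code of the linked notebook. Assumptions for illustration: the data follow a hypothetical linear trend with Gaussian noise, the simple model is a least-squares line (2 parameters), and the complex model is the Lagrange interpolating polynomial, which has as many parameters as training points and passes exactly through each of them.

```python
import random

random.seed(2)

def true_model(x):
    return 2.0 * x + 1.0  # hypothetical underlying linear trend

def noisy_sample(xs, noise=1.0):
    return [true_model(x) + random.gauss(0, noise) for x in xs]

def fit_line(xs, ys):
    """Least-squares fit of y = a x + b (2 parameters)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial: N parameters, goes through all N points."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x = [float(i) for i in range(15)]
test_x = [i + 0.5 for i in range(14)]  # new data, between training points
train_y = noisy_sample(train_x)
test_y = noisy_sample(test_x)

line = fit_line(train_x, train_y)
polynomial = fit_interpolating_polynomial(train_x, train_y)

train_mse_line = mean_squared_error(line, train_x, train_y)
train_mse_poly = mean_squared_error(polynomial, train_x, train_y)
test_mse_line = mean_squared_error(line, test_x, test_y)
test_mse_poly = mean_squared_error(polynomial, test_x, test_y)

print(f"train MSE: line = {train_mse_line:.3f}, polynomial = {train_mse_poly:.3f}")
print(f"test MSE:  line = {test_mse_line:.3f}, polynomial = {test_mse_poly:.3f}")
```

The polynomial has zero error on the training data but a much larger error than the line on the new data: it has fit the noise rather than the trend.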

Ockham's razor

https://nbviewer.jupyter.org/gist/fedhere/ef2da384b9e114267e8e93b7366e4ff6

key concepts

Descriptive statistics: mean, median, standard deviation, interquartile range

Definition: Types of data

NHRT Null Hypothesis Rejection Testing and p-values

Definition: Parameters, features, variables

Statistical tests:

how to use it (statistics value compared to the distribution under the null)

how to choose it : what kind of data? what kind of question?

Distributions: frequency and probability interpretations

key concepts

From idea to hypothesis

idea

measurable quantity

statistic

H0 Null hypothesis

Ha Alternative hypothesis

NHRT setup:

set up threshold α

identify how you expect your measurement to be distributed under the Null hypothesis

measure outcome from data: x0

p(|x|>|x_0|)<\alpha \implies falsify H0

p(|x|>|x_0|)\geq \alpha \implies H0 cannot be falsified

