
Foundations of DS for everyone -I

federica bianco


foundations of data science for everyone

dr.federica bianco | fbb.space |    fedhere |    fedhere 
I: what is data science and working environment

 

epistemology

1  what is data science
2 the scientific method

   falsifiability

   probabilistic induction

   reproducibility

3 data science tools

   github

   python

   jupyter notebooks

   google colab

   stackoverflow

what is data science?

1


coding

statistics

domain knowledge

Data Science: the field of study that deals with the extraction of information from data, within a domain context, to enable interpretation and prediction of phenomena.

This includes the development and application of statistical tools and of machine learning and AI methods.


Artificial Intelligence:

enable machines to make decisions without being explicitly programmed

Machine Learning:

machines learn directly from data and examples


[diagram: the components of data science: STATISTICS, PROGRAMMING, MACHINE LEARNING, VISUALIZATION, DATA MUNGING, DATA INGESTION, NLP, Deep Learning (Neural Networks). Some of this we will refer to as needed, some we will not get to; the focus is the core.]

what's left?

python

probability distributions

p-values

uncertainties

regression (linear, template)

classification (trees, neural networks)

clustering

 

some administrative stuff

2

😷

Be safe so we can stay!!

Syllabus

  • 5%  pre-class questions
  • 15%  class participation (ask questions, contribute in breakouts!!)
  • 25% homework
  • 15% midterm
  • 30% final

pre-class questions: from the beginning of class to 5 minutes past (be on time!), on previous class material and reading assignments

class participation: ask questions, answer questions, get up and code, extra credit assignments

Please work on the homework in groups of 3-5 people, as collaborative projects.

All members of the group are responsible for the assignment.

The assignment must be uploaded to every student's repository. It does not have to be identical for all group members, but each copy must be just as complete as the one turned in.

 

homework

 

Code of Conduct

to ensure a healthy and safe collaborative environment

1) Read this:    bit.ly/fdsfe_coc

 

 

 

2) Answer questions here: https://bit.ly/fdsfe_cocform

3) Join the Slack workspace with the link you receive after finishing the form

the tools

3

Jupyter Notebook  Google Colaboratory

A collaborative platform for python coding

python

  • intuitive and readable
  • open source
  • supports C integration for performance
  • packages designed for science: 
    • scipy 
    • statsmodels
    • numpy (computation)
    • sklearn (machine learning)


python

Resources: Notebook based

Most compact and rapid:

Xiaolong Li crash course

python

Slightly more comprehensive - 

python bootcamp

If there is demand, in a couple of weeks I will run a live session going over this bootcamp.

 

 

python

Resources: Notebook based

python

series of notebooks designed for Urban Science students by Dr. Mohit Sharma (in consultation with me)

 

recommended if you are brand new to python and coding, or if you are serious about cleaning up your fundamentals

python

Resources: Notebook based

python

Free series of videos on the Giraffe Academy

python

Resources: other

Resources

there are many books; on github you can find links to the PDFs... but we will do things a bit differently

Resources

Jake Vanderplas is a physicist and data scientist

https://www.academia.edu/40917232/Python_Data_Science_Handbook

python

PEP 8: Python Enhancement Proposal 8

“This document gives coding conventions for the Python code comprising the standard library in the main Python distribution.”

github

Reproducible research means:

 

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

reproducibility

allows reproducibility through code distribution

github

the Git software

is a distributed version control system:

a version of the files on your local computer is also made available on a central server.

The history of the files is saved remotely, so that any version (that was checked in) is retrievable.

version control

allows version control

github

collaboration tool

by forking, by fork-and-pull-request, or by working directly as a collaborator

collaborative platform

allows effective collaboration

stackoverflow

for when you need help

you can ask coding questions, installation questions, colab questions...


it can be a toxic environment...

some administrative stuff (cont'd)

1


Homework projects must be turned in as jupyter notebooks by checking them into your github account, in a FDSfE_<firstinitialLastname> repo, with project directories HW<hw number> (unless otherwise stated). <firstinitialLastname> is e.g. fBianco

 

homework

 

instructions

solution

homework is assigned as skeleton notebooks with missing code

You will have to insert the code to get the correct output

You should then discuss the results you got (e.g., comment on a plot)

you will be graded on

(80% of the grade)

1) rendered plots (does it show what it should)

2) plot captions (can you interpret what it shows)

3) obtaining "correct" numbers where needed

4) interpreting each result you get

homework

 

A statement must be included in the README explaining each team member's contribution (similar to an acknowledgment of contributions you would find in a Nature letter; see, for example, these contributions).

 


homework

 

Each student must write a README file for their repository

(20% of the points)

in the readme you must state in your own words

  1. what was this homework about? relate it to what we discussed in class
  2. what was the hardest part of the homework for you?
  3. what was the easiest part of the homework for you?
  4. one new thing that you have learned

homework

 

instructions will be here 

https://github.com/fedhere/FDSfe_FBianco


homework

 

help: always available on slack!

please sign up asap by filling in the form!!

https://forms.gle/j4vejz7R2qwwQrdr9


homework

 

Of course there is also Canvas, which will be used to give you grades and occasionally post messages.


midterm

 

For the Midterm and the Final you are responsible for material in the labs, the reading, and the homework. In preparing for the exams, use the homework as a guide to which material is essential. In the Midterm and Final YOU WILL BE EXPECTED TO WORK INDIVIDUALLY.

 

Midterm... probably in class

advantages: interviews for jobs are often timed

issues: working under duress is not necessarily a required skill, but that is why the midterm counts for only 15%!


final

 
 


 

Final: take home, multiple days.


Resources

  • SLIDES --> https://slides.com/federicabianco/decks/fdsfe
  • HOMEWORK INSTRUCTIONS --> github
  • RESOURCES --> slack, github

 

If notebooks do not display, use

https://nbviewer.jupyter.org

Resources

slack


the steps of data-analysis and inference: descriptive and exploratory analysis

4

import pandas as pd
df = pd.read_csv(file_name)
df.describe()

- how is the data organized?

- is the data complete?

- what are the statistical properties of the data?

we will look at the statistical properties next week: mean, standard deviation, median, quantiles...

- searching for anomalies, trends, correlations, or relationships between the measurements 
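The exploratory checks above can be sketched with pandas; the small DataFrame here is a made-up stand-in for whatever you would load with pd.read_csv:

```python
import numpy as np
import pandas as pd

# toy dataset standing in for df = pd.read_csv(file_name)
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": [2.1, 3.9, 6.2, 8.0]})

print(df.shape)         # how is the data organized: (rows, columns)
print(df.isna().sum())  # is the data complete: missing values per column
print(df.describe())    # statistical properties: mean, std, quantiles
```

Note that df.describe() silently skips NaNs, so pair it with df.isna().sum() to see how much data is actually missing.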

correlation


Pearson's correlation

r_{xy} = \frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)

Pearson's correlation measures linear correlation

\bar{x} : \mathrm{mean~value~of~}x\\ \bar{y} : \mathrm{mean~value~of~}y\\ n: \mathrm{number~of~datapoints}\\ s_x ~=~\sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}
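A minimal numpy sketch of this formula, on made-up data; the direct computation is cross-checked against numpy's built-in np.corrcoef:

```python
import numpy as np

# hypothetical data: y is a noisy linear function of x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# Pearson's r from the definition: mean product of standardized deviations
n = len(x)
sx = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))
sy = np.sqrt(((y - y.mean()) ** 2).sum() / (n - 1))
r = ((x - x.mean()) / sx * (y - y.mean()) / sy).sum() / (n - 1)

# the 1/(n-1) normalizations cancel, so this matches np.corrcoef
assert abs(r - np.corrcoef(x, y)[0, 1]) < 1e-10
```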

correlated

"positively" correlated

r_{xy} = 1~\mathrm{iff}~y=ax~(a>0):~\mathrm{maximally~correlated}

anticorrelated

"negatively" correlated

r_{xy} = -1~\mathrm{iff}~y=-ax~(a>0):~\mathrm{maximally~anticorrelated}


not linearly correlated

 

Pearson's coefficient = 0 does not mean that x and y are independent!
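A quick illustration with toy data: here y is a deterministic function of x, so the two are anything but independent, yet Pearson's coefficient vanishes because the relation is symmetric rather than linear:

```python
import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2                      # fully determined by x

r = np.corrcoef(x, y)[0, 1]
# r ~ 0: positive and negative deviations of x cancel in the sum
assert abs(r) < 1e-6
```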

Spearman's test

(Pearson's correlation for ranked values)

\rho_{xy} = 1-\frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \quad d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)
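A sketch of the difference between the two coefficients, using scipy.stats on made-up monotonic data: the relation is strongly non-linear, so Pearson's r falls short of 1, while Spearman's rho, computed on the ranks, is 1:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 5, 50)
y = np.exp(x)                         # monotonic but very non-linear

r_pearson = np.corrcoef(x, y)[0, 1]   # < 1: not a linear relation
rho, pval = stats.spearmanr(x, y)     # 1: the ranks agree perfectly
```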

correlation

Pearson's correlation with pandas:

import pandas as pd
df = pd.read_csv(file_name)
df.corr()

correlation

import pandas as pd
import matplotlib.pyplot as pl

df = pd.read_csv(file_name)
pl.imshow(df.corr(), clim=(-1, 1), cmap='RdBu')
pl.xticks(list(range(len(df.corr()))),
          df.columns, rotation=45)
pl.yticks(list(range(len(df.corr()))),
          df.columns, rotation=45)
pl.colorbar();

<- anticorrelated | correlated ->

Correlation does not imply causality!!

2 things may be related because they share a cause, but not cause each other:

ice cream sales correlate with temperature | deaths by drowning correlate with temperature

In the era of big data you may encounter truly spurious correlations:

divorce rate in Maine | consumption of margarine

correlation

the scientific method and good scientific practices part 1: Reproducibility

5

3 General principles of "good" science

Falsifiability

Parsimony

Reproducibility

epistemology: 

 

the philosophy of science and of the scientific method

My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.

the demarcation problem:

what is science? what is not?

a scientific theory must be  falsifiable


the demarcation problem

model: Einstein's GR

prediction: light rays are deflected by mass

data: the position of a star during an eclipse

position of star changes during eclipse: does not falsify, GR still holds

position of star does not change during eclipse: falsifies, GR rejected

the demarcation problem

is psychology a science?

DISCUSS!

things can get more complicated, though: most scientific theories are actually based largely on probabilistic induction and modern inductive inference (Solomonoff; frequentist vs Bayesian methods...)

 

Reproducibility

Reproducible research means:

 

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.


why?

ensures a result is grounded in evidence

1

#openscience

#opendata

 


why?

facilitates scientific progress by avoiding the need to duplicate unoriginal research 

2


why?

facilitates collaboration and teamwork

3

Reproducible research in practice:

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst:

  • provide the raw data and the code to reduce it at all stages needed to get the outputs
  • provide code to reproduce all figures
  • provide code to reproduce all numerical outcomes
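A minimal sketch of the "down to stochastic variables" point: fixing the seed of the random number generator makes even the stochastic parts of an analysis exactly repeatable by anyone running the code:

```python
import numpy as np

rng = np.random.default_rng(42)       # fixed seed
sample1 = rng.normal(size=5)

rng = np.random.default_rng(42)       # same seed -> identical draws
sample2 = rng.normal(size=5)

assert (sample1 == sample2).all()
```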


key concepts

What is Data Science

 

 

 

 

 

 

 

 

homework

  • make an account on GitHub if you do not have one yet
  • Create a repository called FDSfE_<firstinitialLastname>
  • Add a README file to your repository that indicates that this is the repo for Foundations of Data Science

1

homework

  • upload to your GitHub repo the Colab notebook we worked on in class. The Colab notebook should appear in a folder HW1 and have a Google Colab link (more instructions will appear on GitHub)

2

reading

 

Jeff Leek & Roger Peng, 2015,
What is the Question?

the original link:

https://science.sciencemag.org/content/347/6228/1314.summary

(this link needs access to Science magazine, but you can use the link above, which is the same file)

 

2

additional  reading

 

Claerbout, J. 1990,

Active Documents and Reproducible Results,

Stanford Exploration Project Report, 67, 139

http://sepwww.stanford.edu/data/media/public/docs/sep67/jon2/paper_html/
