
Foundations of DS for everyone -I

federica bianco


foundations of data science for everyone

dr.federica bianco | fbb.space |    fedhere |    fedhere 
I: what is data science and working environment

 

epistemology

1  what is data science
2 the scientific method

   falsifiability

   probabilistic induction

   reproducibility

3 data science tools

   github

   python

   jupyter notebooks

   google colab

   stackoverflow

what is data science?

1


coding

statistics

domain knowledge

Data Science: the field of study that deals with the extraction of information from data, within a domain context, to enable interpretation and prediction of phenomena.

This includes the development and application of statistical tools and of machine learning and AI methods.


Artificial Intelligence:

enable machines to make decisions without being explicitly programmed

Machine Learning:

machines learn directly from data and examples


[diagram: the components of data science: STATISTICS, PROGRAMMING, MACHINE LEARNING, VISUALIZATION, DATA MUNGING, DATA INGESTION, NLP, Deep Learning (Neural Networks). Some of this we will refer to as needed, some we will not get to; the focus is the core.]

what's left?

python

probability distributions

p-values

uncertainties

regression (linear, template)

classification (trees, neural networks)

clustering

 

some administrative stuff

2

😷

Be safe so we can stay!!

Syllabus

  • 5%  pre-class questions
  • 15%  class participation (ask questions, contribute in breakouts!!)
  • 25% homework
  • 15% midterm
  • 30% final

pre-class questions: from the beginning of class to 5 minutes past (be on time!), on previous class material and reading assignments

class participation: ask questions, answer questions, get up and code, extra credit assignments

Please work on the homework in groups of 3-5 people, as collaborative projects.

All members of the group are responsible for the assignment.

The assignment must be uploaded to every student's repository. It does not have to be identical for all group members, but each copy must be just as complete as the one turned in.

 

homework

 

Code of Conduct

to ensure a healthy and safe collaborative environment

1) Read this:    bit.ly/fdsfe_coc

 

 

 

2) Answer questions here: https://bit.ly/fdsfe_cocform

3) Join the Slack workspace with the link you receive after finishing the form

the tools

3

Jupyter Notebook  Google Colaboratory

A collaborative platform for python coding

python

  • intuitive and readable
  • open source
  • supports C integration for performance
  • packages designed for science: 
    • scipy 
    • statsmodels
    • numpy (computation)
    • sklearn (machine learning)


python

Resources: Notebook based

Most compact and rapid:

Xiaolong Li crash course

python

Slightly more comprehensive - 

python bootcamp

If there is demand, in a couple of weeks I will run a live session going over this bootcamp.

 

 

python

Resources: Notebook based

python

series of notebooks designed for Urban Science students by Dr. Mohit Sharma (in consultation with me)

 

recommended if you are brand new to python and coding, or if you are serious about cleaning up your fundamentals

python

Resources: Notebook based

python

Free series of videos on the Giraffe Academy

python

Resources: other

Resources

there are many books; on github you can find links to the PDFs... but we will do things a bit differently

Resources

Jake Vanderplas is a physicist and data scientist

https://www.academia.edu/40917232/Python_Data_Science_Handbook

python

PEP 8: Python Enhancement Proposal 8

“This document gives coding conventions for the Python code comprising the standard library in the main Python distribution.”

github

Reproducible research means:

 

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

reproducibility

allows reproducibility through code distribution

github

the Git software

is a distributed version control system:

a version of the files on your local computer is also made available on a central server.

The history of the files is saved remotely, so that any version (that was checked in) is retrievable.

version control

allows version control

github

collaboration tool

by forking, by fork-and-pull-request, or by working directly as a collaborator

collaborative platform

allows effective collaboration

stackoverflow

for when you need help

you can ask coding questions, installation questions, colab questions...


it can be a toxic environment...

some administrative stuff (cont'd)

1


Homework projects must be turned in as jupyter notebooks by checking them into your github account, in a FDSfE_<firstinitialLastname> repo, with project directories HW<hw number> (unless otherwise stated). <firstinitialLastname> is e.g. fBianco

 

homework

 

instructions

solution

homework is assigned as skeleton notebooks with missing code

You will have to insert the code to get the correct output

You should then discuss the results you got (e.g., comment on a plot)

you will be graded on

(80% of the grade)

1) rendered plots (does it show what it should)

2) plot captions (can you interpret what it shows)

3) obtaining "correct" numbers where needed

4) interpreting each result you get

homework

 

A statement must be included in the README explaining each team member's contribution (similar to an acknowledgment of contributions you would find in a Nature letter; see, for example, these contributions).

 


homework

 

Each student must write a README file for their repository

(20% of the points)

in the readme you must state in your own words

  1. what was this homework about? relate it to what we discussed in class
  2. what was the hardest part of the homework for you?
  3. what was the easiest part of the homework for you?
  4. one new thing that you have learned

homework

 

instructions will be here 

https://github.com/fedhere/FDSfe_FBianco


homework

 

help: always available on slack!

please sign up asap by filling in the form!!

https://forms.gle/j4vejz7R2qwwQrdr9


homework

 

Of course there is also Canvas, which will be used to give you grades and occasionally post messages.


midterm

 

For the Midterm and the Final you are responsible for material in the labs, the reading, and the homework. In preparing for the exams, use the homework as a guide to which material is essential. In the Midterm and Final YOU WILL BE EXPECTED TO WORK INDIVIDUALLY.

 

Midterm... probably in class

advantages: interviews for jobs are often timed

issues: working under duress is not necessarily a required skill, but that is why the midterm counts for only 15%!


final

 
 


 

Final: take home, multiple days.


Resources

  • SLIDES --> https://slides.com/federicabianco/decks/fdsfe
  • HOMEWORK INSTRUCTIONS --> github
  • RESOURCES --> slack, github

 

If notebooks do not display, use

https://nbviewer.jupyter.org

Resources

slack


the steps of data-analysis and inference: descriptive and exploratory analysis

4

import pandas as pd
df = pd.read_csv(file_name)
df.describe()

- how is the data organized?

- is the data complete?

- what are the statistical properties of the data?

we will look at the statistical properties next week: mean, standard deviation, median, quantiles...

- searching for anomalies, trends, correlations, or relationships between the measurements 
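The exploratory checks above can be sketched with pandas; the small DataFrame here is a made-up stand-in for whatever you would load with pd.read_csv:

```python
import numpy as np
import pandas as pd

# toy dataset standing in for df = pd.read_csv(file_name)
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": [2.1, 3.9, 6.2, 8.0]})

print(df.shape)         # how is the data organized: (rows, columns)
print(df.isna().sum())  # is the data complete: missing values per column
print(df.describe())    # statistical properties: mean, std, quantiles
```

Note that df.describe() silently skips NaNs, so pair it with df.isna().sum() to see how much data is actually missing.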

correlation


Pearson's correlation

r_{xy} = \frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)

Pearson's correlation measures linear correlation

\bar{x} : \mathrm{mean~value~of~}x\\ \bar{y} : \mathrm{mean~value~of~}y\\ n: \mathrm{number~of~datapoints}\\ s_x ~=~\sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}
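A minimal numpy sketch of this formula, on made-up data; the direct computation is cross-checked against numpy's built-in np.corrcoef:

```python
import numpy as np

# hypothetical data: y is a noisy linear function of x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# Pearson's r from the definition: mean product of standardized deviations
n = len(x)
sx = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))
sy = np.sqrt(((y - y.mean()) ** 2).sum() / (n - 1))
r = ((x - x.mean()) / sx * (y - y.mean()) / sy).sum() / (n - 1)

# the 1/(n-1) normalizations cancel, so this matches np.corrcoef
assert abs(r - np.corrcoef(x, y)[0, 1]) < 1e-10
```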

correlated

"positively" correlated

r_{xy} = 1~\mathrm{iff}~y=ax~(a>0):~\mathrm{maximally~correlated}

anticorrelated

"negatively" correlated

r_{xy} = -1~\mathrm{iff}~y=-ax~(a>0):~\mathrm{maximally~anticorrelated}


not linearly correlated

 

Pearson's coefficient = 0 does not mean that x and y are independent!
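A quick illustration with toy data: here y is a deterministic function of x, so the two are anything but independent, yet Pearson's coefficient vanishes because the relation is symmetric rather than linear:

```python
import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2                      # fully determined by x

r = np.corrcoef(x, y)[0, 1]
# r ~ 0: positive and negative deviations of x cancel in the sum
assert abs(r) < 1e-6
```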

Spearman's test

(Pearson's correlation for ranked values)

\rho_{xy} = 1-\frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}, \quad d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)
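A sketch of the difference between the two coefficients, using scipy.stats on made-up monotonic data: the relation is strongly non-linear, so Pearson's r falls short of 1, while Spearman's rho, computed on the ranks, is 1:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 5, 50)
y = np.exp(x)                         # monotonic but very non-linear

r_pearson = np.corrcoef(x, y)[0, 1]   # < 1: not a linear relation
rho, pval = stats.spearmanr(x, y)     # 1: the ranks agree perfectly
```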

correlation

Pearson's correlation with pandas:

import pandas as pd
df = pd.read_csv(file_name)
df.corr()

correlation

import pandas as pd
import matplotlib.pyplot as pl

df = pd.read_csv(file_name)
pl.imshow(df.corr(), clim=(-1, 1), cmap='RdBu')
pl.xticks(list(range(len(df.corr()))),
          df.columns, rotation=45)
pl.yticks(list(range(len(df.corr()))),
          df.columns, rotation=45)
pl.colorbar();

<- anticorrelated | correlated ->

Correlation does not imply causality!!

2 things may be related because they share a cause, but not cause each other:

ice cream sales correlate with temperature | deaths by drowning correlate with temperature

In the era of big data you may encounter truly spurious correlations:

divorce rate in Maine | consumption of margarine

correlation

the scientific method and good scientific practices part 1: Reproducibility

5

3 General principles of "good" science

Falsifiability

Parsimony

Reproducibility

epistemology: 

 

the philosophy of science and of the scientific method

My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.

the demarcation problem:

what is science? what is not?

a scientific theory must be  falsifiable


the demarcation problem

model: Einstein's GR

prediction: light rays are deflected by mass

data: the position of a star during an eclipse

position of star changes during eclipse: does not falsify, GR still holds

position of star does not change during eclipse: falsifies, GR rejected

the demarcation problem

is psychology a science?

DISCUSS!

things can get more complicated, though: most scientific theories are actually based largely on probabilistic induction and modern inductive inference (Solomonoff; frequentist vs Bayesian methods...)

 

Reproducibility

Reproducible research means:

 

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.


why?

ensures a result is grounded in evidence

1

#openscience

#opendata

 


why?

facilitates scientific progress by avoiding the need to duplicate unoriginal research 

2


why?

facilitates collaboration and teamwork

3

Reproducible research in practice:

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst:

  • provide the raw data and the code to reduce it at all stages needed to get the outputs
  • provide code to reproduce all figures
  • provide code to reproduce all numerical outcomes
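A minimal sketch of the "down to stochastic variables" point: fixing the seed of the random number generator makes even the stochastic parts of an analysis exactly repeatable by anyone running the code:

```python
import numpy as np

rng = np.random.default_rng(42)       # fixed seed
sample1 = rng.normal(size=5)

rng = np.random.default_rng(42)       # same seed -> identical draws
sample2 = rng.normal(size=5)

assert (sample1 == sample2).all()
```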


key concepts

What is Data Science

 

 

 

 

 

 

 

 

homework

  • make an account on GitHub if you do not have one yet
  • Create a repository called FDSfE_<firstinitialLastname>
  • Add a README file to your repository that indicates that this is the repo for Foundations of Data Science

1

homework

  • upload to your GitHub repo the Colab notebook we worked on in class. The Colab notebook should appear in a folder HW1 and have a Google Colab link (more instructions will appear on GitHub)

2

reading

 

Jeff Leek & Roger Peng, 2015,
What is the Question?

the original link:

https://science.sciencemag.org/content/347/6228/1314.summary

(this link needs access to Science magazine, but you can use the link above, which is the same file)

 

2

additional  reading

 

Claerbout, J. 1990,

Active Documents and Reproducible Results,

Stanford Exploration Project Report, 67, 139

http://sepwww.stanford.edu/data/media/public/docs/sep67/jon2/paper_html/
