ML for physical and natural scientists 2023 1

 
dr. federica bianco | fbb.space | fedhere | fedhere
I: epistemological concepts and working environment


epistemology

1  what is data science
2 the scientific method

   falsifiability

   probabilistic induction

   reproducibility

3 data science tools

   github

   python

   jupyter notebooks

   google colab

   stackoverflow

who  am I?

astrophysics -> data science 

astrophysics stands out as an observational, rather than experimental, science

 

to "observe" the natural state of a system provides advantages: querying the system about its state may yield a biased response

(e.g. surveys have many biases)


Historical perspective: Big Data

"Data larger than can be analyzed with typical tools"

John R. Mashey, Chief Scientist, SGI, mid-1990s

"Data that does not fit in memory"

"Data that stresses the infrastructure"

Historical perspective

- Big Data

- Data Science (x30)

- Artificial Intelligence (x10)

 

[figure: occurrence of each term in the Google Books corpus, 1996 to 2016; source: https://books.google.com/ngrams]


Historical perspective

Gartner report 2001




4-V of Big Data

V1: Volume
Number of bytes

 

Number of pixels

 

Number of rows in a data table x number of columns for catalogs
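
As a rough illustration of how rows x columns translates into volume, the sketch below estimates the in-memory size of a catalog stored as 64-bit floats. The catalog dimensions are hypothetical values chosen only for the arithmetic; they do not describe any specific survey.

```python
import numpy as np

# Hypothetical catalog dimensions, chosen only for illustration
n_rows = 1_000_000   # number of objects (rows)
n_cols = 100         # number of features per object (columns)

# Each float64 value occupies 8 bytes
bytes_per_value = np.dtype(np.float64).itemsize

# Volume = rows x columns x bytes per value
catalog_bytes = n_rows * n_cols * bytes_per_value
catalog_gb = catalog_bytes / 1e9

print(f"{catalog_bytes} bytes = {catalog_gb:.1f} GB")
```

Scaling either the number of rows or the number of columns by ten pushes this example from "fits on a laptop" toward "does not fit in memory".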


 

 

V2: Variety
Diverse science return from the same dataset.

Multiwavelength

Multimessenger

Images and spectra

V3: Velocity

real time analysis, edge computing, data transfer

V4: Veracity
This V refers to both data quality and availability (added in 2012)


Exquisite image quality

all over the sky

over and over again 

SDSS image circa 2000
HSC image circa 2018

when you look at the sky at this resolution and this depth...

everything is blended and everything is changing


complexity

Astronomical data mainly include images, spectra, time-series data, and simulation data.

Most of the data are saved in catalogues or databases. The data from different telescopes or projects have their own formats, which causes difficulty with integrating data from various sources in the analysis phase. In general, each data item has a thousand or more features; this causes a large dimensionality problem. Moreover, data have many data types: structured, semi-structured, unstructured, and mixed.

why Data Science?

astrophysics -> data science 

UD Data Science Institute - Inaugural event

what is data science? we have been using data in science the whole time, but with the volume, rate, and complexity of the current data we have to worry about things that we would neglect until now: what happens if our data has errors, what happens if we have missing data?

Lou Rossi, Mathematical Sciences Chairperson, UD 

(astrophysicists have always worried about that)


 

what is  machine learning?

 

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

 


 

Model:

a mathematical formula with parameters

y = ax + b

parameters: slope (a), intercept (b)

 

model variable: x (for us this will always be time)

 

Data:

a set of observations

 

 

each choice of parameter values gives a different model, so one formula describes an infinite family of models

 


Use the data to learn the parameters of the model
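
A minimal sketch of "learning the parameters from data" for the line model y = ax + b. The data here are synthetic, generated from assumed true values of a and b plus Gaussian noise, and `numpy.polyfit` is used as one possible least-squares fitter; it is an illustration, not the method prescribed by the course.

```python
import numpy as np

# Synthetic "observations": assumed true parameters plus noise
rng = np.random.default_rng(42)   # fixed seed so the numbers can be reproduced
a_true, b_true = 2.0, 1.0         # hypothetical slope and intercept
x = np.linspace(0, 10, 50)        # model variable (e.g. time)
y = a_true * x + b_true + rng.normal(0, 0.5, size=x.size)

# Learn the parameters from the data with a least-squares fit.
# For deg=1, polyfit returns [slope, intercept].
a_fit, b_fit = np.polyfit(x, y, deg=1)

print(f"learned slope a = {a_fit:.2f}, learned intercept b = {b_fit:.2f}")
```

The learned values should land close to, but not exactly on, the values used to generate the data, because the parameters are learned from a finite and noisy set of observations.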

 


 

Machine Learning models are parametrized representations of "reality" where the parameters are learned from finite sets of realizations of that reality

Machine Learning is the discipline that conceptualizes, studies, and applies those models.

Key Concept

 

what is  machine learning?

 

Week 1: Probability and statistics (stats for hackers)

Week 2: linear regression - uncertainties

Week 3: unsupervised learning - clustering

Week 4: kNN | CART (trees)

Week 5: Neural Networks - basics

Week 6: CNNs

Week 7: Autoencoders

Week 8: Physically motivated NN | Transformers 

Somewhere I will also cover:

notes on visualizations

notes on data ethics

Tuesday: "theory"

Thursday: "hands on work"

Friday: "recap and preview"

 

slido.com

#2492 113

some administrative stuff

0

Syllabus


  • 10% pre-class questions

  • 10% class participation

  • 25% midterm

  • 15% final written

  • 50% final interview

quiz

 

pre-class questions

 

from beginning of class to 5 minutes past (be on time!)

questions on previous class material and reading assignments


midterm

 

For the Midterm and the Final you are responsible for material in the labs, the reading, and the homework. In preparing for the exams, use the homework as a guide to which material is essential. In the Midterm and Final YOU WILL BE EXPECTED TO WORK INDIVIDUALLY.

 

Midterm... probably in class

issues: stereotype threat - working under duress is not necessarily a required skill

advantages: interviews for jobs are often timed

  • 10%  pre-class questions

  • 10% class performance and participation

  • 20% homeworks

  • 25% midterm

  • 35% final

final

 
 


 

Final: take home, 3 days, 30 min "interview" after the last day

Resources


  • SLIDES here in PDF form
  • EXERCISES INSTRUCTIONS here
  • RESOURCES here

 

If notebooks do not display, use https://nbviewer.jupyter.org


Resources

Jake VanderPlas is a physicist and data scientist

what is data science?

1

general knowledge assumed

most of this will be neglected

STATISTICS

PROGRAMMING

MACHINE LEARNING

VISUALIZATION

DATA MUNGING

DATA INGESTION

what's left?

python

probability distributions

p-values

uncertainties

MCMC

regression

(linear, template)

classification

(trees, neural networks)

clustering

Time series analysis

Geospatial analysis?

the scientific method

(what is science?)

2

epistemology: 

 

the philosophy of science and of the scientific method

"My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements."

Karl Popper, The Logic of Scientific Discovery

the demarcation problem:

what is science? what is not?

a scientific theory must be  falsifiable


the demarcation problem

Einstein GR

model: General Relativity (GR)

prediction: light rays are deflected by mass

data: position of star changes during eclipse -> does not falsify: GR still holds

data: position of star does not change during eclipse -> falsifies: GR rejected

is psychology a science?

the demarcation problem

DISCUSS!

the demarcation problem

A theory can be said to be scientific if it makes falsifiable predictions

Experiments should be designed to falsify the predictions

 

Key Concept

the demarcation problem

things can get more complicated though:

most scientific theories are actually based largely on probabilistic induction and modern inductive inference (Solomonoff, frequentist vs Bayesian methods...)

everything has **some** probability of happening. But it might be very small

traditional statistics works as follows:
- if the probability is smaller than some arbitrary cut (e.g. p ~ 0.05) then I will say that it is not true

 

 


- what about ML?? all it does is (1) make predictions (2) find structure in data
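
The thresholding rule described for traditional statistics can be sketched with a toy example. The scenario (100 coin flips, 60 heads, a fair-coin hypothesis) and the 0.05 cut are illustrative assumptions; the calculation is an exact one-sided binomial tail probability using only the standard library.

```python
from math import comb

# Hypothetical experiment: is this coin fair?
n_flips = 100   # number of flips
n_heads = 60    # number of heads observed
alpha = 0.05    # the arbitrary cut

# Probability of observing n_heads or more heads if the coin is fair:
# sum the binomial probabilities of every outcome at least this extreme
p_value = sum(comb(n_flips, k) for k in range(n_heads, n_flips + 1)) / 2**n_flips

# The traditional decision rule: reject the hypothesis if p is below the cut
reject_fair_coin = p_value < alpha

print(f"p-value = {p_value:.4f}, reject fair-coin hypothesis: {reject_fair_coin}")
```

Note that the outcome is never impossible under the fair-coin hypothesis; it is only improbable enough to fall below a threshold that was chosen by convention.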

 


 

Reproducibility

Reproducible research means:

 

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.

why?

1 assures a result is grounded in evidence

2 facilitates scientific progress by avoiding the need to duplicate unoriginal research

3 facilitates collaboration and teamwork

#openscience

#opendata

Reproducible research in practice:

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

  • provide the raw data and the code that takes it through every stage needed to get the outputs
  • provide code to reproduce all figures
  • provide code to reproduce all numerical results

Reproducibility

A research product is reproducible if all numbers can be reproduced exactly by applying the same code to the same raw data.

It is the responsibility of the researcher to provide the data and code that make a research product reproducible
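
One way to make results reproducible "down to stochastic variables" is to fix the seed of the random number generator. The sketch below assumes NumPy's `default_rng`; the function name `noisy_mean` and the seed values are hypothetical and chosen only for illustration.

```python
import numpy as np

def noisy_mean(seed):
    """Draw 1000 Gaussian random numbers with a fixed seed and return their mean."""
    rng = np.random.default_rng(seed)   # seeded generator: same seed, same numbers
    sample = rng.normal(loc=0.0, scale=1.0, size=1000)
    return sample.mean()

first_run = noisy_mean(seed=302)
second_run = noisy_mean(seed=302)   # same seed: identical result
other_seed = noisy_mean(seed=303)   # different seed: different random draws

print(first_run == second_run)   # the two runs agree exactly
print(first_run == other_seed)   # a different seed gives a different number
```

Sharing the seed along with the code and raw data lets another researcher recover the same numbers rather than statistically similar ones.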

Key Concept

the tools

3

github

Reproducible research means:

 

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

reproducibility

allows reproducibility through code distribution

github

the Git software is a distributed version control system:

a version of the files on your local computer is also made available on a central server.

The history of the files is saved remotely so that any version (that was checked in) is retrievable.

version control

allows version control

github

collaboration tool

by fork, fork and pull request, or by working directly as a collaborator

collaborative platform

allows effective collaboration

python

  • intuitive and readable
  • open source
  • supports C integration for performance
  • packages designed for science: 
    • scipy 
    • statsmodels
    • numpy (computation)
    • sklearn (machine learning)
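
A small sketch of why the compiled routines behind packages such as numpy matter: the same sum of squares is computed once with an explicit Python loop and once with a single vectorized numpy call. The array size is an arbitrary choice for illustration, and no timing is claimed here.

```python
import numpy as np

# An array of 100,000 numbers (arbitrary size, for illustration)
values = np.arange(100_000, dtype=np.float64)

# Pure-Python approach: loop over the elements one at a time
loop_total = 0.0
for value in values:
    loop_total += value * value

# numpy approach: one vectorized expression, executed in compiled code
vector_total = np.sum(values ** 2)

# Both approaches give the same answer
print(np.isclose(loop_total, vector_total))
```

The vectorized form is shorter to read and typically much faster on large arrays, since the loop runs inside numpy's compiled code rather than in the Python interpreter.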


python

series of notebooks designed for Urban Science students by Dr. Mohit Sharma (in consultation with me)

 

recommended if you are brand new to python and coding or are serious about cleaning up your fundamentals

 

ignore the references to the CUSP working environment and work on https://colab.research.google.com/notebooks instead

 

python

quick bootcamp 

recommended if you know some python or are reasonably proficient in another coding language

python

online book

python

PEP 8: Python Enhancement Proposal 8

“This document gives coding conventions for the Python code comprising the standard library in the main Python distribution.”
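
As a small illustration of a few PEP 8 conventions (4-space indentation, lowercase_with_underscores names, spaces around operators, a docstring), here is a short function written in that style. The function itself is a hypothetical example, not taken from the course material.

```python
def mean_of_squares(values):
    """Return the mean of the squared values in a non-empty sequence."""
    total = 0.0
    for value in values:
        total += value ** 2
    return total / len(values)


# Example call on a short list of numbers
print(mean_of_squares([1.0, 2.0, 3.0]))
```

Following a shared style guide does not change what the code computes, but it makes code easier for collaborators to read and review.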

python

A Python Bootcamp

Jupyter Notebook  Google Colaboratory

Jupyter Notebook 

local setup

1 install docker image from my account here

2 handle your own installation with python or anaconda (or whatever else on linux and windows) but make sure results are reproducible on google colab

stackoverflow

for when you need help

you can ask coding questions, installation questions, colab questions...


it can be a toxic environment...

key   concepts

Science and Data Science

Falsifiability

Reproducibility

 

 

 

 

homework

  • make an account on GitHub if you do not have one yet
  • Create a repository called MLPNS_<firstinitialLastname>
  • use the form to confirm you read the Code of Conduct and deliver your repo link
  • write a Readme.md file to state what this repo is for, what your motivation to take this class is, what you hope to learn. At the end of the course we will reflect on these early expectations

1

homework

TBD

2

reading

 

Jeff Leek & Roger Peng. 2015,
What is the Question?

the original link:

https://science.sciencemag.org/content/347/6228/1314.summary

(this link needs access to Science magazine, but you can use the link above, which is the same file)

 

2

additional  reading

 

Popper, K. 1934,

The Logic of Scientific Discovery

http://strangebeautiful.com/other-texts/popper-logic-scientific-discovery.pdf

Claerbout, J. 1990,

Active Documents and Reproducible Results,

Stanford Exploration Project Report, 67, 139

http://sepwww.stanford.edu/data/media/public/docs/sep67/jon2/paper_html/

machine learning for natural and physical scientists 1, 2023

By federica bianco
