federica bianco
astro | data science | data for good
dr.federica bianco | fbb.space | fedhere | fedhere
I: epistemological concepts and working environment
astrophysics -> data science
astrophysics stands out as an observational, rather than experimental science
to "observe" the natural status of a system provides advantages. To ask the system about its status may provide a biased response
(e.g. surveys have many biases)
"Data that stresses the infrastructure"
"Data larger than can be analyzed with typical tool"
John R. Mashey Chief Scientist, SGI, mid-1990s
"Data that does not fit in memory"
occurrence of the terms in the Google Books corpus, 1996-2016 (https://books.google.com/ngrams):
- Big Data
- Data Science (x30)
- Artificial Intelligence (x10)
Gartner report 2001
V1: Volume
Number of bytes
Number of pixels
Number of rows in a data table x number of columns for catalogs
V2: Variety
Diverse science return from the same dataset.
Multiwavelength
Multimessenger
Images and spectra
V3: Velocity
real time analysis, edge computing, data transfer
V4: Veracity
This V will refer to both data quality and availability (added in 2012)
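A rough sketch of the Volume arithmetic (rows x columns for a catalog) and of what "does not fit in memory" can mean in practice. The function name `catalog_size_bytes` and the catalog dimensions are made up for illustration; they do not describe any specific survey.

```python
import numpy as np


def catalog_size_bytes(n_rows, n_cols, dtype="float64"):
    """Memory needed to hold an n_rows x n_cols table of a single dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize


# Hypothetical catalog: one billion sources, 100 measured quantities each
size = catalog_size_bytes(1_000_000_000, 100)
print(f"{size / 1e9:.0f} GB")
```

Comparing this number with the RAM of a typical laptop is a quick way to decide whether a dataset counts as "big" for the tools at hand.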
Exquisite image quality
all over the sky
over and over again
SDSS image circa 2000
HSC image circa 2018
when you look at the sky at this resolution and this depth...
everything is blended and everything is changing
complexity
Astronomical data mainly include images, spectra, time-series data, and simulation data.
Most of the data are saved in catalogues or databases. The data from different telescopes or projects have their own formats, which causes difficulty with integrating data from various sources in the analysis phase. In general, each data item has a thousand or more features; this causes a large dimensionality problem. Moreover, data have many data types: structured, semi-structured, unstructured, and mixed.
astrophysics -> data science
UD Data Science Institute - Inaugural event
what is data science? we have been using data in science the whole time, but with the volume, rate, and complexity of the current data we have to worry about things that we would neglect until now: what happens if our data has errors, what happens if we have missing data?
Lou Rossi, Mathematical Sciences Chairperson, UD
(astrophysicists have always worried about that)
[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel, 1959
model
parameters: slope (a), intercept (b)
Model:
a mathematical formula with parameters
model variable: x - for us this will always be time
Data:
a set of observations
every parameter can take an infinity of values, so there is an infinity of possible models
Use the data to learn the parameters of the model
Machine Learning models are parametrized representations of "reality" where the parameters are learned from finite sets of realizations of that reality
Machine Learning is the discipline that conceptualizes, studies, and applies those models.
Key Concept
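A minimal sketch of "use the data to learn the parameters of the model", assuming the model is the line y = a x + b with slope a and intercept b. The data here are synthetic: the "true" values of a and b, the noise level, and the seed are invented for illustration, and an ordinary least-squares fit (`numpy.polyfit`) stands in for the learning step.

```python
import numpy as np

# Synthetic "observations": the true slope and intercept are made-up values
rng = np.random.default_rng(seed=302)
true_a, true_b = 2.0, 1.0
x = np.linspace(0, 10, 50)                            # model variable (e.g. time)
y = true_a * x + true_b + rng.normal(0, 0.5, x.size)  # data = model + noise

# Learn the parameters from the data with an ordinary least-squares fit;
# polyfit returns the coefficients from highest degree to lowest: [a, b]
a_fit, b_fit = np.polyfit(x, y, deg=1)
print(f"learned slope a = {a_fit:.2f}, intercept b = {b_fit:.2f}")
```

The learned values are close to, but not exactly, the values used to generate the data, because they are estimated from a finite, noisy set of observations.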
Week 1: Probability and statistics (stats for hackers)
Week 2: linear regression - uncertainties
Week 3: unsupervised learning - clustering
Week 4: kNN | CART (trees)
Week 5: Neural Networks - basics
Week 6: CNNs
Week 7: Autoencoders
Week 8: Physically motivated NN | Transformers
Somewhere I will also cover:
notes on visualizations
notes on data ethics
Tuesday: "theory"
Thursday: "hands on work"
Friday: "recap and preview"
slido.com
#2492 113
some administrative stuff
10% pre-class questions
10% class participation
25% midterm
15% final written
50% final interview
pre-class questions
from beginning of class to 5 minutes past (be on time!)
questions on previous class material and reading assignments
For the Midterm and the Final you are responsible for material in the labs, the reading, and the homework. In preparing for the exams, use the homework as a guide to which material is essential. In the Midterm and Final YOU WILL BE EXPECTED TO WORK INDIVIDUALLY.
Midterm... probably in class
issues: stereotype threat - working under duress is not necessarily a required skill
advantages: interviews for jobs are often timed
10% pre-class questions
10% class performance and participation
20% homeworks
25% midterm
35% final
Final: take home, 3 days, 30 min "interview" after the last day
Jake VanderPlas is a physicist and data scientist
what is data science?
general knowledge assumed
most of this will be neglected
STATISTICS
PROGRAMMING
MACHINE LEARNING
VISUALIZATION
DATA MUNGING
DATA INGESTION
what's left?
python
probability distributions
p-values
uncertainties
MCMC
regression
(linear, template)
classification
(trees, neural networks)
clustering
Time series analysis
Geospatial analysis?
the scientific method
(what is science?)
My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.
—Karl Popper, The Logic of Scientific Discovery
the demarcation problem:
what is science? what is not?
a scientific theory must be falsifiable
model: Einstein GR
prediction: light rays are deflected by mass
data: position of star changes during eclipse -> does not falsify -> GR still holds
data: position of star does not change during eclipse -> falsifies -> GR rejected
is psychology a science?
DISCUSS!
A theory can be said to be scientific if it makes falsifiable predictions
Experiments should be designed to falsify the predictions
Key Concept
the demarcation problem
things can get more complicated though:
most scientific theories are actually based largely on probabilistic induction and
modern inductive inference (Solomonoff, frequentist vs Bayesian methods...)
everything has **some** probability of happening. But it might be very small
traditional statistics works as follows:
- if the probability is smaller than some arbitrary cut (e.g. p~0.05) then I will say that it is not true
- what about ML?? all it does is (1) make predictions (2) find structure in data
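The "arbitrary cut" logic of traditional statistics can be sketched with a toy example. Everything here is an assumption made for illustration: the hypothesis being tested ("the coin is fair"), the made-up data (60 heads in 100 tosses), and the helper name `binomial_tail_p_value`. The p-value is the probability of seeing data at least this extreme if the hypothesis were true.

```python
from math import comb


def binomial_tail_p_value(n_heads, n_tosses, p_null=0.5):
    """P(at least n_heads heads in n_tosses tosses) if the null hypothesis is true."""
    return sum(
        comb(n_tosses, k) * p_null**k * (1 - p_null) ** (n_tosses - k)
        for k in range(n_heads, n_tosses + 1)
    )


alpha = 0.05                               # the "arbitrary cut"
p_value = binomial_tail_p_value(60, 100)   # made-up data: 60 heads in 100 tosses
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}; reject 'the coin is fair': {reject_null}")
```

The outcome is never impossible under the hypothesis, only improbable; the decision to reject depends entirely on where the cut is placed.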
Reproducible research means:
the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.
1. assures a result is grounded in evidence
#openscience
#opendata
2. facilitates scientific progress by avoiding the need to duplicate unoriginal research
3. facilitates collaboration and teamwork
Reproducible research in practice:
using the code and raw data provided by the analyst.
Claerbout, J. 1990,
Active Documents and Reproducible Results, Stanford Exploration Project Report, 67, 139
all numbers in a data analysis can be recalculated exactly (down to stochastic variables!)
A research product is reproducible if all numbers can be reproduced exactly by applying the same code to the same raw data.
It is the responsibility of the researcher to provide the data and code that make a research product reproducible
Key Concept
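One way to make results reproducible "down to stochastic variables" is to fix the seed of the random number generator. This is a minimal sketch, assuming NumPy is used for the random draws; the function name `noisy_mean` and the seed values are arbitrary choices for illustration.

```python
import numpy as np


def noisy_mean(seed):
    """Mean of 1000 Gaussian draws from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=1000).mean()


first_run = noisy_mean(seed=42)
second_run = noisy_mean(seed=42)      # same seed: identical numbers
different_seed = noisy_mean(seed=43)  # different seed: different numbers
print(first_run == second_run, first_run == different_seed)
```

Recording the seed alongside the code and the raw data lets a second researcher recalculate the same numbers exactly.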
the tools
Reproducible research means:
all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.
allows reproducibility through code distribution
the Git software
is a distributed version control system:
a version of the files on your local computer is also made available on a remote server.
The full history of the files is saved both locally and remotely so that any version (that was committed) is retrievable.
allows version control
collaboration tool
by fork, fork and pull request, or by working directly as a collaborator
allows effective collaboration
series of notebooks designed for Urban Science students by Dr. Mohit Sharma (in consultation with me)
recommended if you are brand new to python and coding or are serious about cleaning up your fundamentals
ignore the references to the CUSP working environment and work on https://colab.research.google.com/notebooks instead
quick bootcamp
recommended if you know some python or if you are reasonably proficient in some other coding language
online book
PEP 8: Python Enhancement Proposal 8
“This document gives coding conventions for the Python code comprising the standard library in the main Python distribution.”
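A small sketch of a few PEP 8 conventions: imports at the top of the file, 4-space indentation, lowercase_with_underscores names for functions and variables, UPPER_CASE for constants, two blank lines around top-level definitions, and a docstring. The function and values are invented for illustration.

```python
import math


def circle_area(radius):
    """Return the area of a circle of the given radius."""
    return math.pi * radius ** 2


MAX_RADIUS = 10.0  # module-level constant in UPPER_CASE
areas = [circle_area(r) for r in (1.0, 2.0, MAX_RADIUS)]
print(areas)
```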
A Python Bootcamp
1. install docker image from my account here
2. handle your own installation with python or anaconda (or whatever else on linux and windows) but make sure results are reproducible on google colab
you can ask coding questions, installation questions, colab questions...
it can be a toxic environment...
Science and Data Science
Falsifiability
Reproducibility
1
TBD
2
Jeff Leek & Roger Peng. 2015,
What is the Question?
the original link:
https://science.sciencemag.org/content/347/6228/1314.summary
(this link needs access to Science magazine, but you can use the link above, which is the same file)
Popper, K. 1934,
The Logic of Scientific Discovery
http://strangebeautiful.com/other-texts/popper-logic-scientific-discovery.pdf
Claerbout, J. 1990,
Active Documents and Reproducible Results,
Stanford Exploration Project Report, 67, 139
http://sepwww.stanford.edu/data/media/public/docs/sep67/jon2/paper_html/
By federica bianco
intro to this class