Machine Learning for Time Series Analysis I

Spring 2025 - UDel PHYS 664
dr. federica bianco

@fedhere

fbianco@udel.edu

epistemology

1 logistics

2 tools

jupyter notebooks

3 what is a time series

4 what is science

falsifiability

reproducibility

this slide deck:

https://slides.com/federicabianco/mltsa25_01

MLTSA: logistics

grading, communication

MLTSA: logistics

Dr. Federica Bianco - fbianco@udel.edu

Office hours:

???

Location: Sharp Lab 209

Available on Slack.... a lot

MLTSA: logistics

Class syllabus - https://bit.ly/mltsa25_syllabus

UD Canvas - used for announcements and grading

Class slack - you will access it by filling in a form as your first assignment. The form will ensure you read the code of conduct and syllabus for the class

Communication


MLTSA: logistics

most communications will happen on slack
I am far more responsive to slack messages than any other way of communication
other members of the class can also help on slack
each assignment will have its own slack dedicated channel for Q/A (e.g. #hw1)
If you find a mistake/bug in my notes/instructions, report it on #bugsandissues on slack and earn +2 points in the homework assignment!

Communication

MLTSA: logistics

There are several textbooks listed on the syllabus. However, since none of them fully covers the curriculum, they may be useful but will not be sufficient. The slides will be the main resource for reviewing the material covered. Therefore, I urge you not to rush to spend a large amount of money on buying them. Start the class and see if you think they would help you, and which ones would. I also asked the physics library to acquire copies of the books, and you can also ask to borrow mine from time to time.

Elements of Statistical Learning, Hastie,Tibshirani,Friedman, Springer 2001 - available for free at the link provided here

Is a foundational textbook for machine learning

Statistics, Data Mining, and Machine Learning in Astronomy, Ivezic, Connolly, VanderPlas, Gray, 2nd edition 2019

It is an application textbook that shows specifically applications of ML to astrophysics, and a good chunk of astrophysics deals with time series. Note that the second edition just appeared.

Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow Aurélien Géron O'Reilly Media 2019

probably the book that is closest to the syllabus in terms of techniques, but don’t buy it, because the second edition is due to come out imminently and the deep learning chapters of the previous edition are out of date now

More books about python may be useful depending on your background.

Dive into Deep Learning free resource, Interactive deep learning book with code, math, and discussions Implemented with PyTorch, NumPy/MXNet, JAX, and TensorFlow

Coding if you are not familiar with python

Python Data Science Handbook, Jake VanderPlas, O'Reilly Media 2016 (for free at this link, just click download)

Beginning Python Visualization, Shai Vaingast 2009 (great for python beginners) - available for free at the link provided here

Visualizations:

Visualization Analysis and Design, T. Munzner, 2014 not free

Class resources

MLTSA: logistics

Open textbook on forecasting http://otexts.com/fpp2

Forecasting: Principles and Practice
Rob J Hyndman and George Athanasopoulos

Class resources

MLTSA: logistics

Be respectful, be kind, be supportive. This class has a Code of Conduct: read and answer the CoC survey.
Show up to class and show up on time
Review the class material after every class, before the next class
Do the required reading
Participate in in-class work, ask and answer questions
Turn in homework assignments on time
Read the deliverables required for each homework, in the syllabus (for requirements that hold for all assignments) and in each homework folder's README.
Even if you are only auditing, come to class regularly. You do not have to participate in in-class work, but I recommend you do!

What I expect from you

MLTSA: logistics

15% pre-class questions
20% labs performance and participation
15% homework
20% midterm (project proposal)
30% final (project)

Grades are based on:

every (... most) class will start with a quiz. You will have up to 10 minutes to complete the quiz. The quiz will be based on the content of previous lectures and assigned reading.

MLTSA: logistics


midterm and final projects

Team projects: minimum 3 people group, max 4 people group

Deliverables:

report + presentation (slides and live presentation) for both proposal and final project

(templates are available in http://bit.ly/MLTSA20drive)

Deliver on your own repo

github.com/MLTSA25_<Firstinitial><Lastname>/HW<number>

MLTSA: logistics

midterm and final projects

Please work in groups of up to 5 people on homework as collaborative projects.

Individual notebooks must be returned for each homework. Different group members should lead different aspects of the work. A statement must be included in the README explaining each team member’s contribution (similar to the acknowledgment of contributions you would find in a Nature letter; see, for example, these contributions).

MLTSA: logistics

Think about a dataset from your own field that we can use. Submit the dataset here
Get up and code! We will have collaborative coding sessions. I will be coding in front of you all and ask you to volunteer and code. You would learn a lot by doing it!
Ask questions! If you have a doubt most likely many other classmates do. I will not judge you for the questions you ask, but I will reward you for asking questions.
Answer questions and participate in the discussions we have in class. Verbalizing concepts is the best way to internalize them and clarify them to yourself.
Participate in in-class coding activities. In general these activities will start in class and will be turned in as homework. The more you do in class the less you have to do as homework

Things that count for participation
(20% of the grade)

MLTSA:

class tools

python github google-colab stackoverflow

github

https://github.com

Reproducible research means:

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

Claerbout, J. 1990,

Active Documents and Reproducible Results, Stanford Exploration Project Report, 67, 139

reproducibility

allows reproducibility through code distribution

github

https://github.com

the Git software

is a distributed version control system:

a version of the files on your local computer is made also available at a central server.

The history of the files is saved remotely so that any version (that was checked in) is retrievable.

version control

allows version control

github

collaboration tool

by fork, fork and pull request, or by working directly as a collaborator

collaborative platform

allows effective collaboration

https://github.com

python

https://www.economist.com/graphic-detail/2018/07/26/python-is-becoming-the-worlds-most-popular-coding-language

intuitive and readable
open source
supports C integration for performance
packages designed for science:
- scipy
- statsmodels
- numpy (computation)
- sklearn (machine learning)

python

series of notebooks designed for Urban Science students by Dr. Mohit Sharma (in consultation with me)

recommended if you are brand new to python and coding or are serious about cleaning up your fundamentals

https://sharmamohit.com/work/tutorials/ucsl/

python

https://github.com/fedhere/PyBOOT

quick bootcamp

recommended if you know some python or if you know some other coding language reasonably proficiently

python

https://www.southampton.ac.uk/~fangohr/training/python/pdfs/Python-for-Computational-Science-and-Engineering.pdf

online book

python

PEP8: Python Enhancement Proposals 8

“This document gives coding conventions for the Python code comprising the standard library in the main Python distribution.”

Indentation, Tabs vs Spaces, Maximum Line Length, Blank Lines, Source File Encoding, Imports, Whitespace in Expressions and Statements, Comments, Bookkeeping, Naming

Jupyter Notebook Google Colaboratory

it can be a toxic environment...

Intermission:

entry survey

https://forms.gle/kzVBQ4K8b1f48oWz8

https://bit.ly/mltsasurvey25

MLTSA: epistemology

Science Guiding Principles

epistemology:

the philosophy of science and of the scientific method

My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.

—Karl Popper, The Logic of Scientific Discovery

the demarcation problem:

what is science? what is not?

a scientific theory must be falsifiable

—Karl Popper, The Logic of Scientific Discovery

model: Einstein GR

prediction: Light rays are deflected by mass

data: position of star changes during eclipse: does not falsify, the model still holds

data: position of star does not change during eclipse: falsifies, the model is rejected

http://discovermagazine.com/2019/may/why-it-took-the-1919-solar-eclipse-for-physicists-to-believe-einstein

is astrology a science?

the demarcation problem

DISCUSS!

the demarcation problem

things can get more complicated though:

most scientific theories are actually based largely on probabilistic induction and

modern inductive inference (Solomonoff, frequentist vs Bayesian methods...)

the demarcation problem

A theory can be said to be scientific if it makes falsifiable predictions

Experiments should be designed to falsify the predictions

Key Concept

Reproducibility

Reproducible research means:

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.

https://acmedsci.ac.uk/viewFile/56314e40aac61.pdf

why?

assures a result is grounded in evidence

#openscience

#opendata

facilitates scientific progress by avoiding the need to duplicate unoriginal research

facilitates collaboration and teamwork

Reproducible research in practice:

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

Claerbout, J. 1990,

Active Documents and Reproducible Results, Stanford Exploration Project Report, 67, 139

provide raw data and code to reduce it to all stages needed to get outputs

provide code to reproduce all figures

provide code to reproduce all number outcomes

Reproducibility

A research product is reproducible if all numbers can be reproduced exactly by applying the same code to the same raw data.

It is the responsibility of the researcher to provide the data and code that make a research product reproducible

Key Concept
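A minimal sketch of what "down to stochastic variables" can look like in practice, assuming NumPy is the only source of randomness in the analysis: seeding the random number generator makes a stochastic calculation return the same number on every run. The function name `noisy_mean` and the seed value are illustrative choices, not part of the class material.

```python
import numpy as np

def noisy_mean(seed, n=1000):
    """Mean of n standard-normal draws from a generator with a fixed seed."""
    rng = np.random.default_rng(seed)  # seeded generator: same draws every run
    return rng.normal(size=n).mean()

first = noisy_mean(seed=664)
second = noisy_mean(seed=664)
print(first == second)  # same seed, same code, same number
```

Recording the seed in the notebook (together with the code and raw data) is what lets someone else recalculate the number exactly.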

MLTSA:

what are

time series

mlTSa: what is a time series?

Consider a dataset that is a time series

1D: exogenous-endogenous variable

time

y depends on x

the exogenous variable is sequential

time has a directionality:

y(t+1) depends on y(t)

you may have uncertainties...

A time series is any measurable quantity sampled at multiple points in time.

Time series are series of exogenous-endogenous variable pairs where the exogenous variable is time, and therefore it is a sequential quantity with a specific direction of evolution.

Key Concept

mlTSa: what is a time series?

Consider a dataset that is not a time series

https://colab.research.google.com/drive/1Rkis-2ZuqPuPpQYHK_QtrBLgBVCW1g2H

mlTSa: what is a time series?

Evenly vs Unevenly sampled time series.

Most statistical methods are developed for evenly sampled TS.

Most physical TS are unevenly sampled

time

mlTSa: what is a time series?

time

evenly: dt is constant

unevenly: dt changes
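A short sketch of how one might check the sampling of a time series, assuming the time stamps are available as a 1D array: the differences between consecutive time stamps (dt) are all equal for an evenly sampled series. The helper name `is_evenly_sampled` and the example arrays are made up for illustration.

```python
import numpy as np

def is_evenly_sampled(t, rtol=1e-8):
    """Return True if consecutive time stamps are separated by a constant dt."""
    dt = np.diff(np.asarray(t, dtype=float))
    return bool(np.allclose(dt, dt[0], rtol=rtol))

t_even = np.arange(0.0, 10.0, 0.5)              # dt = 0.5 everywhere
t_uneven = np.array([0.0, 0.4, 1.5, 1.7, 4.0])  # dt changes from point to point

print(is_evenly_sampled(t_even), is_evenly_sampled(t_uneven))
```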

Target measurements:

1) measuring intensity as a function of time

e.g.

- stellar photometry

- brain electrical activity

2) measuring arrival time

e.g.

- high-energy cosmic ray counting

- atomic decay counted with a Geiger counter

time

mlTSa: what is a time series?

time

https://en.wikipedia.org/wiki/Geiger_counter

MLTSA:

time series

analysis topics

mlTSa:

Statistical analysis on time-domain data

Trend detection


periodicity/seasonality detection

Flux


event detection


https://www.neurologic.theclinics.com/article/S0733-8619(05)00041-1/fulltext

The 3 behavioral states of wakefulness, rapid eye movement (REM) sleep, and non-REM (NREM) sleep are characterized by specific changes in electroencephalography,

Point of change detection

Longitudinal Employer-Household Dynamics

https://lehd.ces.census.gov/data/


seasonal variations,

cyclic variation,

periodicity

Forecasting/Prediction

https://www.climate.gov/news-features/understanding-climate/climate-change-global-sea-level

mlTSa:

Statistical analysis on time-domain data

Trend detection
Periodicity/seasonality detection
Event detection / Anomaly detection
Point of change detection
Forecasting/prediction
Classification


mlTSa:

Time Series Components


Missing data
Sparse measurements (the time interval is not regular)
Covariance of the measurements
Correlation of the errors, heteroscedastic errors

mlTSa:

Common issues

MLTSA:

Statistical analysis on 1D sequential data

It is also a series of exogenous-endogenous variable pairs where the exogenous variable ~~is time, and therefore it~~ is a sequential quantity with a specific direction of evolution.

A spectrum is a measurable quantity sampled at multiple points in wavelength.

MLTSA:

what is

machine learning

MLtsa:

what is machine learning?

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959


model

parameters: slope (a), intercept (b)

y = ax + b

Model:

a mathematical formula with parameters


y = ax + b

model variable: x - for us this will always be time


Data:

a set of observations


for every parameter there are an infinity of models


Use the data to learn the parameters of the model

MLtsa:

what is machine learning?

Machine Learning models are parametrized representations of "reality" where the parameters are learned from finite sets of realizations of that reality

Machine Learning is the discipline that conceptualizes, studies, and applies those models.

Key Concept


the best way to think about it in the ML context:

a model is a low dimensional representation of a higher dimensionality dataset
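As an illustration of "use the data to learn the parameters", here is a minimal sketch on synthetic data (the true slope and intercept, the noise level, and the random seed are all made-up values): 200 noisy observations are summarized by the two parameters of the line y = ax + b, which is the sense in which the model is a low dimensional representation of the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)                  # model variable: time
a_true, b_true = 2.0, -1.0                   # parameters used to simulate the data
y = a_true * t + b_true + rng.normal(scale=0.5, size=t.size)

# learn the parameters from the data: 200 observations summarized by 2 numbers
a_fit, b_fit = np.polyfit(t, y, deg=1)       # coefficients come highest power first
print(a_fit, b_fit)
```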

MLTSA:

Linear Regression

WHY?

Fitting a line

ax+b

to data y

https://desdemonadespair.net/2019/12/greenland-losing-ice-seven-times-faster-than-in-the-1990s-sea-level-rise-from-greenland-melt-tracking-highest-climate-projections.html


To predict and forecast

time (year)

Sea level contribution (mm)

Linear Regression

To explain

distance / age of the Universe

Universe's expansion rate

supernova (stellar explosion)

measure the expansion rate of the Universe as a function of time.

Deviations from linearity falsify an adiabatically expanding Universe

time (year)

https://desdemonadespair.net/2019/12/greenland-losing-ice-seven-times-faster-than-in-the-1990s-sea-level-rise-from-greenland-melt-tracking-highest-climate-projections.html

https://github.com/fedhere/DSPS/blob/master/HW6/SNdataLineFit_solution.ipynb


Key Concept

Model Fitting

We fit models to data in order to:

Predict and forecast: predict the value of the endogenous (dependent) variable at locations of the exogenous (independent, time) variable where we have no observations. This can be within the observed range, or outside of the range, which in time-series means predict the future (forecast)

Explain: relate observed behavior to first principles, or to the behavior of other variables, to explain the evolution and assess causality.

E.g. fitting a parabola to a bouncing ball demonstrates that gravity (and initial velocity) explains the behavior
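A sketch of the parabola example on simulated data, assuming a single arc of the bounce with made-up initial height, velocity, and measurement noise. Fitting h(t) = c2 t^2 + c1 t + c0 lets us read gravity and the initial velocity off the coefficients (explain), and evaluating the fitted model beyond the last observation is a forecast.

```python
import numpy as np

rng = np.random.default_rng(1)
g, v0, h0 = 9.81, 5.0, 1.0                   # assumed gravity (m/s^2), velocity (m/s), height (m)
t = np.linspace(0, 1, 50)                    # one arc of the bounce, in seconds
h = h0 + v0 * t - 0.5 * g * t**2 + rng.normal(scale=0.01, size=t.size)

# explain: physics says c2 = -g/2 and c1 = v0
c2, c1, c0 = np.polyfit(t, h, deg=2)
g_fit, v0_fit = -2 * c2, c1
print(g_fit, v0_fit)

# forecast: evaluate the fitted model outside the observed time range
h_future = np.polyval([c2, c1, c0], 1.1)
```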

MLTSA:

Linear Regression

analytical solution

6.1

Linear Regression

Normal Equation

It can be shown that the optimal parameters for a line fit to data without uncertainties are:

(X^T \cdot X)^{-1} \cdot X^T \cdot \vec{y} ~=~\left(\substack{b\\a}\right)

where the first column of X is all ones, so the intercept b comes first and the slope a second.

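import numpy as np

# grbAG: DataFrame loaded earlier; upper limits are excluded from the fit
# design matrix: a column of ones, then the log-time of the detections
X = np.c_[np.ones((len(grbAG) - grbAG.upperlimit.sum(), 1)),
	grbAG[grbAG.upperlimit == 0].logtime]
y = grbAG.loc[grbAG.upperlimit == 0].mag

# normal equation: theta_best holds (intercept, slope)
theta_best = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)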

Linear Regression

Normal Equation

X = \begin{pmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{N} \end{pmatrix}

Dimensions: X^T is 2xN and X is Nx2, so (X^T \cdot X)^{-1} is 2x2; X^T is 2xN and \vec{y} is Nx1, so the result is 2x1.
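A self-contained sketch of the normal equation on synthetic data, since the class dataframe is not available here (the slope, intercept, noise level, and seed are made-up values). It also checks the matrix result against NumPy's least-squares solver; the parameters come out in the order of the columns of X, so the intercept is first.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 5, 100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.3, size=x.size)  # slope 3, intercept 2

X = np.c_[np.ones_like(x), x]                # N x 2 design matrix: ones, then x
theta = np.linalg.inv(X.T @ X) @ X.T @ y     # (X^T X)^{-1} X^T y, shape (2,)
b, a = theta                                 # ones column first: intercept, then slope

# cross-check with NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a, b, np.allclose(theta, theta_lstsq))
```

In practice a solver such as `np.linalg.lstsq` is usually preferred to an explicit matrix inverse for numerical stability; the explicit form is shown because it mirrors the equation.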

We can let sklearn solve the equation for us:

import numpy as np
from sklearn.linear_model import LinearRegression

# keep only the detections (upper limits excluded)
detected = grbAG[grbAG.upperlimit == 0]
# sklearn fits the intercept itself: X holds only logtime, no column of ones
X = np.c_[detected.logtime]
y = detected.mag
lr = LinearRegression()
lr.fit(X, y)
lr.coef_, lr.intercept_  # slope a, intercept b


MLTSA:

Linear Regression

linear correlation

6.1

correlation

Pearson's correlation

r_{xy} = \frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)

Pearson's correlation measures linear correlation

\bar{x} : \mathrm{mean~value~of~}x\\ \bar{y} : \mathrm{mean~value~of~}y\\ n: \mathrm{number~of~datapoints}\\ s_x ~=~\sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}
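The formula can be written out directly and compared with NumPy's built-in; a minimal sketch on synthetic data (the helper name `pearson_r`, the seed, and the simulated relation are illustrative assumptions):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from standardized deviations, with sample standard deviations (ddof=1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = x.size
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (n - 1)

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)

r = pearson_r(x, y)
print(r, np.corrcoef(x, y)[0, 1])  # the two values should agree
```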


correlated

"positively" correlated


r_{xy} = 1~\mathrm{iff}~y=ax+b,~a>0\\ ~\mathrm{maximally~correlated}


anticorrelated

"negatively" correlated


r_{xy} = -1~\mathrm{iff}~y=-ax+b,~a>0\\ ~\mathrm{maximally~anticorrelated}


not linearly correlated

Pearson's coefficient = 0

does not mean that x and y are independent!
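A small sketch of this point, using a made-up example: y = x^2 on a grid that is symmetric around zero is completely determined by x, yet its Pearson coefficient is zero (up to floating-point error) because the relation is not linear.

```python
import numpy as np

x = np.linspace(-1, 1, 201)       # symmetric around zero
y = x**2                          # y is fully determined by x

r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-10)             # no linear correlation despite total dependence
```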

Pearson's correlation

r_{xy} = \frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)

\bar{x} : \mathrm{mean~value~of~}x\\ \bar{y} : \mathrm{mean~value~of~}y\\ n: \mathrm{number~of~datapoints}\\ s_x ~=~\sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}

Spearman's test

(Pearson's for ranked values)

\rho_{xy} = 1-\frac{6\sum_{i=1}^n(R(x_i) - R(y_i))^2}{n(n^2-1)}

R(x_i) and R(y_i) are the ranks of x_i and y_i; this form holds when there are no ties.
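A sketch comparing the two coefficients on a made-up monotonic but nonlinear relation (y = exp(x)): Spearman's coefficient, which only uses ranks, is 1, while Pearson's is lower because the relation is not a line. The rank-difference formula is evaluated by hand and compared with `scipy.stats.spearmanr`; it assumes there are no ties.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

x = np.linspace(0, 5, 50)
y = np.exp(x)                                 # monotonic but strongly nonlinear

r_pearson, _ = pearsonr(x, y)
rho_scipy, _ = spearmanr(x, y)

# same quantity from the rank-difference formula (valid when there are no ties)
d = rankdata(x) - rankdata(y)
n = x.size
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
print(r_pearson, rho_scipy, rho_formula)
```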

Correlation does not imply causality!!

2 things may be related because they share a cause but do not cause each other:

ice cream sales with temperature | deaths by drowning with temperature

In the era of big data you may encounter truly spurious correlations

divorce rate in Maine | consumption of Margarine

correlation

http://www.tylervigen.com/spurious-correlations

correlation

Pearson's correlation

r_{xy} = \frac{1}{n-1}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)

import pandas as pd
df = pd.read_csv(file_name)
df.corr()

\bar{x} : \mathrm{mean~value~of~}x\\ \bar{y} : \mathrm{mean~value~of~}y\\ n: \mathrm{number~of~datapoints}\\ s_x ~=~\sqrt{\frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})^2}

correlation

import pandas as pd
import pylab as pl

df = pd.read_csv(file_name)
corr = df.corr()

# correlation matrix as an image: red = anticorrelated, blue = correlated
pl.imshow(corr, clim=(-1, 1), cmap='RdBu')
pl.xticks(list(range(len(corr))), corr.columns, rotation=45)
pl.yticks(list(range(len(corr))), corr.columns, rotation=45)
pl.colorbar();

<- anticorrelated | correlated ->


MLTSA:

Regression

objective function

6.2

If there is no analytical solution

to select the "best" set of parameters we need a plan: we need to choose a function of the parameters to minimize or maximize

Objective Function

L_1 = \sum_{i=1}^N|f(x_i) - y_i|

L_2 = \sum_{i=1}^N(f(x_i) - y_i)^2

\chi^2 = \sum_{i=1}^N\frac{(f(x_i) - y_i)^2}{\sigma_i^2}

chi square: relates to the likelihood if the distribution is Gaussian

import numpy as np
from scipy.optimize import minimize

def line(x, b, a):
    return a * x + b

def fitfunc(args, x, y):
    # L2 objective: sum of squared residuals
    a, b = args
    return np.sum((y - line(x, b, a))**2)

x = grbAG.logtime.values
y = grbAG.mag.values
initialGuess = (10, 1)

fitfunc(initialGuess, x, y)
solution = minimize(fitfunc, initialGuess, args=(x, y))


import numpy as np
from scipy.optimize import minimize

def line(x, b, a):
    return a * x + b

def chi2(args, x, y, s):
    # chi square: squared residuals weighted by the variance s**2
    a, b = args
    return np.sum((y - line(x, b, a))**2 / s**2)

x = grbAG.logtime.values
y = grbAG.mag.values
s = grbAG.magerr.values
initialGuess = (10, 1)

chi2(initialGuess, x, y, s)
solution = minimize(chi2, initialGuess, args=(x, y, s))
solution
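The same chi square minimization can be tried without the class dataset. This is a sketch on synthetic data, where the true slope and intercept, the per-point uncertainties, and the seed are all made up, so the recovered parameters can be compared with the values used to simulate the data.

```python
import numpy as np
from scipy.optimize import minimize

def line(x, a, b):
    return a * x + b

def chi2(params, x, y, sigma):
    """Sum of squared residuals, each weighted by its variance sigma**2."""
    a, b = params
    return np.sum((y - line(x, a, b))**2 / sigma**2)

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 60)
sigma = rng.uniform(0.2, 1.0, size=x.size)       # per-point uncertainties (made up)
y = line(x, 1.5, 4.0) + rng.normal(scale=sigma)  # synthetic data: slope 1.5, intercept 4

solution = minimize(chi2, x0=(1.0, 1.0), args=(x, y, sigma))
a_fit, b_fit = solution.x
print(solution.success, a_fit, b_fit)
```

Points with small sigma pull the fit harder than points with large sigma, which is the difference between chi square and the unweighted L2.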

viz of the week

W.E.B. DuBois

W.E.B. Du Bois

February 23, 1868 – August 27, 1963

American sociologist, socialist, historian, civil rights activist, Pan-Africanist, author, writer and editor

https://inspirehep.net/record/1082448/plots

After graduating with a Ph.D. in history from Harvard University, W.E.B. Du Bois, the prominent African-American intellectual, sought a way to process all this information showing why the African diaspora in America was being held back in a tangible, contextualized form.

“The colorful charts, graphs, and maps presented at the 1900 Paris Exposition by famed sociologist and black rights activist W. E. B. Du Bois offered a view into the lives of black Americans, conveying a literal and figurative representation of 'the color line'.”

https://www.smithsonianmag.com/history/first-time-together-and-color-book-displays-web-du-bois-visionary-infographics-180970826/

W.E.B. Du Bois 1868-1963, sociologist, black rights activist, graphic designer ante litteram

“Du Bois was aware that while unmoving prose and dry presentations of charts and graphs might catch attention from specialists, this approach would not garner notice beyond narrow circles of academics,” Aldon Morris writes in the essay “American Negro at Paris, 1900.” “Such social science was useless to the liberation of oppressed peoples. Breaking from tradition, Du Bois was among the first great American public intellectuals whose reach extended beyond the academy to the masses.”

https://hyperallergic.com/476334/how-w-e-b-du-bois-meticulously-visualized-20th-century-black-america/

accurate graphic, but not impactful

accurate graphic, also not very impactful

"ineffective" use of space?

your visualization should be

- true to the data

- compelling

but compelling may mean different things for different audiences. W.E.B. Du Bois was using these graphs for advocacy, not scientific communication.

His plates are mathematically accurate, but also emotionally engaging, designed to develop an emotional response (empathy) in the audience.

https://colab.research.google.com/drive/14DymfcW8Qm6vfmC-mOOql-aljRvZ8Wng?usp=sharing

#DuBoisChallenge

https://nightingaledvs.com/the-dubois-challenge/?utm_campaign=SF%20Data%20Weekly&utm_medium=email&utm_source=Revue%20newsletter

Key Concepts

Reproducibility: A research product is reproducible if all numbers can be reproduced exactly by applying the same code to the same raw data. It is the responsibility of the researcher to provide the data and code that make a research product reproducible

What is special about time series? Time series are series of exogenous-endogenous variable pairs where the exogenous variable is time, and therefore it is a sequential quantity with a specific direction of evolution.

What is Machine Learning? Machine Learning models are parametrized representations of "reality" where the parameters are learned from finite sets of realizations of that reality. Machine Learning is the discipline that conceptualizes, studies, and applies those models.

Model selection: Choosing a model i.e. a mathematical formula which we expect to be a simplified representation of our observations.

Objective Functions and optimization: To find the best model parameters we define a function of the data and parameters f(data, parameters) to be minimized or maximized.

Model fitting: Determining the best set of parameters to fit the observations within a chosen model.

Homework

https://github.com/fedhere/MLTSA_FBianco/tree/main/HW1

Required reading

https://www.sigmacomputing.com/resources/learn/what-is-time-series-analysis

Additional

Reading

Data analysis recipes: Fitting a model to data

Intro and Chapter 1; pages 1-8

D. Hogg et al. https://arxiv.org/abs/1008.4686

Lots of details about how to properly treat outliers, uncertainties, and assumptions in fitting a line to data. Witty comments make it entertaining. The exercises make it very helpful

Key Concepts

Falsifiability: A theory can be said to be scientific if it makes falsifiable predictions. Experiments should be designed to falsify the predictions


References

8 ways to do linear regression and measure their speed

Data analysis recipes: Fitting a model to data

D. Hogg et al. https://arxiv.org/abs/1008.4686 - lots of details about how to properly treat outliers, uncertainties, and assumptions in fitting a line to data. Witty comments make it entertaining. The exercises make it very helpful

AstroML Chapter 10 - Intro

HOMLwSKLKerasTF Chapter 4 pages 111-117

Elements of Statistical Learning Chapter 3 Section 1 and 2

MLTSA_01 2025

By federica bianco

MLTSA_01 2025

intro to time series and regression
