Machine Learning for Time Series Analysis III

Linear Regression and Decomposition

Spring 2025 - UDel PHYS 664
dr. federica bianco

@fedhere

MLTSA:

class tools

python github google-colab stackoverflow

1

github

Reproducible research means:

 

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

reproducibility

allows reproducibility through code distribution

github

the Git software

is a distributed version control system:

a version of the files on your local computer is also made available at a central server.

The history of the files is saved remotely so that any version (that was checked in) is retrievable.

version control

allows version control

github

collaboration tool

by fork, fork and pull request, or by working directly as a collaborator

collaborative platform

allows effective collaboration

If something is wrong with your repo I will communicate that with an Issue - please attend to your issues promptly


MLTSA:

Scientific method

Science Guiding Principles

2

epistemology: 

 

the philosophy of science and of the scientific method

My proposal is based upon an asymmetry between verifiability and falsifiability; an asymmetry which results from the logical form of universal statements. For these are never derivable from singular statements, but can be contradicted by singular statements.

the demarcation problem:

what is science? what is not?

a scientific theory must be  falsifiable


the demarcation problem

Example: Einstein's General Relativity (model) predicts that light rays are deflected by mass (prediction).

Data: if the position of a star changes during an eclipse, the observation does not falsify the prediction and GR still holds; if the position of the star does not change, the observation falsifies the prediction and GR is rejected.
is astrology a science?

DISCUSS!

things can get more complicated though:

most scientific theories are actually based largely on probabilistic induction and modern inductive inference (Solomonoff induction, frequentist vs Bayesian methods...)

 

the demarcation problem

A theory can be said to be scientific if it makes falsifiable predictions

Experiments should be designed to falsify the predictions

 

Key Concept

Reproducibility

Reproducible research means:

 

the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.

why?

  • assures a result is grounded in evidence (#openscience #opendata)
  • facilitates scientific progress by avoiding the need to duplicate unoriginal research
  • facilitates collaboration and teamwork

Reproducible research in practice:

all numbers in a data analysis can be recalculated exactly (down to stochastic variables!) using the code and raw data provided by the analyst.

  • provide the raw data and the code to reduce it at all stages needed to get the outputs
  • provide code to reproduce all figures
  • provide code to reproduce all numerical outcomes

Key Concept

Reproducibility

A research product is reproducible if all numbers can be reproduced exactly by applying the same code to the same raw data.

It is the responsibility of the researcher to provide the data and code that make a research product reproducible.
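Since the definition above includes stochastic variables, one ingredient in practice is seeding the random number generators, so that a re-run of the analysis returns identical numbers. A minimal sketch (assuming numpy; the seed value is arbitrary):

import numpy as np

# seed the random number generator so stochastic results are exactly reproducible
rng = np.random.default_rng(seed=664)

data = rng.normal(loc=0.0, scale=1.0, size=1000)           # simulated "raw data"
bootstrap_mean = rng.choice(data, size=1000, replace=True).mean()

# anyone re-running this script with the same seed recovers the same numbers
print(data.mean(), bootstrap_mean)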

MLTSA:

Linear Regression

Recap

3

Linear Regression

WHY?

Fitting a line  ax+b  to data y

To predict and forecast

[plot: sea level contribution (mm) vs. time (year)]

Linear Regression

To explain

supernovae (stellar explosions) are used to measure the expansion rate of the Universe as a function of time.

Deviations from linearity falsify an adiabatically expanding Universe.

[plot: distance vs. age of the Universe / time (year); the slope traces the Universe's expansion rate]


Key Concept

Model Fitting

We fit models to data in order to:

 

Predict and forecast: predict the value of the endogenous (dependent) variable at locations of the exogenous (independent, e.g. time) variable where we have no observations. This can be within the observed range, or outside of it, which for time series means predicting the future (forecasting).

Explain: relate the observed behavior to first principles or to the behavior of other (possibly unobserved) variables, to explain the evolution and assess causality.

E.g. fitting a parabola to the trajectory of a bouncing ball demonstrates that gravity (plus the initial velocity) explains the behavior.
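A minimal sketch of "predict vs. forecast" (synthetic data; the numbers and names are placeholders): fit a line to the observed part of a series and evaluate it both inside and beyond the observed range.

import numpy as np

# synthetic sea-level-like series: linear trend plus noise
rng = np.random.default_rng(0)
t = np.arange(2000, 2020)                        # observed years
y = 3.2 * (t - 2000) + rng.normal(0, 2, t.size)  # mm

a, b = np.polyfit(t, y, deg=1)                   # fit y = a*t + b

print(np.polyval([a, b], 2010))   # prediction inside the observed range
print(np.polyval([a, b], 2030))   # forecast beyond the observed range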

MLTSA:

Recap LR

analytical solution

3.0

Linear Regression

Normal Equation

It can be shown that the optimal parameters for a line fit to data without uncertainties are:

(X^T \cdot X)^{-1} \cdot X^T \cdot \vec{y} ~=~\left(\substack{b\\a}\right)

import numpy as np

# design matrix: a column of ones (for the intercept) and the log-time of the detections
# (rows flagged as upper limits are excluded)
X = np.c_[np.ones((len(grbAG) - grbAG.upperlimit.sum(), 1)),
          grbAG[grbAG.upperlimit == 0].logtime]
y = grbAG.loc[grbAG.upperlimit == 0].mag

# normal equation: theta_best = (X^T X)^-1 X^T y  ->  (intercept b, slope a)
theta_best = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

where the design matrix contains a column of ones (for the intercept) and the exogenous variable:

X = \begin{pmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{m} \end{pmatrix}

shapes: X^T (2xN), X (Nx2), X^T (2xN), \vec{y} (Nx1)  ->  result (2x1)

We can let sklearn solve the equation for us:

from sklearn.linear_model import LinearRegression
import numpy as np

lr = LinearRegression()            # fits the intercept by default

X = np.c_[np.ones((len(grbAG) - grbAG.upperlimit.sum(), 1)),
          grbAG[grbAG.upperlimit == 0].logtime]
y = grbAG.loc[grbAG.upperlimit == 0].mag

lr.fit(X, y)
# the slope is in lr.coef_ and the intercept in lr.intercept_
# (the column of ones is redundant here: LinearRegression already fits an
#  intercept, so that column gets a coefficient of ~0)
lr.coef_, lr.intercept_

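As a self-contained check (synthetic data with known slope and intercept, instead of the grbAG data used above), the normal equation, np.polyfit, and sklearn should all recover the same line:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 0.5, x.size)   # true slope 2.5, intercept 1.0

X = np.c_[np.ones_like(x), x]                    # design matrix [1, x]
b, a = np.linalg.inv(X.T @ X) @ X.T @ y          # normal equation: (intercept, slope)

a_np, b_np = np.polyfit(x, y, 1)                 # polyfit returns (slope, intercept)
lr = LinearRegression().fit(x.reshape(-1, 1), y)

print(a, b)
print(a_np, b_np)
print(lr.coef_[0], lr.intercept_)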

MLTSA:

Recap LR

Linear Correlation

3.1

correlation

Pearson's correlation

r_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)

Pearson's correlation measures linear correlation

\bar{x} : \mathrm{mean~value~of~}x\\ \bar{y} : \mathrm{mean~value~of~}y\\ n: \mathrm{number~of~datapoints}\\ s_x ~=~\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
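A minimal sketch (synthetic data) computing the formula above directly and checking it against numpy:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)
r = ((x - x.mean()) / sx * (y - y.mean()) / sy).sum() / (n - 1)

print(r, np.corrcoef(x, y)[0, 1])   # the two values agree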

correlation

correlated

"positively" correlated

r_{xy} = 1~\mathrm{iff}~y = ax + b,~a>0~~\mathrm{(maximally~correlated)}

correlation

anticorrelated

"negatively" correlated

r_{xy} = -1~\mathrm{iff}~y = -ax + b,~a>0~~\mathrm{(maximally~anticorrelated)}

correlation

not linearly correlated

 

Pearson's coefficient = 0

 

does not mean that x and y are independent! 

correlation

Spearman's test

(Pearson's correlation computed on the ranked values)

\rho_{xy} = 1-\frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \qquad d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i)
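A minimal sketch (synthetic data) contrasting the two coefficients: for a monotonic but non-linear relation Spearman stays at 1 while Pearson drops below 1, and for a symmetric quadratic relation Pearson is near 0 even though y depends entirely on x:

import numpy as np
from scipy import stats

x = np.linspace(-3, 3, 500)

y_mono = np.exp(x)   # monotonic but non-linear in x
y_quad = x**2        # fully determined by x, but not monotonic

print(stats.pearsonr(x, y_mono)[0], stats.spearmanr(x, y_mono)[0])  # Pearson < 1, Spearman = 1
print(stats.pearsonr(x, y_quad)[0], stats.spearmanr(x, y_quad)[0])  # both ~ 0, yet x and y are not independent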

Correlation does not imply causation!!

Two things may be related because they share a cause without causing each other:

ice cream sales and deaths by drowning both correlate with temperature.

In the era of big data you may also encounter truly spurious correlations:

divorce rate in Maine | consumption of margarine

correlation

Pearson's correlation of all pairs of columns in a pandas DataFrame:

import pandas as pd
import matplotlib.pylab as pl

df = pd.read_csv(file_name)
df.corr()                                     # correlation matrix of all numerical columns

# visualize the correlation matrix
pl.imshow(df.corr(), clim=(-1, 1), cmap='RdBu')
pl.xticks(list(range(len(df.corr()))), df.columns, rotation=45)
pl.yticks(list(range(len(df.corr()))), df.columns, rotation=45)
pl.colorbar();

<- anticorrelated | correlated ->
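A self-contained variant of the same idea (synthetic columns instead of reading a file; the column names are placeholders):

import numpy as np
import pandas as pd
import matplotlib.pylab as pl

rng = np.random.default_rng(8)
a = rng.normal(size=300)
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.3, size=300),    # correlated with a
                   'c': -a + rng.normal(scale=0.3, size=300),   # anticorrelated with a
                   'd': rng.normal(size=300)})                  # uncorrelated

print(df.corr())

pl.imshow(df.corr(), clim=(-1, 1), cmap='RdBu')
pl.xticks(range(len(df.columns)), df.columns)
pl.yticks(range(len(df.columns)), df.columns)
pl.colorbar()
pl.show()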


MLTSA:

Regression

objective function

3.2

If there is no analytical solution

to select the "best" set of parameters we need a plan: we need to choose a function of the parameters to minimize or maximize

Objective Function

[three panels: data vs. time, with candidate best-fit lines A, B, C, D]

which is the "best fit" line? A, B, C, or D?

If there is no analytical solution

Objective Function

to select the best fit parameters we define a function of the parameters to minimize or maximize, e.g.:

L_1 = \sum_{i=1}^N|f(x_i) - y_i|

L_2 = \sum_{i=1}^N(f(x_i) - y_i)^2

\chi^2 = \sum_{i=1}^N\frac{(f(x_i) - y_i)^2}{\sigma_i^2}

chi square: relates to the likelihood if the noise distribution is Gaussian

If there is no analytical solution

Objective Function

minimizing the L2 objective (sum of squared errors) with scipy:

from scipy.optimize import minimize

def line(x, b, a):
    return a * x + b

def fitfunc(args, x, y):
    # L2 objective (sum of squared errors) as a function of the parameters
    a, b = args
    return sum((y - line(x, b, a))**2)

x = grbAG.logtime.values
y = grbAG.mag.values
initialGuess = (10, 1)

fitfunc(initialGuess, x, y)                          # value of the loss at the initial guess
solution = minimize(fitfunc, initialGuess, args=(x, y))

If there is no analytical solution

Objective Function

minimizing the chi square (the uncertainties are taken into account) with scipy:

from scipy.optimize import minimize

def line(x, b, a):
    return a * x + b

def chi2(args, x, y, s):
    # chi-square objective: squared residuals weighted by the squared uncertainties
    a, b = args
    return sum((y - line(x, b, a))**2 / s**2)

x = grbAG.logtime.values
y = grbAG.mag.values
s = grbAG.magerr.values
initialGuess = (10, 1)

chi2(initialGuess, x, y, s)                          # value of the loss at the initial guess
solution = minimize(chi2, initialGuess, args=(x, y, s))
solution
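A self-contained sketch of the same idea (synthetic data with a known slope and intercept, instead of the grbAG data): minimize recovers parameters close to the truth.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 100)
sigma = 0.5 * np.ones_like(x)
y = 2.0 * x + 5.0 + rng.normal(0, sigma)           # true a = 2.0, b = 5.0

def chi2(args, x, y, s):
    a, b = args
    return np.sum((y - (a * x + b))**2 / s**2)

solution = minimize(chi2, (1.0, 0.0), args=(x, y, sigma))
print(solution.x)                                  # ~ [2.0, 5.0]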

Optimizing the Objective Function

assume a simpler line model   y = ax

(b = 0) so we only need to find the "best" parameter a

[plot: loss as a function of a; the minimum (optimal) loss is at a = 4]

How do we find the minimum if we do not know beforehand what the SSE curve looks like?

3.3

stochastic gradient descent (SGD)

what is machine learning?

the algorithm: Stochastic Gradient Descent (SGD)

  • start at a random point in the parameter space
  • calculate the loss
  • figure out which direction in parameter space makes the loss smaller
  • change the parameters in that direction
  • recalculate the loss
  • take smaller steps the closer you are to the minimum

the algorithm: Stochastic Gradient Descent

assume a simpler line model   y = ax

(b = 0) so we only need to find the "best" parameter a

1. choose an initial value for a = a0

2. calculate the loss at a0 : SSE(a0)

3. calculate the best direction to go to decrease the SSE

4. step in that direction

5. go back to step 2 and repeat

[plot: loss as a function of a; successive steps walk downhill toward the minimum (optimal) loss at a = 4]

the algorithm: Stochastic Gradient Descent

for a line model   y = ax + b

we need to find the "best" parameters a and b

1. choose initial values for a & b

2. calculate SSE(a0, b0)

3. calculate the best direction to go to decrease the SSE

4. step in that direction

5. go back to step 2 and repeat

the algorithm: Stochastic Gradient Descent

Things to consider:

-  local vs. global minima: the loss surface may have several local minima besides the global minimum

-  initialization: how do you choose the starting spot?

-  stopping criterion: when to stop?

-  learning rate: how far to step?

To avoid getting stuck in local minima:

Stochastic Gradient Descent (SGD): use a different (random) sub-sample of the data at each iteration

also: try different starting points and run multiple minimizations (computationally expensive)

also: sometimes also go uphill (Monte Carlo methods)

The gradient is the slope of a line tangential to a point on a curve: take the gradient of the SSE and step in proportion to it.

\mathrm{Gradient~descent:}~~ a_{i+1} = a_i - \eta \,\nabla_{\!a}\, \mathrm{SSE}\big|_{a=a_i} \qquad (\eta:~\mathrm{learning~rate})

Adaptive learning rate: fast early on, slow later. Very common with Neural Networks
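A minimal sketch of (full-batch) gradient descent for the one-parameter model y = ax (synthetic data; swap in a random sub-sample of the points at each iteration to make it stochastic):

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 100)
y = 4.0 * x + rng.normal(0, 1.0, x.size)       # true a = 4

def sse(a, x, y):
    return np.sum((y - a * x)**2)

def grad_sse(a, x, y):
    # analytic gradient of the SSE with respect to a
    return -2.0 * np.sum((y - a * x) * x)

a = -1.0                                       # 1. initial value
eta = 1e-4                                     # learning rate
for i in range(100):                           # 5. repeat
    a = a - eta * grad_sse(a, x, y)            # 3.-4. step against the gradient

print(a, sse(a, x, y))                         # a ~ 4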

Cross validation

train + test:

train the parameters on the training set

run only once on the test set to assess the model performance

train + validation + test:

train the parameters on the training set

adjust hyperparameters on the validation set

run only once on the test set to assess the model performance

k-fold cross validation: split the data into k folds; each fold is used once as the validation set while the remaining k-1 folds are used for training
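A minimal sketch using sklearn's k-fold splitter (synthetic data; 5 folds assumed):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 100)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    lr = LinearRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    scores.append(lr.score(X[val_idx], y[val_idx]))           # validate on the held-out fold

print(np.mean(scores))   # average R^2 across folds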

MLTSA:

Time Series Components

viz of the week

stacked area chart
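A minimal matplotlib sketch of a stacked area chart (synthetic components; the labels are placeholders):

import numpy as np
import matplotlib.pyplot as plt

t = np.arange(2000, 2021)
rng = np.random.default_rng(7)
# three hypothetical components that add up to the total of a time series
components = [np.abs(rng.normal(10, 2, t.size)) + i * 5 for i in range(3)]

plt.stackplot(t, components, labels=['component A', 'component B', 'component C'])
plt.xlabel('time (year)')
plt.legend(loc='upper left')
plt.show()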

Key Concepts

Reproducibility: A research product is reproducible if all numbers can be reproduced exactly by applying the same code to the same raw data. It is the responsibility of the researcher to provide the data and code that make a research product reproducible.

What is Machine Learning? Machine Learning models are parametrized representations of "reality"  where the parameters are learned from finite sets of realizations of that reality. Machine Learning is the discipline that conceptualizes, studies, and applies those models.

Model selection: Choosing a model, i.e. a mathematical formula that we expect to be a simplified representation of our observations.

 

Objective Functions and optimization: To find the best model parameters we define a function of the data and parameters f(data, parameters) to be minimized or maximized.

Model fitting: Determining the best set of parameters to fit the observations within a chosen model.

Linear Regression as your first ML step: fitting a linear model (line or polynomial) to data is an approachable data analysis method that reveals _trends_.

Homework

https://github.com/fedhere/MLTSA_FBianco/tree/main/HW2

 

 

 

https://www.chi2innovations.com/blog/discover-stats-blog-series/graphs-prove-correlation-not-causation/

Correlation is not Causation

reading

References

Data analysis recipes: Fitting a model to data

D. Hogg et al. https://arxiv.org/abs/1008.4686 - lots of details about how to properly treat outliers, uncertainties, and assumptions when fitting a line to data. Witty comments make it entertaining. The exercises make it very helpful.

AstroML Chapter 10 -  Intro

HOMLwSKLKerasTF Chapter 4 pages 111-117

 

Elements of Statistical Learning Chapter 3 Section 1 and 2

Required reading

https://www.sigmacomputing.com/resources/learn/what-is-time-series-analysis

 

 

 

Additional

Reading

Data analysis recipes: Fitting a model to data

Intro and Chapter 1; pages 1-8

D. Hogg et al. https://arxiv.org/abs/1008.4686 

Lots of details about how to properly treat outliers, uncertainties, and assumptions when fitting a line to data. Witty comments make it entertaining. The exercises make it very helpful.

 

Key Concepts

Falsifiability: A theory can be said to be scientific if it makes falsifiable predictions. Experiments should be designed to falsify the predictions.

Reproducibility: A research product is reproducible if all numbers can be reproduced exactly by applying the same code to the same raw data. It is the responsibility of the researcher to provide the data and code that make a research product reproducible.

What is special about time series? Time series are series of exogenous-endogenous variable pairs where the exogenous variable is time; the data are therefore sequential, with a specific direction of evolution.

What is Machine Learning? Machine Learning models are parametrized representations of "reality"  where the parameters are learned from finite sets of realizations of that reality. Machine Learning is the discipline that conceptualizes, studies, and applies those models.

Objective Functions and optimization: To find the best model parameters we define a function of the data and parameters f(data, parameters) to be minimized or maximized.

Model fitting: Determining the best set of parameters to fit the observations within a chosen model.


MLTSA_03 2025

By federica bianco

Linear Regression and Decomposition