Machine Learning for

Time Series Analysis VII

Machine Learning 101

Fall 2025 - UDel PHYS 661
dr. federica bianco 

 

@fedhere

MLTSA:

 

supervised vs unsupervised learning

1

what is machine learning?

classification

prediction

feature selection

supervised learning

understanding structure

organizing/compressing data

anomaly detection, dimensionality reduction

unsupervised learning


clustering

PCA

Apriori

k-Nearest Neighbors

Regression

Support Vector Machines

Classification/Regression Trees

Neural networks

 

observed features:

(x, y)

GOAL: partitioning data into maximally homogeneous, maximally distinguished subsets.

clustering is unsupervised learning 


all features are observed for all objects in the sample

(x, y)

how should I group the observations in this feature space?

e.g.: how many groups should I make?


clustering is unsupervised learning 

all features are observed for all objects in the sample

(x, y)

how should I group the observations in this feature space?

e.g.: how small can clusters get?


clustering is unsupervised learning 

unsupervised learning methods

(clustering)

find partitions of the space to discover structure

  • Don't need labels

  • There is generally no ground truth, so the conclusions depend heavily on the assumptions and strategic choices made

used to:

understand structure of feature space

 

clustering vs classifying


unsupervised

supervised

observed features:

(x, y)

models typically return a partition of the space

goal is to partition the space so that the unobserved variables are separated into groups consistently with an observed subset


target features:

(color)

clustering vs classifying


unsupervised

observed features:

(x, y)

ax+b

if y <= a*t + b :
	return blue
else:
	return orange

target features:

(color)

supervised


observed features:

(x, y)

clustering vs classifying

unsupervised

if t**2 + y**2 <= (t-a)**2 + (y-b)**2 :
	return blue
else:
	return orange

target features:

(color)

supervised
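The two toy decision rules above, written out as plain Python functions; this is a minimal sketch (not from the slides), where a, b and the labels "blue"/"orange" stand in for the quantities drawn in the figures.

def classify_by_line(t, y, a, b):
    # threshold on a line in the (t, y) plane: y = a*t + b
    if y <= a * t + b:
        return "blue"
    return "orange"

def assign_to_nearest_center(t, y, a, b):
    # compare squared distances to two reference points, (0, 0) and (a, b),
    # and return the label of the nearer one
    if t**2 + y**2 <= (t - a)**2 + (y - b)**2:
        return "blue"
    return "orange"

print(classify_by_line(1.0, 0.5, a=1.0, b=0.0))          # 'blue'
print(assign_to_nearest_center(1.0, 0.5, a=4.0, b=4.0))  # 'blue'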

supervised learning methods

(nearly all other methods you have heard of)

learns by example

  • Need labels, in some cases a lot of labels
  • Dependent on the definition of similarity

  • Similarity can be used in conjunction with parametric or non-parametric methods

used to:

classify, predict (regression)


observed features:

(x, y)

supervised ML: classification

Tree Methods

split spaces along each axis separately

A subset of observations has class labels. Guess the labels for the other observations

split along x

if t <= a :
	return blue
else:
	return orange

target features:

(color)


observed features:

(x, y)

Tree Methods

split spaces along each axis separately

A subset of observations has class labels. Guess the labels for the other observations

split along x

if t <= a:
    if y <= b:
        return blue
return orange

then

along y

target features:

(color)

supervised ML: classification


observed features:

(x, y)

A subset of observations has class labels. Guess the labels for the other observations

split along x

if x <= a:
    if y <= b:
        return blue
return orange

then

along y

target features:

(color)

supervised ML: classification

this makes it ideal when you have hybrid feature types (e.g. numerical vs categorical)

Tree Methods

split spaces along each axis separately

 

 

 

 

 

 

Tree Methods

example of supervised learning method

partitions feature space along each feature separately

 The good

  • Non-Parametric
  • White-box: can be easily interpreted
  • Works with any feature type and mixed feature types
  • Works with missing data
  • Robust to outliers

 

 

The bad

  • High variability (-> use ensemble methods)
  • Tendency to overfit
  • (not really easily interpretable after all...)
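A minimal scikit-learn sketch (toy data, not from the slides) of the white-box point above: the fitted tree can be printed as a readable sequence of axis-aligned splits.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# toy data: two blobs in the (t, y) plane with labels 0 and 1
X = np.vstack([rng.normal([0, 0], 1, size=(50, 2)),
               rng.normal([3, 3], 1, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)

tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)
print(export_text(tree, feature_names=["t", "y"]))   # human-readable splits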

MLTSA:

 

distance

2

distance metrics

continuous variables

Minkowski family of distances

D(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \dots + |x_{iN}-x_{jN}|^p}


L1 is the Minkowski distance with p=1

 

L2 is the Minkowski distance with p=2

D(i,j) = \sum_k |x_{ik}-x_{jk}|

D(i,j) = \sqrt{\sum_k (x_{ik}-x_{jk})^2}

distance metrics

continuous variables

Minkowski family of distances

D(i,j) = \sqrt[p]{\sum_{k=1}^{N} |x_{ik}-x_{jk}|^p}

N features (dimensions)


properties:

D(i,j) \geq 0, \quad D(i,i) = 0, \quad D(i,j) = D(j,i), \quad D(i,j) \leq D(i,k) + D(k,j)


Manhattan: p=1

D_{Man}(i,j) = \sum_{k=1}^{N} |x_{ik}-x_{jk}|

features: x, y



Euclidean: p=2

D_{Euc}(i,j) = \sqrt{\sum_{k=1}^{N} |x_{ik}-x_{jk}|^2}

features: x, y



distance metrics

continuous variables

Minkowski family of distances


Residuals: 2, 2, 3

L1 = 2 + 2 + 3 = 7

L2 = \sqrt{2^2 + 2^2 + 3^2} = \sqrt{17} \approx 4.12
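A quick numerical check of the example above (residuals 2, 2, 3), with scipy's Minkowski distance as a cross-check; a sketch assuming numpy and scipy are available.

import numpy as np
from scipy.spatial import distance

# two points whose coordinate-wise residuals are 2, 2, 3
xi = np.array([0.0, 0.0, 0.0])
xj = np.array([2.0, 2.0, 3.0])

L1 = np.sum(np.abs(xi - xj))           # Manhattan, p=1 -> 7
L2 = np.sqrt(np.sum((xi - xj) ** 2))   # Euclidean, p=2 -> sqrt(17) ~ 4.12
print(L1, L2)
print(distance.minkowski(xi, xj, p=1), distance.minkowski(xi, xj, p=2))  # same values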

distance metrics


Great Circle distance

D(i,j) = R \arccos\left( \sin\phi_i \sin\phi_j + \cos\phi_i \cos\phi_j \cos\Delta\lambda \right)

features: \phi_i, \lambda_i, \phi_j, \lambda_j (latitude and longitude)

continuous variables
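A sketch of the great-circle formula above in Python, assuming latitude φ and longitude λ are given in degrees and R is the Earth radius in km.

import numpy as np

def great_circle(phi_i, lam_i, phi_j, lam_j, R=6371.0):
    # phi = latitude, lam = longitude, in degrees; R = sphere radius (km for Earth)
    phi_i, lam_i, phi_j, lam_j = np.radians([phi_i, lam_i, phi_j, lam_j])
    c = (np.sin(phi_i) * np.sin(phi_j)
         + np.cos(phi_i) * np.cos(phi_j) * np.cos(lam_j - lam_i))
    return R * np.arccos(np.clip(c, -1.0, 1.0))   # clip guards against round-off

print(great_circle(0.0, 0.0, 0.0, 180.0))   # antipodal points on the equator: ~20015 km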

distance metrics

categorical variables: binary

Uses presence/absence of features in the data:

M_{i=1,j=1}: number of features present in both i and j
M_{i=1,j=0}: number of features present in i but not in j
M_{i=0,j=1}: number of features present in j but not in i
M_{i=0,j=0}: number of features present in neither

                        observation j
                      1          0          sum
observation i   1   M11        M10        M11+M10
                0   M01        M00        M01+M00
              sum   M11+M01    M10+M00    M11+M10+M01+M00

distance metrics

categorical variables: binary

Simple Matching Coefficient (or Rand similarity):

SMC(i,j) = \frac{M_{i=0,j=0} + M_{i=1,j=1}}{M_{i=0,j=0} + M_{i=1,j=0} + M_{i=0,j=1} + M_{i=1,j=1}}

Simple Matching Distance:

SMD(i,j) = 1 - SMC(i,j)




distance metrics

categorical variables: binary

Jaccard similarity:

J(i,j) = \frac{M_{i=1,j=1}}{M_{i=0,j=1} + M_{i=1,j=0} + M_{i=1,j=1}}

Jaccard distance:

D(i,j) = 1 - J(i,j)

distance metrics

categorical variables: binary

In set terms, with A and B the sets of features present in observations i and j:

J(i,j) = \frac{|A \cap B|}{|A \cup B|}

Jaccard distance:

D(i,j) = 1 - J(i,j)
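A small sketch computing both distances defined above from two binary feature vectors (assumed to be 0/1 numpy arrays of equal length).

import numpy as np

def smd_and_jaccard(i, j):
    i, j = np.asarray(i, bool), np.asarray(j, bool)
    m11 = np.sum(i & j)      # features present in both
    m00 = np.sum(~i & ~j)    # features present in neither
    m10 = np.sum(i & ~j)     # in i but not in j
    m01 = np.sum(~i & j)     # in j but not in i
    smc = (m11 + m00) / (m11 + m00 + m10 + m01)
    jac = m11 / (m11 + m10 + m01)       # note: the 0-0 matches are excluded
    return 1 - smc, 1 - jac             # Simple Matching Distance, Jaccard distance

print(smd_and_jaccard([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # (0.4, 0.5)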

MLTSA:

 

distance in time series

2.1

distance in time series

simple time series distance: 

Euclidean point by point

 

D = \sqrt{\sum_k (y_{1,k} - y_{2,k})^2}

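In code the point-by-point Euclidean distance is one line of numpy; a sketch assuming the two series are sampled on the same time stamps.

import numpy as np

y1 = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
y2 = np.array([1.5, 2.5, 2.5, 2.0, 0.5])

D = np.sqrt(np.sum((y1 - y2) ** 2))   # Euclidean, point by point
print(D)   # 1.0 for these made-up series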

distance in time series

https://data.worldbank.org/indicator/SP.POP.TOTL

distance in time series

time series are vectors

example of distance metric that works on vectors: 

correlation coefficient r
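A sketch of using the correlation coefficient r between two time series (treated as vectors) as a similarity, with 1 - r as the corresponding distance; the series below are made up.

import numpy as np

t = np.linspace(0, 4 * np.pi, 200)
y1 = np.sin(t)
y2 = 3 * np.sin(t) + 0.5        # same shape, different scale and offset

r = np.corrcoef(y1, y2)[0, 1]   # Pearson correlation coefficient
print(r, 1 - r)                 # r ~ 1, so the distance 1 - r ~ 0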

 

 

distance in time series

how similar is similar? These time series are observations of the same phenomenon (eclipsing binaries), but they look different enough that it would be hard to write a classifier that recognizes them as the same class

distance in time series

what if the time series is shifted? These are identical time series shifted along the x axis. The correlation coefficient would be low, though!

distance in time series

what if the time series is stretched? These are identical time series, but the top one is stretched. Similarly, the correlation coefficient would be low.

distance in time series

One possible approach: DTW algorithm

Dynamic Time Warping  (future class maybe?)

distance in time series

Another approach: learn features from data and measure distance between those

standard deviation, min-max range, peaks in frequency space, ... (see the sketch below)
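A sketch (on a made-up noisy sine) of extracting a few such features: the standard deviation, the min-max range, and the strongest peak in frequency space.

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 500)
y = np.sin(2 * np.pi * 1.3 * t) + 0.3 * rng.normal(size=t.size)

features = {"std": np.std(y), "min_max": np.max(y) - np.min(y)}

# dominant frequency: peak of the power spectrum, ignoring the zero-frequency term
freqs = np.fft.rfftfreq(y.size, d=t[1] - t[0])
power = np.abs(np.fft.rfft(y)) ** 2
features["peak_freq"] = freqs[1:][np.argmax(power[1:])]

print(features)   # peak_freq comes out near 1.3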

MLTSA:

 

CART

classification and regression trees

3

4.1

single tree

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender
  • ticket class
  • age

 

 

target variable:

    ->  survival (y/n)  ​

gender (binary)

M

 Ns=93 Nd=360

F

Ns=197 Nd=64


optimize over purity:

 

p = \frac{N_{\rm largest\ class}}{N_{\rm total\ set}}

M: p = \frac{360}{360+93} \approx 79\% \qquad F: p = \frac{197}{197+64} \approx 75\%

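The purity numbers above can be checked in a couple of lines (a sketch using the counts from the slide):

def purity(n_survived, n_died):
    # fraction of the node that belongs to its largest class
    return max(n_survived, n_died) / (n_survived + n_died)

print(purity(93, 360))   # males:   360/453 ~ 0.79
print(purity(197, 64))   # females: 197/261 ~ 0.75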

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender 79%|75%
  • ticket class 66% | 54%
  • age

 

 

target variable:

    ->  survival (y/n)  ​

1st

Ns=120 Nd=80

2nd + 3rd

Ns=234 Nd=298

p = 66\%

p = 54\%

class (ordinal)

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender 79%|75%
  • ticket class 66% | 54%
  • age 66% | 61%

 

 

target variable:

    ->  survival (y/n)  ​

age (continuous)

>6.5

Ns=250 Nd=107

<=6.5

Ns=139 Nd=217

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

target variable:

    ->  survival (y/n)  ​

gender (binary)

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

 features:

  • gender 79|75%
  • ticket class M 60|85% F 96|65%
  • age M 74|67% F 66|60%


(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

target variable:

    ->  survival (y/n)  ​

gender

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

age

 

>6.5

Ns=250 Nd=107

<=6.5

Ns=139 Nd=217

class

1st + 2nd

Ns=120 Nd=80

3rd

Ns=234 Nd=298

 features:

  • gender 79|75%
  • ticket class M 60|85% F 96|65%
  • age M 74|67% F 66|60%

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

target variable:

    ->  survival (y/n)  ​

gender

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

age

>6.5

Ns=250 Nd=107

p=82%

<=6.5

Ns=139 Nd=217

p=67%

class

age

>2.5

Ns=1 Nd=1

p=50%

<=2.5

Ns=8 Nd=139

p=95%

age

>38.5

Ns=44 Nd=46

<=38.5

Ns=11 Nd=1

1st + 2nd

Ns=120 Nd=80

3rd

Ns=234 Nd=298

 features:

  • gender 79|75%
  • ticket class M 60|85% F 96|65%
  • age M 74|67% F 66|60%

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender (binary already used)
  • ticket class (ordinal)
  • age (continuous) 

 

 
 

 

target variable:

    ->  survival (y/n)  ​

gender

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

>6.5

Ns=250 Nd=107

p=82%

<=6.5

Ns=139 Nd=217

p=67%

1st + 2nd

Ns=120 Nd=80

3rd

Ns=234 Nd=298

1st

Ns=100 Nd=20

p=80%

2nd

Ns=40 Nd=40

p=50%

age

>38.5

Ns=44 Nd=46

<=38.5

Ns=11 Nd=1

class

age

class

A single tree

nodes

(make a decision)

root node

branches

(split off of a node)

leaves (last groups)

tree hyperparameters

gini impurity

I_G(p) = 1 - \sum_{i=1}^{J} p_i^2

information gain (entropy)

H(T) = -\sum_{i=1}^{J} p_i \log_2 p_i
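A sketch of the two split criteria as Python functions, where p is the vector of class fractions in a node:

import numpy as np

def gini(p):
    # Gini impurity: 1 - sum_i p_i^2
    p = np.asarray(p, float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy in bits: -sum_i p_i log2(p_i), with 0 log 0 taken as 0
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.79, 0.21]), entropy([0.79, 0.21]))   # impure node
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))       # pure node: both are 0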

A single tree: hyperparameters

depth

max depth = 2

PREVENTS OVERFITTING

alternative: tree pruning
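A scikit-learn sketch (toy data, not the Titanic set) of the max depth hyperparameter; limiting the depth is one simple way to keep a single tree from overfitting (compare training vs test scores for the two settings).

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)   # noisy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, None):   # shallow tree vs fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
# the unconstrained tree scores ~1 on the training set but does worse on the test set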

4.2

regression with trees

CART: Classification and Regression Trees

Trees can be used for regression 

(think of it as classification into very many small classes)
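A sketch of tree regression on a toy series: a DecisionTreeRegressor approximates the curve with a piecewise-constant fit, i.e. a prediction drawn from "very many small classes" of y values.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

t = np.linspace(0, 2 * np.pi, 200)[:, None]   # time as the single feature
y = np.sin(t).ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(t, y)
y_hat = reg.predict(t)             # piecewise-constant approximation of sin(t)
print(np.mean((y - y_hat) ** 2))   # small mean squared error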

4.3

tree ensembles

issues with trees

 

variance:

 different trees lead to different results

why?

because calculating the criterion for every possible split at every node is an intractable problem!

e.g. 2 continuous variables would make it a problem of order

\infty^2


solution

run many trees and take an "ensemble" decision!

 

Random Forests: a bunch of parallel trees

Gradient Boosted Trees: a series of trees

tree ensemble methods

Gradient boosted trees:

trees run in series (one after the other)

each tree is trained to correct the errors of the previous trees, effectively reweighting the data from one tree to the next

the final prediction combines the whole sequence of trees

 

Random forest:

trees run in parallel (independently of each other)

each tree uses a random subset of observations/features (bootstrap aggregating, or bagging)

class predicted by majority vote:

what class do most trees think a point belongs to
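A scikit-learn sketch of the two ensemble flavors on made-up data (hypothetical features, not the Titanic set): a random forest of parallel bagged trees and a gradient-boosted series of trees.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = ((X[:, 0] + X[:, 1] ** 2 + 0.3 * rng.normal(size=500)) > 1).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)       # trees in parallel (bagging)
boosted = GradientBoostingClassifier(n_estimators=200, random_state=0)  # trees in series

for model in (forest, boosted):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())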

4.4

feature importance

In principle, CART methods are interpretable:

you can measure the influence that each feature has on the decision: feature importance

In practice, the interpretation is complicated by covariance among the features
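A sketch of reading feature importances off a fitted forest (toy data with hypothetical feature names); with correlated features the numbers should be interpreted with care, as noted above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(0, 80, n)
fare = rng.uniform(0, 100, n)
noise = rng.normal(size=n)
X = np.column_stack([age, fare, noise])
y = (age < 10).astype(int)      # in this toy example the outcome depends only on age

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(["age", "fare", "noise"], forest.feature_importances_):
    print(name, round(importance, 3))   # "age" dominates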

MLTSA:

 

Extraction of features

3

Consider a classification task:

if I want to use machine learning methods I need to choose:

use raw representation:

e.g. clustering:

1) take each time series and standardize it (mean 0, standard deviation 1)

2) at each time stamp, compare the values to the expected value (mean and standard deviation)

essentially each datapoint is treated as a feature

Consider a classification task:

if I want to use machine learning methods (e.g. clustering) I need to choose:

use raw representation

1) take each time series and standardize it (μ=0 ; σ=1). 

2) at each time stamp, compare the values to the expected value (μ and σ)

 

problems:

  1. scalability: for N time series of length d the dataset has dimension N × d
  2. time series may be asynchronous
  3. time series may be warped

(in small datasets you can optimize over warping and shifting, but in large datasets this solution is computationally limited)

essentially each datapoint is treated as a feature
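A sketch of the raw representation: standardize each series to mean 0 and standard deviation 1, so that every one of the d time stamps becomes a feature (the N x d array below is made up).

import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 100                                         # N time series of length d
X = rng.normal(0.1, 1.0, size=(N, d)).cumsum(axis=1)   # made-up raw series (random walks)

# standardize each series: mean 0, standard deviation 1
Xz = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# the N x d standardized matrix is what a clustering method would ingest,
# with each time stamp acting as one feature
print(Xz.shape, Xz.mean(axis=1).round(6).max(), Xz.std(axis=1).round(6).max())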

Consider a classification task:

if I want to use machine learning methods (e.g. clustering) I need to choose:

choose a low dimensional representation


Extract features that describe the time series:

simple descriptive statistics (look at the distribution of points, regardless of the time evolution):

  • mean
  • standard deviation
  • other moments (skewness, kurtosis)

parametric features (based on fitting a model to the data; see the sketch below):

  • slope of a line fit
  • intercept of a line fit
  • best ARMA parameters
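A sketch (on a made-up series) of both kinds of features: descriptive statistics of the distribution of points, and the slope and intercept of a straight-line fit (the ARMA parameters are left out here).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
t = np.arange(200)
y = 0.05 * t + np.sin(0.3 * t) + rng.normal(scale=0.5, size=t.size)   # toy series

slope, intercept = np.polyfit(t, y, deg=1)   # parametric: straight-line fit
features = {
    "mean": np.mean(y),
    "std": np.std(y),
    "skewness": stats.skew(y),
    "kurtosis": stats.kurtosis(y),
    "slope": slope,
    "intercept": intercept,
}
print(features)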

Consider a classification task:

the learned representations should:

  • preserve the pairwise similarities and serve as feature vectors for machine learning methods;
  • lower bound the comparison function to accelerate similarity search;
  • allow using prefixes of the representations (by ranking their coordinates in descending order of importance) for scaling methods under limited resources;
  • support efficient and memory-tractable computation for new data to enable operations in online settings; and
  • support efficient and memory-tractable eigendecomposition of the data-to-data similarity matrix to exploit highly effective methods that rely on such a cornerstone operation.

resources

 

 

a comprehensive review of clustering methods

Data Clustering: A Review, Jain, Murty, Flynn 1999

https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

 

a blog post on how to generate and interpret a scipy dendrogram by Jörn Hees
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/

clustering

a model that uses clustering to generate features http://www.vldb.org/pvldb/vol12/p1762-paparrizos.pdf


key concepts

 

Clustering: unsupervised learning where all features are observed for all datapoints. The goal is to partition the space into maximally homogeneous, maximally distinguished groups

clustering is easy, but interpreting results is tricky

Distance: a definition of distance is required to group observations / partition the space.

Common distances over continuous variables:

  • Minkowski (includes Euclidean = Minkowski with p=2)
  • Great Circle (for coordinates on a sphere, e.g. earth or sky)

Common distances over categorical variables:

  • Simple Matching Distance
  • Jaccard Distance

Clustering: k-means - pros: it's efficient and intuitive; cons: only works with Euclidean distance, and you need to know the number of clusters in advance

 

hierarchical: cons: less efficient; pros: provides the full clustering tree (the dendrogram)

resources

 

 

http://what-when-how.com/artificial-intelligence/decision-tree-applications-for-data-modelling-artificial-intelligence/

 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4466856/

 

CART

key concepts

 

Machine Learning includes models that learn parameters from data

ML models have parameters learned from the data and hyperparameters assigned by the user.

Unsupervised learning:

  • all variables observed for all data points
  • learns the structure of the features space from the data
  • predicts a label (group membership) based on the similarity of all features

Supervised learning:

  • a target feature is observed only for a subset of the data
  • learns target feature for data where it is not observed based on similarity of the other features
  • predicts a class/value for each datum without observed label 

Tree methods:

  • partition the space one feature at a time with binary choices
  • prone to overfitting
  • can be used for regression 

Distributed and parallel time series feature extraction for industrial big data applications

Maximilian Christ, Andreas W. Kempa-Liehr, Michael Feindt

https://arxiv.org/pdf/1610.07717.pdf

 

TL;DR:

https://towardsdatascience.com/time-series-feature-extraction-for-industrial-big-data-iiot-applications-5243c84aaf0e

 

resources

 

Feature extraction from time series

Reading

 

Sections 1-2

Homework

Reproduce the class lab

Visualization of the week

U.S. Population Pyramids

MLTSA_07 2025

By federica bianco