Machine Learning for

Time Series Analysis VII

Machine Learning 101

Fall 2025 - UDel PHYS 661
dr. federica bianco 

 

@fedhere

MLTSA:

 

supervised vs unsupervised learning

1

what is machine learning?

classification

prediction

feature selection

supervised learning

understanding structure

organizing/compressing data

anomaly detection, dimensionality reduction

unsupervised learning


clustering

PCA

Apriori

k-Nearest Neighbors

Regression

Support Vector Machines

Classification/Regression Trees

Neural networks

 

observed features:

(x, y)

GOAL: partitioning data into maximally homogeneous, maximally distinguished subsets.

clustering is unsupervised learning 


all features are observed for all objects in the sample

(x, y)

how should I group the observations in this feature space?

e.g.: how many groups should I make?


clustering is unsupervised learning 

all features are observed for all objects in the sample

(x, y)

how should I group the observations in this feature space?

e.g.: how small can clusters get?


clustering is unsupervised learning 

unsupervised learning methods

(clustering)

find partitions of the space to discover structure

  • Don't need labels

  • There is generally no ground truth, so the conclusions depend heavily on the assumptions and strategic choices made

used to:

understand structure of feature space

 

clustering vs classifying


unsupervised

supervised

observed features:

(x, y)

models typically return a partition of the space

goal is to partition the space so that the unobserved variables are separated into groups consistently with an observed subset


target features:

(color)

clustering vs classifying


unsupervised

observed features:

(x, y)

ax+b

if y <= a*t + b :
	return blue
else:
	return orange

target features:

(color)

supervised


observed features:

(x, y)

clustering vs classifying

unsupervised

if t**2 + y**2 <= (t-a)**2 + (y-b)**2 :
	return blue
else:
	return orange

target features:

(color)

supervised
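The two toy decision rules above, written out as plain Python functions; this is a minimal sketch (not from the slides), where a, b and the labels "blue"/"orange" stand in for the quantities drawn in the figures.

def classify_by_line(t, y, a, b):
    # threshold on a line in the (t, y) plane: y = a*t + b
    if y <= a * t + b:
        return "blue"
    return "orange"

def assign_to_nearest_center(t, y, a, b):
    # compare squared distances to two reference points, (0, 0) and (a, b),
    # and return the label of the nearer one
    if t**2 + y**2 <= (t - a)**2 + (y - b)**2:
        return "blue"
    return "orange"

print(classify_by_line(1.0, 0.5, a=1.0, b=0.0))          # 'blue'
print(assign_to_nearest_center(1.0, 0.5, a=4.0, b=4.0))  # 'blue'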

supervised learning methods

(nearly all other methods you have heard of)

learns by example

  • Need labels, in some cases a lot of labels
  • Dependent on the definition of similarity

  • Similarity can be used in conjunction with parametric or non-parametric methods

used to:

classify, predict (regression)


observed features:

(x, y)

supervised ML: classification

Tree Methods

split spaces along each axis separately

A subset of observations has class labels. Guess the labels for the other observations

split along x

if t <= a :
	return blue
else:
	return orange

target features:

(color)


observed features:

(x, y)

Tree Methods

split spaces along each axis separately

A subset of observations has class labels. Guess the labels for the other observations

split along x

if t <= a:
    if y <= b:
        return blue
return orange

then

along y

target features:

(color)

supervised ML: classification


observed features:

(x, y)

A subset of observations has class labels. Guess the labels for the other observations

split along x

if x <= a:
    if y <= b:
        return blue
return orange

then

along y

target features:

(color)

supervised ML: classification

this makes it ideal when you have hybrid feature types (e.g. numerical vs categorical)

Tree Methods

split spaces along each axis separately

 

 

 

 

 

 

Tree Methods

example of supervised learning method

partitions feature space along each feature separately

 The good

  • Non-Parametric
  • White-box: can be easily interpreted
  • Works with any feature type and mixed feature types
  • Works with missing data
  • Robust to outliers

 

 

The bad

  • High variability (-> use ensemble methods)
  • Tendency to overfit
  • (not really easily interpretable after all...)
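A minimal scikit-learn sketch (toy data, not from the slides) of the white-box point above: the fitted tree can be printed as a readable sequence of axis-aligned splits.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# toy data: two blobs in the (t, y) plane with labels 0 and 1
X = np.vstack([rng.normal([0, 0], 1, size=(50, 2)),
               rng.normal([3, 3], 1, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)

tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)
print(export_text(tree, feature_names=["t", "y"]))   # human-readable splits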

MLTSA:

 

distance

2

distance metrics

continuous variables

Minkowski family of distances

D(i,j) = \sqrt[p]{|x_{i1}-x_{j1}|^p + |x_{i2}-x_{j2}|^p + \dots + |x_{iN}-x_{jN}|^p}


L1 is the Minkowski distance with p=1

 

L2 is the Minkowski distance with p=2

D(i,j) = \sum_k |x_{ik}-x_{jk}|

D(i,j) = \sqrt{\sum_k (x_{ik}-x_{jk})^2}

distance metrics

continuous variables

Minkowski family of distances

D(i,j) = \sqrt[p]{\sum_{k=1}^{N} |x_{ik}-x_{jk}|^p}

N features (dimensions)


properties:

D(i,j) \geq 0, \quad D(i,i) = 0, \quad D(i,j) = D(j,i), \quad D(i,j) \leq D(i,k) + D(k,j)


Manhattan: p=1

D_{Man}(i,j) = \sum_{k=1}^{N} |x_{ik}-x_{jk}|

features: x, y



Euclidean: p=2

D_{Euc}(i,j) = \sqrt{\sum_{k=1}^{N} |x_{ik}-x_{jk}|^2}

features: x, y



distance metrics

continuous variables

Minkowski family of distances


Residuals: 2, 2, 3

L1 = 2 + 2 + 3 = 7

L2 = \sqrt{2^2 + 2^2 + 3^2} = \sqrt{17} \approx 4.12
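A quick numerical check of the example above (residuals 2, 2, 3), with scipy's Minkowski distance as a cross-check; a sketch assuming numpy and scipy are available.

import numpy as np
from scipy.spatial import distance

# two points whose coordinate-wise residuals are 2, 2, 3
xi = np.array([0.0, 0.0, 0.0])
xj = np.array([2.0, 2.0, 3.0])

L1 = np.sum(np.abs(xi - xj))           # Manhattan, p=1 -> 7
L2 = np.sqrt(np.sum((xi - xj) ** 2))   # Euclidean, p=2 -> sqrt(17) ~ 4.12
print(L1, L2)
print(distance.minkowski(xi, xj, p=1), distance.minkowski(xi, xj, p=2))  # same values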

distance metrics


Great Circle distance

D(i,j) = R \arccos\left( \sin\phi_i \sin\phi_j + \cos\phi_i \cos\phi_j \cos\Delta\lambda \right)

features: \phi_i, \lambda_i, \phi_j, \lambda_j (latitude and longitude)

continuous variables
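A sketch of the great-circle formula above in Python, assuming latitude φ and longitude λ are given in degrees and R is the Earth radius in km.

import numpy as np

def great_circle(phi_i, lam_i, phi_j, lam_j, R=6371.0):
    # phi = latitude, lam = longitude, in degrees; R = sphere radius (km for Earth)
    phi_i, lam_i, phi_j, lam_j = np.radians([phi_i, lam_i, phi_j, lam_j])
    c = (np.sin(phi_i) * np.sin(phi_j)
         + np.cos(phi_i) * np.cos(phi_j) * np.cos(lam_j - lam_i))
    return R * np.arccos(np.clip(c, -1.0, 1.0))   # clip guards against round-off

print(great_circle(0.0, 0.0, 0.0, 180.0))   # antipodal points on the equator: ~20015 km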

distance metrics

categorical variables: binary

Uses presence/absence of features in the data:

M_{i=1,j=1}: number of features present in both i and j
M_{i=1,j=0}: number of features present in i but not in j
M_{i=0,j=1}: number of features present in j but not in i
M_{i=0,j=0}: number of features present in neither

                        observation j
                      1          0          sum
observation i   1   M11        M10        M11+M10
                0   M01        M00        M01+M00
              sum   M11+M01    M10+M00    M11+M10+M01+M00

distance metrics

categorical variables: binary

Simple Matching Coefficient (or Rand similarity):

SMC(i,j) = \frac{M_{i=0,j=0} + M_{i=1,j=1}}{M_{i=0,j=0} + M_{i=1,j=0} + M_{i=0,j=1} + M_{i=1,j=1}}

Simple Matching Distance:

SMD(i,j) = 1 - SMC(i,j)




distance metrics

categorical variables: binary

Jaccard similarity:

J(i,j) = \frac{M_{i=1,j=1}}{M_{i=0,j=1} + M_{i=1,j=0} + M_{i=1,j=1}}

Jaccard distance:

D(i,j) = 1 - J(i,j)

distance metrics

categorical variables: binary

In set terms, with A and B the sets of features present in observations i and j:

J(i,j) = \frac{|A \cap B|}{|A \cup B|}

Jaccard distance:

D(i,j) = 1 - J(i,j)
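A small sketch computing both distances defined above from two binary feature vectors (assumed to be 0/1 numpy arrays of equal length).

import numpy as np

def smd_and_jaccard(i, j):
    i, j = np.asarray(i, bool), np.asarray(j, bool)
    m11 = np.sum(i & j)      # features present in both
    m00 = np.sum(~i & ~j)    # features present in neither
    m10 = np.sum(i & ~j)     # in i but not in j
    m01 = np.sum(~i & j)     # in j but not in i
    smc = (m11 + m00) / (m11 + m00 + m10 + m01)
    jac = m11 / (m11 + m10 + m01)       # note: the 0-0 matches are excluded
    return 1 - smc, 1 - jac             # Simple Matching Distance, Jaccard distance

print(smd_and_jaccard([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # (0.4, 0.5)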

MLTSA:

 

distance in time series

2.1

distance in time series

simple time series distance: 

Euclidean point by point

 

D = \sqrt{\sum_k (y_{1,k} - y_{2,k})^2}

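In code the point-by-point Euclidean distance is one line of numpy; a sketch assuming the two series are sampled on the same time stamps.

import numpy as np

y1 = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
y2 = np.array([1.5, 2.5, 2.5, 2.0, 0.5])

D = np.sqrt(np.sum((y1 - y2) ** 2))   # Euclidean, point by point
print(D)   # 1.0 for these made-up series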

distance in time series

https://data.worldbank.org/indicator/SP.POP.TOTL

distance in time series

time series are vectors

example of distance metric that works on vectors: 

correlation coefficient r
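A sketch of using the correlation coefficient r between two time series (treated as vectors) as a similarity, with 1 - r as the corresponding distance; the series below are made up.

import numpy as np

t = np.linspace(0, 4 * np.pi, 200)
y1 = np.sin(t)
y2 = 3 * np.sin(t) + 0.5        # same shape, different scale and offset

r = np.corrcoef(y1, y2)[0, 1]   # Pearson correlation coefficient
print(r, 1 - r)                 # r ~ 1, so the distance 1 - r ~ 0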

 

 

distance in time series

how similar is similar? These time series are observations of the same phenomenon (eclipsing binaries), but they look different enough that it would be hard to write a classifier that recognizes them as the same class

distance in time series

what if the time series is shifted? These are identical time series shifted along the x axis. The correlation coefficient would be low, though!

distance in time series

what if the time series is stretched? These are identical time series, but the top one is stretched. Similarly, the correlation coefficient would be low.

distance in time series

One possible approach: DTW algorithm

Dynamic Time Warping  (future class maybe?)

distance in time series

Another approach: learn features from data and measure distance between those

standard deviation, min-max range, peaks in frequency space, ... (see the sketch below)
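A sketch (on a made-up noisy sine) of extracting a few such features: the standard deviation, the min-max range, and the strongest peak in frequency space.

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 500)
y = np.sin(2 * np.pi * 1.3 * t) + 0.3 * rng.normal(size=t.size)

features = {"std": np.std(y), "min_max": np.max(y) - np.min(y)}

# dominant frequency: peak of the power spectrum, ignoring the zero-frequency term
freqs = np.fft.rfftfreq(y.size, d=t[1] - t[0])
power = np.abs(np.fft.rfft(y)) ** 2
features["peak_freq"] = freqs[1:][np.argmax(power[1:])]

print(features)   # peak_freq comes out near 1.3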

MLTSA:

 

CART

classification and regression trees

3

4.1

single tree

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender
  • ticket class
  • age

 

 

target variable:

    ->  survival (y/n)  ​

gender (binary)

M

 Ns=93 Nd=360

F

Ns=197 Nd=64


optimize over purity:

 

p = \frac{N_{\rm largest\ class}}{N_{\rm total\ set}}

M: p = \frac{360}{360+93} \approx 79\% \qquad F: p = \frac{197}{197+64} \approx 75\%

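The purity numbers above can be checked in a couple of lines (a sketch using the counts from the slide):

def purity(n_survived, n_died):
    # fraction of the node that belongs to its largest class
    return max(n_survived, n_died) / (n_survived + n_died)

print(purity(93, 360))   # males:   360/453 ~ 0.79
print(purity(197, 64))   # females: 197/261 ~ 0.75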

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender 79%|75%
  • ticket class 66% | 54%
  • age

 

 

target variable:

    ->  survival (y/n)  ​

1st

Ns=120 Nd=80

2nd + 3rd

Ns=234 Nd=298

p = 66\%

p = 54\%

class (ordinal)

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender 79%|75%
  • ticket class 66% | 54%
  • age 66% | 61%

 

 

target variable:

    ->  survival (y/n)  ​

age (continuous)

>6.5

Ns=250 Nd=107

<=6.5

Ns=139 Nd=217

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

target variable:

    ->  survival (y/n)  ​

gender (binary)

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

 features:

  • gender 79|75%
  • ticket class M 60|85% F 96|65%
  • age M 74|67% F 66|60%


(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

target variable:

    ->  survival (y/n)  ​

gender

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

age

 

>6.5

Ns=250 Nd=107

<=6.5

Ns=139 Nd=217

class

1st + 2nd

Ns=120 Nd=80

3rd

Ns=234 Nd=298

 features:

  • gender 79|75%
  • ticket class M 60|85% F 96|65%
  • age M 74|67% F 66|60%

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

target variable:

    ->  survival (y/n)  ​

gender

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

age

>6.5

Ns=250 Nd=107

p=82%

<=6.5

Ns=139 Nd=217

p=67%

class

age

>2.5

Ns=1 Nd=1

p=50%

<=2.5

Ns=8 Nd=139

p=95%

age

>38.5

Ns=44 Nd=46

<=38.5

Ns=11 Nd=1

1st + 2nd

Ns=120 Nd=80

3rd

Ns=234 Nd=298

 features:

  • gender 79|75%
  • ticket class M 60|85% F 96|65%
  • age M 74|67% F 66|60%

(Kaggle)

Application:

a robot to predict surviving the Titanic

714 passengers Ns=290 Nd=424

 features:

  • gender (binary already used)
  • ticket class (ordinal)
  • age (continuous) 

 

 
 

 

target variable:

    ->  survival (y/n)  ​

gender

M

 Ns=93 Nd=360

F

Ns=197 Nd=64

>6.5

Ns=250 Nd=107

p=82%

<=6.5

Ns=139 Nd=217

p=67%

1st + 2nd

Ns=120 Nd=80

3rd

Ns=234 Nd=298

1st

Ns=100 Nd=20

p=80%

2nd

Ns=40 Nd=40

p=50%

age

>38.5

Ns=44 Nd=46

<=38.5

Ns=11 Nd=1

class

age

class

A single tree

nodes

(make a decision)

root node

branches

(split off of a node)

leaves (last groups)

tree hyperparameters

gini impurity

I_G(p) = 1 - \sum_{i=1}^{J} p_i^2

information gain (entropy)

H(T) = -\sum_{i=1}^{J} p_i \log_2 p_i
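A sketch of the two split criteria as Python functions, where p is the vector of class fractions in a node:

import numpy as np

def gini(p):
    # Gini impurity: 1 - sum_i p_i^2
    p = np.asarray(p, float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy in bits: -sum_i p_i log2(p_i), with 0 log 0 taken as 0
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.79, 0.21]), entropy([0.79, 0.21]))   # impure node
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))       # pure node: both are 0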

A single tree: hyperparameters

depth

max depth = 2

PREVENTS OVERFITTING

alternative: tree pruning
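A scikit-learn sketch (toy data, not the Titanic set) of the max depth hyperparameter; limiting the depth is one simple way to keep a single tree from overfitting (compare training vs test scores for the two settings).

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)   # noisy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, None):   # shallow tree vs fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
# the unconstrained tree scores ~1 on the training set but does worse on the test set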

4.2

regression with trees

CART: Classification and Regression Trees

Trees can be used for regression 

(think of it as classification into very many small classes)
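A sketch of tree regression on a toy series: a DecisionTreeRegressor approximates the curve with a piecewise-constant fit, i.e. a prediction drawn from "very many small classes" of y values.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

t = np.linspace(0, 2 * np.pi, 200)[:, None]   # time as the single feature
y = np.sin(t).ravel()

reg = DecisionTreeRegressor(max_depth=3).fit(t, y)
y_hat = reg.predict(t)             # piecewise-constant approximation of sin(t)
print(np.mean((y - y_hat) ** 2))   # small mean squared error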

4.3

tree ensembles

issues with trees

 

variance:

 different trees lead to different results

why?

because calculating the criterion for every possible split at every node is an intractable problem!

e.g. 2 continuous variables would make it a problem of order

\infty^2


solution

run many trees and take an "ensemble" decision!

 

Random Forests: a bunch of parallel trees

Gradient Boosted Trees: a series of trees

tree ensemble methods

Gradient boosted trees:

trees run in series (one after the other)

each tree is trained to correct the errors of the previous trees, effectively reweighting the data from one tree to the next

the final prediction combines the whole sequence of trees

 

Random forest:

trees run in parallel (independently of each other)

each tree uses a random subset of observations/features (bootstrap aggregating, or bagging)

class predicted by majority vote:

what class do most trees think a point belongs to
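A scikit-learn sketch of the two ensemble flavors on made-up data (hypothetical features, not the Titanic set): a random forest of parallel bagged trees and a gradient-boosted series of trees.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = ((X[:, 0] + X[:, 1] ** 2 + 0.3 * rng.normal(size=500)) > 1).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)       # trees in parallel (bagging)
boosted = GradientBoostingClassifier(n_estimators=200, random_state=0)  # trees in series

for model in (forest, boosted):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())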

4.4

feature importance

In principle, CART methods are interpretable:

you can measure the influence that each feature has on the decision: feature importance

In practice, the interpretation is complicated by covariance among the features
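A sketch of reading feature importances off a fitted forest (toy data with hypothetical feature names); with correlated features the numbers should be interpreted with care, as noted above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(0, 80, n)
fare = rng.uniform(0, 100, n)
noise = rng.normal(size=n)
X = np.column_stack([age, fare, noise])
y = (age < 10).astype(int)      # in this toy example the outcome depends only on age

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(["age", "fare", "noise"], forest.feature_importances_):
    print(name, round(importance, 3))   # "age" dominates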

MLTSA:

 

Extraction of features

3

Consider a classification task:

if I want to use machine learning methods I need to choose:

use raw representation:

e.g. clustering:

1) take each time series and standardize it (mean 0, standard deviation 1)

2) at each time stamp, compare the values to the expected value (mean and standard deviation)

essentially each datapoint is treated as a feature

Consider a classification task:

if I want to use machine learning methods (e.g. clustering) I need to choose:

use raw representation

1) take each time series and standardize it (μ=0 ; σ=1). 

2) at each time stamp, compare the values to the expected value (μ and σ)

 

problems:

  1. scalability: for N time series of length d the dataset has dimension N × d
  2. time series may be asynchronous
  3. time series may be warped

(in small datasets you can optimize over warping and shifting, but in large datasets this solution is computationally limited)

essentially each datapoint is treated as a feature
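A sketch of the raw representation: standardize each series to mean 0 and standard deviation 1, so that every one of the d time stamps becomes a feature (the N x d array below is made up).

import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 100                                         # N time series of length d
X = rng.normal(0.1, 1.0, size=(N, d)).cumsum(axis=1)   # made-up raw series (random walks)

# standardize each series: mean 0, standard deviation 1
Xz = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# the N x d standardized matrix is what a clustering method would ingest,
# with each time stamp acting as one feature
print(Xz.shape, Xz.mean(axis=1).round(6).max(), Xz.std(axis=1).round(6).max())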

Consider a classification task:

if I want to use machine learning methods (e.g. clustering) I need to choose:

choose a low dimensional representation


Extract features that describe the time series:

simple descriptive statistics (look at the distribution of points, regardless of the time evolution):

  • mean
  • standard deviation
  • other moments (skewness, kurtosis)

parametric features (based on fitting a model to the data; see the sketch below):

  • slope of a line fit
  • intercept of a line fit
  • best ARMA parameters
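A sketch (on a made-up series) of both kinds of features: descriptive statistics of the distribution of points, and the slope and intercept of a straight-line fit (the ARMA parameters are left out here).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
t = np.arange(200)
y = 0.05 * t + np.sin(0.3 * t) + rng.normal(scale=0.5, size=t.size)   # toy series

slope, intercept = np.polyfit(t, y, deg=1)   # parametric: straight-line fit
features = {
    "mean": np.mean(y),
    "std": np.std(y),
    "skewness": stats.skew(y),
    "kurtosis": stats.kurtosis(y),
    "slope": slope,
    "intercept": intercept,
}
print(features)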

Consider a classification task:

the learned representations should:

  • preserve the pairwise similarities and serve as feature vectors for machine learning methods;
  • lower bound the comparison function to accelerate similarity search;
  • allow using prefixes of the representations (by ranking their coordinates in descending order of importance) for scaling methods under limited resources;
  • support efficient and memory-tractable computation for new data to enable operations in online settings; and
  • support efficient and memory-tractable eigendecomposition of the data-to-data similarity matrix to exploit highly effective methods that rely on such a cornerstone operation.

resources

 

 

a comprehensive review of clustering methods

Data Clustering: A Review, Jain, Murty, Flynn 1999

https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

 

a blog post on how to generate and interpret a scipy dendrogram by Jörn Hees
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/

clustering

a model that uses clustering to generate features http://www.vldb.org/pvldb/vol12/p1762-paparrizos.pdf


key concepts

 

Clustering: unsupervised learning where all features are observed for all datapoints. The goal is to partition the space into maximally homogeneous, maximally distinguished groups

clustering is easy, but interpreting results is tricky

Distance: a definition of distance is required to group observations / partition the space.

Common distances over continuous variables:

  • Minkowski (includes Euclidean = Minkowski with p=2)
  • Great Circle (for coordinates on a sphere, e.g. earth or sky)

Common distances over categorical variables:

  • Simple Matching Distance
  • Jaccard Distance

Clustering: k-means - pros: it's efficient and intuitive; cons: only works with Euclidean distance, and you need to know the number of clusters in advance

 

hierarchical: cons: less efficient; pros: provides the full clustering tree (the dendrogram)

resources

 

 

http://what-when-how.com/artificial-intelligence/decision-tree-applications-for-data-modelling-artificial-intelligence/

 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4466856/

 

CART

key concepts

 

Machine Learning includes models that learn parameters from data

ML models have parameters learned from the data and hyperparameters assigned by the user.

Unsupervised learning:

  • all variables observed for all data points
  • learns the structure of the features space from the data
  • predicts a label (group membership) based on the similarity of all features

Supervised learning:

  • a target feature is observed only for a subset of the data
  • learns target feature for data where it is not observed based on similarity of the other features
  • predicts a class/value for each datum without observed label 

Tree methods:

  • partition the space one feature at a time with binary choices
  • prone to overfitting
  • can be used for regression 

Distributed and parallel time series feature extraction for industrial big data applications

Maximilian Christ, Andreas W. Kempa-Liehr, Michael Feindt

https://arxiv.org/pdf/1610.07717.pdf

 

TL;DR:

https://towardsdatascience.com/time-series-feature-extraction-for-industrial-big-data-iiot-applications-5243c84aaf0e

 

resources

 

Feature extraction from time series

Reading

 

Sections 1-2

Homework

Reproduce the class lab

Visualization of the week

U.S. Population Pyramids

MLTSA_07 2025

By federica bianco