Principles of Urban Science

 

NLP and LLMs:

Transformers

 

dr. federica bianco 

 

@fedhere

ML model performance

0


Attention is all you need

Encoder + Decoder architecture

Encoder: encodes the past

Decoder: decodes the past, predicts the future

The cat that ate was full

     v1   v2   v3   v4   v5   v6
k1   1    0.1  0.1  0.1  0.1  0.1
k2   0.2  1    0.1  0.6  0.8  0.2
k3   0.1  0.1  1    0.2  0.1  0.2
k4   0.6  0.7  0.1  1    0.5  0.9
k5   0.1  0.9  0.1  0.3  1    0.1
k6   0.1  0.5  0.3  0.7  0.3  1

encodes the past

a stack of N = 6 identical layers each with

(1) a multi-head self-attention mechanism,

(2) a positionwise fully connected feed-forward NN

ML model performance

LR = _____________________________

                        H0 is True                       H0 is False
H0 is falsified         Type I Error (False Positive)    True Positive
H0 is not falsified     True Negative                    Type II Error (False Negative)

False Positive: important message spammed

False Negative: spam in your inbox

Accuracy, Recall, Precision

ML model performance

Accuracy, Recall, Precision

\mathrm{Precision} = \frac{TP}{TP~+~FP}
\mathrm{Recall} = \frac{TP}{TP~+~FN}
\mathrm{Accuracy} = \frac{TP~+~TN}{TP~+~TN~+~FP~+~FN}

TP=True Positive

FP=False Positive

TN=True Negative

FN=False Negative
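
The three definitions above can be checked numerically. The sketch below is only an illustration: the function name and the confusion-matrix counts are hypothetical and not taken from the lecture.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy


# Hypothetical counts, chosen only for illustration
precision, recall, accuracy = classification_metrics(tp=40, fp=10, tn=30, fn=20)
print(precision, recall, accuracy)
```

Note that the function will divide by zero if a classifier never predicts the positive class (TP + FP = 0); a real implementation would need to decide what to report in that case.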

ML model performance

Sensitivity Specificity

 

True Positive Rate = TP / All Positive Labels

 

Sensitivity = TPR = Recall

 

False Positive Rate = FP / All Negative Labels

 

Specificity = TN / All Negative Labels = 1 - FPR

 

 

ML model performance

F score

β is a factor indicating how much more important recall is than precision. For example, if we consider recall to be twice as important as precision, we can set β to 2. The standard F-score (F1) is equivalent to setting β to one.
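
For reference, the usual definition of the F score with the weighting factor β is written out below. The formula itself does not survive in the slide text, so this is the standard textbook form rather than a transcription.

```latex
F_\beta = (1 + \beta^2)\,
  \frac{\mathrm{Precision} \cdot \mathrm{Recall}}
       {\beta^2\,\mathrm{Precision} + \mathrm{Recall}},
\qquad
F_1 = \frac{2\,\mathrm{Precision} \cdot \mathrm{Recall}}
           {\mathrm{Precision} + \mathrm{Recall}}
```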


Class Imbalance

Current classifier accuracy: 50%

Precision: 0.8

Recall: 0.5

Specificity: 0.5

Sensitivity: 0.5

Class Imbalance

Current classifier accuracy: 80%

Precision: 0.8

Recall: 1

Specificity: 0

Sensitivity: 1
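
The two sets of numbers above can be reproduced with a small sketch. The dataset is an assumption: 10 samples, 8 positive and 2 negative, which is consistent with the quoted values but is not stated in the slides. The second classifier is assumed to label everything as positive.

```python
def rates(tp, fp, tn, fn):
    """Accuracy, precision, recall (= sensitivity) and specificity."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # recall is the same as sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical imbalanced dataset: 8 positives, 2 negatives
first = rates(tp=4, fp=1, tn=1, fn=4)            # finds half of each class
always_positive = rates(tp=8, fp=2, tn=0, fn=0)  # labels everything positive

print(first)
print(always_positive)
```

Under these assumptions the "always positive" classifier has the higher accuracy even though its specificity is zero, which is why accuracy alone is misleading when classes are imbalanced.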

Receiver operating characteristic

 

along the curve, the classifier probability threshold t is what changes

 

 

 

{\rm class} = i~{\rm if} ~p_i > t
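
A minimal sketch of how the points along the ROC curve arise from sweeping the threshold t, assuming a binary problem and using NumPy. The labels, scores and threshold values are made up for illustration.

```python
import numpy as np


def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold t, with class = 1 if p > t."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    points = []
    for t in thresholds:
        predicted = scores > t
        tp = np.sum(predicted & y_true)
        fp = np.sum(predicted & ~y_true)
        tpr = tp / y_true.sum()        # TP / all positive labels
        fpr = fp / (~y_true).sum()     # FP / all negative labels
        points.append((float(fpr), float(tpr)))
    return points


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.0, 0.3, 0.5, 0.9])
print(points)
```

Each threshold gives one point: a low t labels almost everything positive (top right of the curve), a high t labels almost nothing positive (bottom left).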

Receiver operating characteristic

[Figure: ROC curves, with regions labeled GOOD and BAD]

tuning by changing hyperparameters

1

what is

Natural Language

1. Computers only know numbers, not words

2. Language's constituent elements are words

3. Meaning depends on words, how they are combined, and on the context

That is great!

That is not great...

2

NLP preprocessing

tokenization and parsing: splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

** we will see how it's done

Word tokenization and embedding and contextualizing

vector embedding (768)

<by> o <river> -> small

near orthogonal embedding, low similarity vectors, no strong relation between tokens

<river> o <bank> -> large

near parallel embedding, high similarity vectors, strong relation between tokens
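
The "small" versus "large" product between token embeddings can be illustrated with cosine similarity. This is a toy sketch: the 3-dimensional vectors are invented for the example, whereas the embeddings discussed here have 768 dimensions and come from a trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Dot product of u and v, normalized by their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical embeddings, invented for illustration only
by = np.array([1.0, 0.0, 0.1])
river = np.array([0.0, 1.0, 0.2])
bank = np.array([0.1, 0.9, 0.3])

sim_by_river = cosine_similarity(by, river)      # near orthogonal: small
sim_river_bank = cosine_similarity(river, bank)  # near parallel: large
print(sim_by_river, sim_river_bank)
```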

lemmatization/stemming :  reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. 

 

am, are, is --> be

dog, dogs, dog's --> dog

part-of-speech tagging:  marking up a word in a text (corpus) as corresponding to a particular part of speech

 

language detection:  automatically detecting which language is used

3

NLP

descriptive tasks

Statistical properties of the "corpus"

how many characters

how many words

how many sentences

how many paragraphs

how many proper names

how often each proper name appears

 

Statistical properties of the "corpus"

Content categorization

search and indexing

content alerts and duplication detection

plagiarism detection

4

NLP

predictive tasks

AI and ML supported NLP tasks

  • Topic discovery and modeling. Capture the meaning and themes in text collections (associated tasks: optimization and forecasting).
  • Contextual extraction. Automatically pull structured information from text-based sources.
  • Sentiment analysis. Identifying the mood or subjective opinions within large amounts of text, including average sentiment and opinion mining.
  • Speech-to-text and text-to-speech conversion. Transforming voice commands into written text, and vice versa.
  • Document summarization. Automatically generating synopses of large bodies of text and detecting the languages represented in multi-lingual corpora (documents).
  • Machine translation. Automatic translation of text or speech from one language to another.
  • Text generation, automatic captioning

Sentiment analysis

Is the sentiment positive, negative, neutral

Applications?

Social media monitoring

detection of hate speech, measure the health of a conversation

Customer support ticket analysis

VoC Voice of Customer - Voice of Employee

Brand monitoring and reputation management

** we will see how it's done

A simple NLP model:

Bag of words

Suppose you have a text and you want to classify it as "sci-fi" or "romance"

You could count words and calculate the probability of "sci-fi" and "romance" as a function of the frequency of words like "alien", "space ship" and "love", "beautiful" etc...

  • first you need to tokenize and lemmatize so that e.g. "alien" and "aliens" will be counted together
  • then you have to remove words like "and" because they contaminate the count - those are called "stopwords"
  • then you can count
  • then you can apply _any_ prediction model: e.g. Random Forest, Logistic Regression or Naive Bayes (common)

** we will see how it's done
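
The bag-of-words steps listed above can be sketched in plain Python. Everything here is a simplifying assumption made for illustration: the four-document corpus, the tiny stopword set, the crude "strip a trailing s" stand-in for lemmatization, and the hand-written Naive Bayes with add-one smoothing. A real analysis would use a proper lemmatizer and a library classifier.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list
STOPWORDS = {"the", "a", "an", "and", "in", "of", "with", "on", "is", "was"}


def preprocess(text):
    """Tokenize, crudely 'lemmatize' (strip a plural s), drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return [t for t in lemmas if t not in STOPWORDS]


def train(docs, labels):
    """Count words per class: this is the 'bag of words'."""
    counts = {label: Counter() for label in set(labels)}
    priors = Counter(labels)
    for doc, label in zip(docs, labels):
        counts[label].update(preprocess(doc))
    vocab = {word for c in counts.values() for word in c}
    return counts, priors, vocab


def predict(text, counts, priors, vocab):
    """Naive Bayes with add-one smoothing; returns the most probable label."""
    total_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        score = math.log(priors[label] / total_docs)
        denom = sum(word_counts.values()) + len(vocab)
        for word in preprocess(text):
            if word in vocab:
                score += math.log((word_counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training corpus
docs = [
    "The aliens landed in a space ship",
    "An alien fleet attacked the space station",
    "A beautiful love story",
    "She fell in love with a beautiful stranger",
]
labels = ["sci-fi", "sci-fi", "romance", "romance"]
model = train(docs, labels)

print(predict("Aliens in a space ship", *model))
print(predict("a beautiful love", *model))
```

Because "aliens" is reduced to "alien" before counting, the two forms contribute to the same count, which is the point of the lemmatization step.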

vanishing gradient

5

 

state space model

Definition

A state-space model is a model to derive the value of a time-dependent variable x(t), the state, generated by a noisy Markovian process, from observations of a variable y(t), also subject to noise, linearly related to the target variable

y_t=Hx_t+\epsilon_t;~~\epsilon_t \sim N(0,\Sigma^2_\epsilon)
x_{t} =\Phi x_{t-1} + \nu_t;~~\nu_t \sim N(0,\Sigma^2_\nu)

A time-domain enabled AI system should:

Training a DNN

1994

We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases

the magnitude of the derivative of the state of a dynamical system at time t with respect to the state at time 0 decreases exponentially as t increases.

Training a DNN

you need to pick

the algorithm: Stochastic Gradient Descent

assume a simpler line model   y = ax

(b = 0) so we only need to find the "best" parameter a

1. choose initial value for a

2. calculate the SSE

3. take the gradient of the SSE and step in proportion

\nabla: the gradient is the slope of the line tangent to the curve at a point

\mathrm{Gradient~descent:}~~ a_1 = a_0 - \eta \, \nabla \mathrm{SSE}|_{a=a_0}, ~~ \eta = \mathrm{learning~rate~(l.r.)}
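
The three steps can be sketched for the line model y = ax. Note the assumptions: this uses the full gradient of the SSE at every step (plain gradient descent) rather than the stochastic variant, which would use a random subset of the data per step; the data, the learning rate and the number of steps are made up.

```python
def sse_gradient(a, xs, ys):
    """Derivative of SSE = sum((y - a*x)**2) with respect to a."""
    return -2.0 * sum(x * (y - a * x) for x, y in zip(xs, ys))


def gradient_descent(xs, ys, a0=0.0, eta=0.01, n_steps=100):
    """Start from a0 and repeatedly step against the gradient, scaled by eta."""
    a = a0
    for _ in range(n_steps):
        a = a - eta * sse_gradient(a, xs, ys)
    return a


# Hypothetical noise-free data generated with a = 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
a_fit = gradient_descent(xs, ys)
print(a_fit)
```

With a learning rate that is too large the same loop overshoots and diverges, which is one reason the learning rate is a quantity you need to pick.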

RNN architecture

Feed-forward NN architecture: input layer, hidden layers, output layer

Recurrent NN architecture: input layer, RNN hidden layers, output layer

In TSA this is a State Space Problem

we want to process a sequence of vectors x applying a recurrence formula at every time step:

h_t = f_q(h_{t-1}, x_t)

h_t: current state

h_{t-1}: previous state

x_t: current input, the features (can be time dependent)

f_q: function with parameters q

RNN architecture

Simplest possible RNN

[Figure: input layer, RNN hidden layers, output layer, with weights Wxh, Whh and Qhy]

h_t = f_q(h_{t-1}, x_t)

h_t = \tanh(W_{hh}\cdot h_{t-1} + W_{xh}\cdot x_t)
y_t = Q_{hy}\cdot h_{t}
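
A minimal NumPy sketch of the simplest RNN forward pass, assuming the two terms inside the tanh are summed (the usual form) and with no bias terms. The layer sizes and the random weights are arbitrary choices for illustration; in practice the weights are learned.

```python
import numpy as np


def rnn_forward(xs, W_hh, W_xh, Q_hy, h0):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = Q_hy h_t at each step."""
    h = h0
    hs, ys = [], []
    for x in xs:                          # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x)  # same weights reused at every step
        hs.append(h)
        ys.append(Q_hy @ h)
    return np.array(hs), np.array(ys)


# Hypothetical sizes and random (untrained) weights
rng = np.random.default_rng(0)
n_hidden, n_in, n_out, n_steps = 4, 3, 2, 5
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_in))
Q_hy = rng.normal(scale=0.5, size=(n_out, n_hidden))
xs = rng.normal(size=(n_steps, n_in))

hs, ys = rnn_forward(xs, W_hh, W_xh, Q_hy, h0=np.zeros(n_hidden))
print(hs.shape, ys.shape)
```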

RNN architecture

Alternative graphical representation of RNN

[Figure: the RNN unrolled in time: input layer, hidden states h(t-1), h(t), ..., h(t+4), outputs y(t), ..., y(t+5), weights Wxh, Whh, Qhy]

the weights are the same! always the same Whh and Qhy

h_t = f_q(h_{t-1}, x_t)
y_t = Q_{hy}\cdot h_{t}

RNN architecture

applications

image captioning

sentiment analysis

language translation

classification in real time

RNN architecture

 

 

more complicated  RNNs

Some layers will be recurrent, others will not. Does not need to be fully connected

RNN architecture

[Figure: the RNN unrolled in time: input layer, hidden states h(t-1), ..., h(t+4), outputs y(t), ..., y(t+5), errors e(t), ..., e(t+5), weights Wxh, Whh, Why, and the LOSS]

each output has its own loss

h_t = W_h\phi(h_{t-1}) + W_{x}x(t)
y_t = W_y\phi(h_t)

The      cats      that     ate      were     full

The      cat        that     ate      was       full

RNN architecture

each output has its own loss

h_t = W_h\phi(h_{t-1}) + W_{x}x(t)
y_t = W_y\phi(h_t)

Total loss:

\frac{\partial E}{\partial \theta} = \sum_{t=1}^{N}\frac{\partial e_t}{\partial \theta}

\frac{\partial e_t}{\partial \theta} =\sum_{k=1}^{t} \frac{\partial e_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial \theta}

\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}

\left| \frac{\partial h_t}{\partial h_{t-1}} \right|< 1 \rightarrow 0 ~~\mathrm{(the~gradient~vanishes)}
\left|\frac{\partial h_t}{\partial h_{t-1}}\right| > 1 \rightarrow \infty ~~\mathrm{(the~gradient~explodes)}
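
The two limits can be seen numerically in a toy scalar case. The sketch assumes every one-step derivative has the same magnitude (0.9 or 1.1, values picked for illustration), so the chained derivative over many steps is just a power of that number.

```python
def chained_factor(step_derivative, n_steps):
    """Product of n_steps identical one-step derivatives |dh_i / dh_{i-1}|."""
    product = 1.0
    for _ in range(n_steps):
        product *= step_derivative
    return product


# Hypothetical one-step magnitudes, 50 time steps apart
vanishing = chained_factor(0.9, 50)   # shrinks towards 0
exploding = chained_factor(1.1, 50)   # grows without bound
print(vanishing, exploding)
```

After only 50 steps a factor slightly below 1 has reduced the contribution of the remote past by more than two orders of magnitude, and a factor slightly above 1 has amplified it by about as much.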

vanishing gradient problem!

[Figure: the unrolled RNN: input layer, hidden states h(t-1), ..., h(t+4), outputs y(t), ..., y(t+5), errors e(t), ..., e(t+5), weights Wxh, Whh, Why]

RNN obsesses over recent past, forgets remote past

The vanishing gradient problem is exacerbated by having the same set of weights (Whh at every step).

The vanishing gradient problem causes early layers not to learn as effectively

The earlier layers learn from the remote past

As a result: a vanilla RNN would only have short-term memory (only learn from recent states)

LSTM

4

LSTM: long short term memory

solution to the vanishing gradient problem

C_t: output

h: hidden states

x: input

C_{t-1}: previous cell state (previous output)

h_{t-1}: previous hidden state

x_t: current input

forget gate:

do I keep memory of this past step?

f^{(t)} = \sigma(W^f[h_{t-1},x_t] + b^f)

input gate:

do I update the current cell?

i^{(t)} = \sigma(W^i[h_{t-1},x_t] + b^i)
\hat{C}^{(t)} = \tanh(W^C[h_{t-1},x_t] + b^C)

cell state:

produces the prediction

C^{(t)} = C^{(t-1)} \times f^{(t)}+ i^{(t)} \times \hat{C}^{(t)}

output gate:

how much of the cell state goes into the hidden state

o^{(t)} = \sigma(W^o[h_{t-1},x_t] + b^o)

hidden state:

produces the new hidden state

h^{(t)} = o^{(t)} \times \tanh\left( C^{(t)}\right)
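
The gate equations can be collected into a single LSTM time step. This is a NumPy sketch under stated assumptions: the products between gates and states are element-wise, the weights are random and untrained, and the dictionary keys "f", "i", "C", "o" are just a naming choice for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step: forget, input, cell update, output, new hidden state."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i = sigmoid(W["i"] @ z + b["i"])         # input gate
    C_hat = np.tanh(W["C"] @ z + b["C"])     # candidate cell state
    C = C_prev * f + i * C_hat               # new cell state
    o = sigmoid(W["o"] @ z + b["o"])         # output gate
    h = o * np.tanh(C)                       # new hidden state
    return h, C


# Hypothetical sizes and random (untrained) weights
rng = np.random.default_rng(0)
n_hidden, n_in = 3, 2
W = {g: rng.normal(scale=0.5, size=(n_hidden, n_hidden + n_in)) for g in "fiCo"}
b = {g: np.zeros(n_hidden) for g in "fiCo"}

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(4, n_in)):       # a sequence of 4 time steps
    h, C = lstm_step(x_t, h, C, W, b)
print(h, C)
```

The cell state is updated additively (old state times the forget gate plus the gated candidate), which is what lets information from remote steps survive when the forget gate stays close to 1.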

 

 


RNN architecture

applications

image captioning: one image to a sequence of words

sentiment analysis: sequence of words to one sentiment

language translator: sequence of words to sequence of words

online: video classification frame by frame


vanishing gradient problem!

[Figure: the unrolled RNN, annotated "Learns Fast!" and "Learns slow!"]

RNN obsesses over recent past, forgets remote past


 

 

LSTM: how to actually run it

even if you want to predict a single time series, you need many examples

split the time series into chunks

batch size: how many sequences you pass at once

timeseries: how many time stamps in a sequence

features: how many measurements in the time series

batch size: N

timeseries: 1000

features: 2

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# each training example is a chunk of 50 time stamps with 2 features
model = Sequential()
model.add(LSTM(32, input_shape=(50, 2)))
model.add(Dense(2))
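
Splitting the time series into chunks can be sketched with NumPy. The assumptions here are mine: the series has 1000 time stamps and 2 features, the chunks are non-overlapping and 50 time stamps long (to match the input shape above), and the data are random numbers standing in for real measurements.

```python
import numpy as np


def split_into_chunks(series, chunk_length):
    """Reshape (time, features) into (n_chunks, chunk_length, features)."""
    n_chunks = len(series) // chunk_length
    trimmed = series[: n_chunks * chunk_length]   # drop any leftover tail
    return trimmed.reshape(n_chunks, chunk_length, series.shape[1])


# Hypothetical series: 1000 time stamps, 2 features
series = np.random.default_rng(0).normal(size=(1000, 2))
chunks = split_into_chunks(series, chunk_length=50)
print(chunks.shape)
```

The resulting array has the (batch size, timeseries, features) layout described above. Overlapping (sliding) windows are a common alternative that yields more training examples from the same series.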

LSTM: how to actually run it

To be or not to be? this is the question.  Whether 'tis nobler in the mind

sequences of 12 letters

batch size: N

timeseries: 12

features: 1

LSTM: how to actually run it

 

There is no homework on this cause I am at the end of the semester, but if you want to learn more I will upload an exercise over the weekend where you will train an RNN to generate physics paper titles!

 

 

attention

4

Attention is all you need: transformer model

transformer generalized architecture elements

Encoder + Decoder architecture

Attention mechanism

Multi-headed attention

attention mechanism:

a way to relate elements of the time series with each other

The cat that ate

was full and happy

     v1   v2   v3   v4
k1   0.1  0.2  0.0  0.1
k2   0.6  0.3  0.1  0.3
k3   0.2  0.1  0.2  0.1
k4   0.6  0.9  0.1  0.8

The cat that ate was full and happy

     v1   v2   v3   v4   v5   v6   v7   v8
k1   1    0.1  0.1  0.1  0.1  0.1  0.1  0.7
k2   0.2  1    0.1  0.6  0.8  0.2  0.1  0.4
k3   0.1  0.1  1    0.2  0.1  0.2  0.1  0.1
k4   0.6  0.7  0.1  1    0.5  0.9  0.1  0.5
k5   0.1  0.9  0.1  0.3  1    0.1  0.1  0.3
k6   0.1  0.5  0.3  0.7  0.3  1    0.1  0.9

Attention is all you need (2017)

Encoder + Decoder architecture

Multi-headed attention:

different elements of the sentence relate to input elements in multiple ways

[Figure: three attention heads with weights W1, W2, W3, each with its own attention matrix like the one below; vectors shown: (1238, 913, 12) and (39, 5, 903)]

     v1   v2   v3
k1   0.1  0.1  0.1
k2   0.9  0.3  0.1
k3   0.2  0.1  0.2

Attention is all you need (2017)

Encoder + Decoder architecture

Key - Value - Query

attention:

The key/value/query concept is analogous to retrieval systems.

key: input

query: output

value... input as well

project embedding into Key - Value - Query lower dimensional representations

Multi-headed Self attention:

Attention is all you need (2017)

Dot Product Attention

Scaled Dot Product Attention

With very large vector dimensions (d), the dot product can produce very large magnitudes, which result in very small gradients when passed into the softmax function. To avoid this, we scale the values before the softmax (scale = 1 / √d).
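
A minimal NumPy sketch of scaled dot-product attention for a single head, assuming Q, K and V are already projected and share the same dimension d. The sizes and the random inputs are placeholders for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, returning the output and the weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # scale before the softmax
    weights = softmax(scores, axis=-1)   # one row of weights per query
    return weights @ V, weights


# Hypothetical input: 6 tokens, dimension 8, random values
rng = np.random.default_rng(0)
n_tokens, d = 6, 8
Q = rng.normal(size=(n_tokens, d))
K = rng.normal(size=(n_tokens, d))
V = rng.normal(size=(n_tokens, d))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

The `weights` matrix plays the role of the key-by-value tables shown earlier: each row sums to 1 and says how much every other token contributes to that position.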

Attention is all you need (2017)

Encoder + Decoder architecture

attention:

project embedding into Key - Value - Query lower dimensional representations

Encoder Attention

  • Q = the current position-word vector in the input sequence
  • K = all the position-word vectors in the input sequence
  • V = all the position-word vectors in the input sequence

Decoder Attention

  • Q = the current position-word vector in the output sequence
  • K = all the position-word vectors in the output sequence
  • V = all the position-word vectors in the output sequence

Encoder-Decoder Attention

  • Q = the output of the decoder’s masked attention
  • K = all the encoder’s hidden state vectors
  • V = all the encoder’s hidden state vectors

transformer model

5

Encoder + Decoder architecture

Encodes the past

Turns out attention is not really _all_ you need...

so far we are working with a "bag of words": the order of words is not known to the model

positional encoding

Attention is all you need (2017)
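
One way to give the model the word order is the sinusoidal positional encoding of the 2017 paper, sketched below with NumPy. That this particular encoding is the one meant here is an assumption (the slide content is not in the text), and the sizes are arbitrary; the function also assumes an even embedding dimension.

```python
import numpy as np


def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


# Hypothetical sizes: 8 positions, embedding dimension 16
pe = positional_encoding(n_positions=8, d_model=16)
print(pe.shape)
```

The encoding is added to the token embeddings, so two identical words at different positions no longer have identical vectors.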

Attention is all you need

Encoder + Decoder architecture

Encodes the past

Encoder + Decoder architecture

decodes the past and predicts the future

MHA acting on encoder (1)

Attention is all you need (2017)

v1 v2 v3 v4 v5 v6 v7 v8
k1 1 0.1 0.1 0.1 0.1 0.1 0.1 0.7
k2 0.2 1 0.1 0.6 0.8 0.2 0.1 0.4
k3 0.1 0.1 1 0.2 0.1 0.2 0.1 0.1
k4 0.6 0.7 0.1 1 0.5 0.9 0.1 0.5
k5 0.1 0.9 0.1 0.3 1 0.1 0.1 0.3
k6 0.1 0.5 0.3 0.7 0.3 1 0.1 0.9

The cat that ate

was

full

 

The cat that ate was full and happy

Encoder + Decoder architecture

decodes the past and predicts the future

a stack of N = 6 identical layers each with

(1) a multi-head self-attention mechanism act on previous decoder output,

(2) a multi-head self-attention mechanism act on encoder output,

(3) a positionwise fully connected feed-forward NN

MHA acting on encoder (1)

Attention is all you need (2017)

v1 v2 v3 v4 v5 v6 v7 v8
k1 1 0.1 0.1 0.1 0.1 0.1 0.1 0.7
k2 0.2 1 0.1 0.6 0.8 0.2 0.1 0.4
k3 0.1 0.1 1 0.2 0.1 0.2 0.1 0.1
k4 0.6 0.7 0.1 1 0.5 0.9 0.1 0.5
k5 0.1 0.9 0.1 0.3 1 0.1 0.1 0.3
k6 0.1 0.5 0.3 0.7 0.3 1 0.1 0.9

The cat that ate

was

full

 

The cat that ate was full and happy

 masking dependence on the future

MHA acting on encoder output (2)

Attention is all you need (2017)
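A minimal sketch of the attention acting on the encoder output, assuming a single head and random (untrained) projection matrices. The queries come from the decoder, while the keys and values come from the encoder's hidden states, so there is one weight per (decoder position, encoder position) pair. All names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    Q = decoder_states @ Wq   # queries: what the decoder is looking for
    K = encoder_states @ Wk   # keys: computed from the encoder output
    V = encoder_states @ Wv   # values: computed from the encoder output
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)  # each decoder position distributes 1 over the input
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 16                                      # hypothetical embedding size
encoder_states = rng.normal(size=(8, d_model))    # 8 input words
decoder_states = rng.normal(size=(6, d_model))    # 6 words decoded so far
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = encoder_decoder_attention(
    decoder_states, encoder_states, Wq, Wk, Wv
)
```

No future mask is needed here: the whole input sentence is already known, so every decoder position may look at every encoder position.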

Encoder + Decoder architecture

Input

Embedding

Positional encoding

Encoder attention

Output

Embedding

Positional encoding

Decoder attention

Encoder-Decoder attention

Feed Forward

Linear

Softmax

Attention is all you need (2017)
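The last two blocks of the architecture (Linear, Softmax) turn the final decoder state into a probability for each word in the vocabulary. A minimal sketch, assuming a toy eight-word vocabulary and random, untrained weights; the vocabulary, names, and sizes are made up for illustration, so the predicted word is meaningless.

```python
import numpy as np

def next_token_probabilities(decoder_state, W_vocab, b_vocab):
    # Linear: project the decoder state onto one score (logit) per word
    logits = decoder_state @ W_vocab + b_vocab
    # Softmax: turn the scores into probabilities that sum to 1
    logits = logits - logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
vocab = ["the", "cat", "that", "ate", "was", "full", "and", "happy"]
d_model = 16                                   # hypothetical embedding size
W_vocab = rng.normal(size=(d_model, len(vocab)))
b_vocab = np.zeros(len(vocab))
decoder_state = rng.normal(size=d_model)       # stand-in for the decoder output

probs = next_token_probabilities(decoder_state, W_vocab, b_vocab)
predicted = vocab[int(np.argmax(probs))]       # greedy choice of the next word
```

Picking the most probable word is only one possible decoding choice; sampling from `probs` is another.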

GPT-3 and society

unexpected consequences of NLP models

 

Vinay Prabhu exposes racist bias in GPT-3

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell 

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

Last week, Gebru said she was fired by Google after objecting to a manager’s request to retract or remove her name from the paper. Google’s head of AI said the work “didn’t meet our bar for publication.” Since then, more than 2,200 Google employees have signed a letter demanding more transparency into the company’s handling of the draft. Saturday, Gebru’s manager, Google AI researcher Samy Bengio, wrote on Facebook that he was “stunned,” declaring “I stand by you, Timnit.” AI researchers outside Google have publicly castigated the company’s treatment of Gebru.

We have identified a wide variety of costs and risks associated with the rush for ever larger LMs, including:

  • environmental costs (borne typically by those not benefiting from the resulting technology);
  • financial costs, which in turn erect barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques;
  • opportunity cost, as researchers pour effort away from directions requiring less resources; and
  • the risk of substantial harms, including stereotyping, denigration, increases in extremist ideology, and wrongful arrest, should humans encounter seemingly coherent LM output and take it for the words of some person or organization who has accountability for what is said.

When we perform risk/benefit analyses of language technology, we must keep in mind how the risks and benefits are distributed, because they do not accrue to the same people. On the one hand, it is well documented in the literature on environmental racism that the negative effects of climate change are reaching and impacting the world’s most marginalized communities first [1, 27].

Is it fair or just to ask, for example, that the residents of the Maldives (likely to be underwater by 2100 [6]) or the 800,000 people in Sudan affected by drastic floods pay the environmental price of training and deploying ever larger English LMs, when similar large-scale models aren’t being produced for Dhivehi or Sudanese Arabic?

While the average human is responsible for an estimated 5t CO2 per year, the authors trained a Transformer (big) model [136] with neural architecture search and estimated that the training procedure emitted 284t of CO2.

[...]

4.1 Size Doesn’t Guarantee Diversity The Internet is a large and diverse virtual space, and accordingly, it is easy to imagine that very large datasets, such as Common Crawl (“petabytes of data collected over 8 years of web crawling”, a filtered version of which is included in the GPT-3 training data) must therefore be broadly representative of the ways in which different people view the world. However, on closer examination, we find that there are several factors which narrow Internet participation [...]

Starting with who is contributing to these Internet text collections, we see that Internet access itself is not evenly distributed, resulting in Internet data overrepresenting younger users and those from developed countries [100, 143]. However, it’s not just the Internet as a whole that is in question, but rather specific subsamples of it. For instance, GPT-2’s training data is sourced by scraping outbound links from Reddit, and Pew Internet Research’s 2016 survey reveals 67% of Reddit users in the United States are men, and 64% between ages 18 and 29. Similarly, recent surveys of Wikipedians find that only 8.8–15% are women or girls [9].

4.3 Encoding Bias It is well established by now that large LMs exhibit various kinds of bias, including stereotypical associations [11, 12, 69, 119, 156, 157], or negative sentiment towards specific groups [61]. Furthermore, we see the effects of intersectionality [34], where BERT, ELMo, GPT and GPT-2 encode more bias against identities marginalized along more than one dimension than would be expected based on just the combination of the bias along each of the axes [54, 132].

The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said.

LAB: 

 

show that the Keras example of time series analysis with TensorFlow... is wrong!

- visualize and familiarize yourself with the data (which the authors of the notebook had not done)

- create a model like the one that they created (it takes a while to train, so I saved a pretrained version for you)

- look at the loss, which they did not do

- remove the attention block and compare the models

https://github.com/fedhere/MLPNS_FBianco/blob/main/transformers/assess_TS_classification_w_tensorflow.ipynb
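A minimal sketch of what "look at the loss" and "compare the models" could mean in practice. It assumes each model's training history is available as a dictionary with `loss` and `val_loss` lists, which is the layout of the `history` attribute returned by Keras `model.fit` when validation data is provided. The helper name is hypothetical and the numbers are made up for illustration only; they are not results from the notebook.

```python
def summarize_history(history):
    # history: dict with per-epoch "loss" and "val_loss" lists
    val = history["val_loss"]
    best_epoch = min(range(len(val)), key=val.__getitem__)
    return {
        "best_epoch": best_epoch,                 # epoch with lowest validation loss
        "best_val_loss": val[best_epoch],
        # a large positive gap at the end suggests overfitting
        "final_gap": val[-1] - history["loss"][-1],
    }

# made-up numbers, for illustration only
with_attention = {
    "loss": [0.70, 0.55, 0.45, 0.38, 0.33],
    "val_loss": [0.69, 0.58, 0.52, 0.53, 0.56],
}
without_attention = {
    "loss": [0.70, 0.56, 0.47, 0.40, 0.35],
    "val_loss": [0.69, 0.59, 0.53, 0.52, 0.55],
}

summary_with = summarize_history(with_attention)
summary_without = summarize_history(without_attention)
```

If the two summaries look alike, the attention block is not contributing much, which is the comparison the last step of the lab asks for.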

The FordA dataset 

 

This data was originally used in a competition in the IEEE World Congress on Computational Intelligence, 2008. The classification problem is to diagnose whether a certain symptom exists or does not exist in an automotive subsystem. Each case consists of 500 measurements of engine noise and a classification. There are two separate problems: for FordA, the train and test data sets were collected in typical operating conditions, with minimal noise contamination.

reading 

 

resources

 

A video on transformers that I think is really good!

https://www.youtube.com/watch?v=4Bdc55j80l8

 

A video on attention (with a different accent than the one I have subjected you to all this time!)

https://www.youtube.com/watch?v=-9vVhYEXeyQ

 

resources

 

Tutorial