federica bianco
astro | data science | data for good
dr. federica bianco
@fedhere
this slide deck:
LR = True Negative / False Negative

 | H0 is True | H0 is False |
---|---|---|
H0 is falsified | Type I Error (False Positive) | True Positive |
H0 is not falsified | True Negative | Type II Error (False Negative) |
Attention is all you need
Encoder + Decoder architecture
Encodes the past
Decodes the past and predicts the future
Encoder + Decoder architecture
 | v1 | v2 | v3 | v4 | v5 | v6 |
---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 |
The cat that ate
was
full
The cat that ate was full
encodes the past
a stack of N = 6 identical layers each with
(1) a multi-head self-attention mechanism,
(2) a positionwise fully connected feed-forward NN
LR = True Negative / False Negative

 | H0 is True | H0 is False |
---|---|---|
H0 is falsified | Type I Error (False Positive) | True Positive |
H0 is not falsified | True Negative | Type II Error (False Negative) |
important message spammed vs. spam in your inbox
Precision
Recall
Accuracy
TP=True Positive
FP=False Positive
TN=True Negative
FN=False Negative
True Positive Rate = TP / All Positive Labels
Sensitivity = TPR = Recall
False Positive Rate = FP / All Negative Labels
Specificity = TN / All Negative Labels = 1 - FPR
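These quantities follow directly from the four confusion-matrix counts. A minimal sketch (not from the slides; the example counts are chosen to reproduce the 50%-accuracy case discussed next):

```python
def classification_metrics(tp, fp, tn, fn):
    precision   = tp / (tp + fp)                   # of everything flagged positive, how much is right
    recall      = tp / (tp + fn)                   # TPR = sensitivity
    specificity = tn / (tn + fp)                   # 1 - FPR
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, specificity, accuracy

# e.g. TP=40, FP=10, TN=10, FN=40 gives precision 0.8, recall 0.5, specificity 0.5, accuracy 0.5
print(classification_metrics(tp=40, fp=10, tn=10, fn=40))
```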
β: a factor indicating how much more important recall is than precision. For example, if we consider recall to be twice as important as precision, we can set β to 2. The standard F-score (F1) is equivalent to setting β to 1.
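A minimal sketch of the F_beta score described above, F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall):

```python
def f_beta(precision, recall, beta=1.0):
    # beta > 1 weights recall more heavily; beta = 1 recovers the standard F1 score
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_beta(0.8, 0.5))           # F1
print(f_beta(0.8, 0.5, beta=2))   # recall counted as twice as important
```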
Current classifier accuracy: 50%
Precision?
Recall?
Specificity?
Sensitivity?
Current classifier accuracy: 50%
Precision: 0.8
Recall?
Specificity?
Sensitivity?
Current classifier accuracy: 50%
Precision: 0.8
Recall: 0.5
Specificity?
Sensitivity?
Current classifier accuracy: 50%
Precision: 0.8
Recall: 0.5
Specificity: 0.5
Sensitivity?
Current classifier accuracy: 50%
Precision: 0.8
Recall: 0.5
Specificity: 0.5
Sensitivity: 0.5
Current classifier accuracy:
Precision:
Recall:
Specificity:
Sensitivity:
Current classifier accuracy: 80%
Precision: 0.8
Recall: 1
Specificity: 0
Sensitivity: 1
along the curve, the classifier probability threshold t is what changes
GOOD
BAD
GOOD
BAD
tuning by changing hyperparameters
1. Computers only know numbers, not words
2. Language's constituent elements are words
3. Meaning depends on words, how they are combined, and on the context
That is great!
1. Computers only know numbers, not words
2. Language's constituent elements are words
3. Meaning depends on words, how they are combined, and on the context
That is great!
That is not great...
1. Computers only know numbers, not words
2. Language's constituent elements are words
3. Meaning depends on words, how they are combined, and on the context
That is great!
That is not great...
tokenization and parsing: splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
** we will see how it's done
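As a minimal illustration (a plain regex split; real pipelines such as NLTK or spaCy handle punctuation, casing, and subwords more carefully):

```python
import re

text = "The cat that ate was full."
tokens = re.findall(r"\w+", text.lower())   # split into word tokens
print(tokens)   # ['the', 'cat', 'that', 'ate', 'was', 'full']
```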
Word tokenization and embedding
Word tokenization
Word tokenization and embedding and contextualizing
vector embedding (768)
<by> · <river> → small
near orthogonal embedding, low similarity vectors,
no strong relation between tokens
by
river
Word tokenization and embedding and contextualizing
vector embedding (768)
<by> · <river> → small
<river> · <bank> → large
near orthogonal embedding, low similarity vectors,
no strong relation between tokens
by
river
near parallel embedding, high similarity vectors,
strong relation between tokens
bank
river
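A sketch of the similarity idea above: the dot product (or cosine similarity) between embedding vectors measures how related two tokens are. The vectors here are random toys standing in for real 768-dimensional embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
by    = rng.normal(size=768)
river = rng.normal(size=768)
bank  = river + 0.1 * rng.normal(size=768)   # pretend "bank" sits close to "river" in this context

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(by, river))    # near 0: nearly orthogonal, weak relation
print(cosine(river, bank))  # near 1: nearly parallel, strong relation
```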
lemmatization/stemming: reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
am, are, is --> be
dog, dogs, dog's --> dog
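A minimal sketch with NLTK (assuming nltk is installed and the 'wordnet' corpus has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("dogs"))                  # 'dog'
print(lemmatizer.lemmatize("dogs"))          # 'dog'
print(lemmatizer.lemmatize("is", pos="v"))   # 'be'
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'
```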
part-of-speech tagging: marking up a word in a text (corpus) as corresponding to a particular part of speech
language detection: automatically detecting which language is used
how many characters
how many words
how many sentences
how many paragraphs
how many proper names
how often each proper name appears
Content categorization.
search and indexing
content alerts and duplication detection
plagiarism detection
Identifying the mood or subjective opinions within large amounts of text, including average sentiment and opinion mining.
Is the sentiment positive, negative, or neutral?
Applications?
Is the sentiment positive, negative, or neutral?
Applications?
detection of hate speech, measuring the health of a conversation
Customer support ticket analysis
VoC Voice of Customer - Voice of Employee
Brand monitoring and reputation management
** we will see how it's done
Suppose you have a text and you want to classify it as "sci-fi" or "romance".
You could count words and calculate the probability of "sci-fi" and "romance" as a function of the frequency of words like "alien", "spaceship" and "love", "beautiful" etc...
** we will see how it's done
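A sketch of the word-counting idea with a naive Bayes classifier (the toy texts and labels are invented for illustration; this is not the exact pipeline used later):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["the alien spaceship landed", "love under a beautiful sky",
          "aliens attack the space station", "a beautiful love story"]
labels = ["sci-fi", "romance", "sci-fi", "romance"]

vectorizer = CountVectorizer().fit(texts)                      # word counts as features
clf = MultinomialNB().fit(vectorizer.transform(texts), labels)
print(clf.predict(vectorizer.transform(["a spaceship full of aliens"])))   # ['sci-fi']
```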
vanishing gradient
A state-space model is a model to derive the value of a time-dependent variable x(t), the state, generated by a noisy Markovian process, from observations of a variable y(t), also subject to noise, that is linearly related to the target variable.
Definition
Training a DNN
1994
A time-domain-enabled AI system should:
Training a DNN
you need to pick
1994
Training a DNN
you need to pick
Training a DNN
1994
We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases
the magnitude of the derivative of the state of a dynamical system at time t with respect to the state at time 0 decreases exponentially as t increases.
We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases
you need to pick
Training a DNN
you need to pick
Training a DNN
1994
the algorithm: Stochastic Gradient Descent
assume a simpler line model y = ax
(b = 0) so we only need to find the "best" parameter a
1. choose initial value for a
2. calculate the SSE
3. take the gradient of the SSE and step in proportion
gradient: the slope of the line tangent to the curve at a given point
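A minimal sketch of this recipe for the model y = ax, minimizing the sum of squared errors (full-batch for clarity; SGD proper would use a random subsample of the data at each step; the data and learning rate are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2.5 * x + 0.1 * rng.normal(size=x.size)   # "true" slope is 2.5

a = 0.0                                   # 1. choose an initial value for a
learning_rate = 0.1
for _ in range(200):
    residuals = y - a * x
    sse = (residuals**2).sum()            # 2. calculate the SSE
    grad = -2 * (residuals * x).sum()     # 3. gradient of the SSE with respect to a
    a -= learning_rate * grad / x.size    #    step in proportion to the gradient
print(a)   # close to 2.5
```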
RNN architecture
RNN architecture
input layer
output layer
hidden layers
Feed-forward NN architecture
RNN architecture
output layer
hidden layers
Recurrent NN architecture
input layer
output layer
RNN hidden layers
output layer
hidden layers
input layer
Feed-forward NN architecture
RNN architecture
input layer
output layer
RNN hidden layers
current state
previous state
In TSA this is a State-Space Problem
we want to process a sequence of vectors x by applying a recurrence formula at every time step:
current input
RNN architecture
input layer
output layer
RNN hidden layers
current state
features
(can be time dependent)
function with parameters q
In TSA this is a State-Space Problem
we want to process a sequence of vectors x by applying a recurrence formula at every time step:
previous state
RNN architecture
input layer
output layer
RNN hidden layers
Simplest possible RNN
Whh
Wxh
Why
RNN architecture
input layer
output layer
RNN hidden layers
Simplest possible RNN
Whh
Wxh
Why
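A sketch of this simplest RNN in numpy, with the standard tanh recurrence h_t = tanh(Whh h_{t-1} + Wxh x_t) and output y_t = Why h_t (toy sizes; the weight names follow the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 16, 1
Wxh = rng.normal(scale=0.1, size=(n_hidden, n_in))
Whh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Why = rng.normal(scale=0.1, size=(n_out, n_hidden))

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(10, n_in)):   # a toy sequence of 10 time steps
    h = np.tanh(Whh @ h + Wxh @ x_t)      # the same weights are reused at every step
    y_t = Why @ h                         # output at this time step
print(y_t)
```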
RNN architecture
input layer
Alternative graphical representation of RNN
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Whh
Whh
Whh
Whh
Whh
Wxh
the weights are the same! always the same Whh and Why
Why
Why
Why
Why
RNN architecture
applications
RNN architecture
applications
image captioning
sentiment analysis
language translation
classification in real time
RNN architecture
more complicated RNNs
Some layers will be recurrent, others will not. Does not need to be fully connected
RNN architecture
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
RNN architecture
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
The cats that ate were full
The cat that ate was full
RNN architecture
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
LOSS
RNN architecture
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
Total loss:
RNN architecture
input layer
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
Total loss:
e(t)
y(t+5)
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
RNN architecture
input layer
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
Total loss:
e(t)
y(t+5)
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
vanishing gradient problem!
input layer
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
Why
obsesses over the recent past, forgets the remote past
vanishing gradient problem!
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
The vanishing gradient problem is exacerbated by reusing the same set of weights at every time step.
The vanishing gradient problem causes the earlier layers not to learn as effectively
The earlier layers learn from the remote past
As a result, a vanilla RNN only has short-term memory (it only learns from recent states)
Whh
Whh
Whh
Whh
Whh
LSTM
Ct: cell state (output)
h: hidden states
X: input
Ct-1: previous cell state (previous output)
ht-1: previous hidden state
xt: current input
forget gate:
do I keep memory of this past step?
LSTM: long short term memory
solution to the vanishing gradient problem
input gate:
do I update the current cell?
LSTM: long short term memory
solution to the vanishing gradient problem
cell state:
produces the prediction
LSTM: long short term memory
solution to the vanishing gradient problem
output gate
previous input that goes into the hidden state
LSTM: long short term memory
solution to the vanishing gradient problem
hidden state
produces the new hidden state
LSTM: long short term memory
solution to the vanishing gradient problem
LSTM: long short term memory
solution to the vanishing gradient problem
RNN architecture
input layer
output layer
hidden layers
Feed-forward NN architecture
RNN architecture
output layer
hidden layers
Feed-forward NN architecture
Recurrent NN architecture
input layer
output layer
RNN hidden layers
output layer
hidden layers
input layer
RNN architecture
input layer
output layer
RNN hidden layers
current state
previous state
Remember the state-space problem!
we want to process a sequence of vectors x by applying a recurrence formula at every time step:
RNN architecture
input layer
output layer
RNN hidden layers
Remember the state-space problem!
we want to process a sequence of vectors x by applying a recurrence formula at every time step:
current state
previous state
features
(can be time dependent)
function with parameters q
RNN architecture
input layer
output layer
RNN hidden layers
Simplest possible RNN
Whh
Wxh
Why
RNN architecture
input layer
output layer
RNN hidden layers
Simplest possible RNN
Whh
Wxh
Why
RNN architecture
input layer
Alternative graphical representation of RNN
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
the weights are the same! always the same Whh and Why
RNN architecture
applications
image captioning:
one image to a
sequence of words
RNN architecture
applications
image captioning:
one image to a
sequence of words
sentiment analysis
sequence of words to one sentiment
RNN architecture
applications
image captioning:
one image to a
sequence of words
sentiment analysis
sequence of words to one sentiment
language translator
sequence of words to sequence of words
RNN architecture
applications
image captioning:
one image to a
sequence of words
sentiment analysis
sequence of words to one sentiment
language translator
sequence of words to sequence of words
online: video classification frame by frame
RNN architecture
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
RNN architecture
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
LOSS
RNN architecture
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
Total loss:
RNN architecture
input layer
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
Total loss:
e(t)
y(t+5)
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
RNN architecture
input layer
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
each output has its own loss
Why
Total loss:
e(t)
y(t+5)
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
RNN architecture
vanishing gradient problem!
input layer
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
Why
Learns Fast!
Learns slow!
the RNN obsesses over the recent past and forgets the remote past
vanishing gradient problem!
input layer
e(t)
h(t-1)
h(t)
h(t+1)
h(t+2)
h(t+3)
h(t+4)
y(t)
y(t+1)
y(t+2)
y(t+4)
y(t+3)
y(t+5)
Why
Why
Why
Why
Why
Whh
Whh
Whh
Whh
Whh
Wxh
Why
e(t+1)
e(t+2)
e(t+3)
e(t+4)
e(t+5)
The vanishing gradient problem is exacerbated by reusing the same set of weights at every time step.
The vanishing gradient problem causes the earlier layers not to learn as effectively
The earlier layers learn from the remote past
As a result, a vanilla RNN only has short-term memory (it only learns from recent states)
Whh
Whh
Whh
Whh
Whh
Ct: cell state (output)
h: hidden states
X: input
Ct-1: previous cell state (previous output)
ht-1: previous hidden state
xt: current input
forget gate:
do I keep memory of this past step?
LSTM: long short term memory
solution to the vanishing gradient problem
input gate:
do I update the current cell?
LSTM: long short term memory
solution to the vanishing gradient problem
cell state:
produces the prediction
LSTM: long short term memory
solution to the vanishing gradient problem
output gate
previous input that goes into the hidden state
LSTM: long short term memory
solution to the vanishing gradient problem
LSTM: long short term memory
solution to the vanishing gradient problem
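The gates above can be written down in a few lines. A sketch of one LSTM step with the standard equations (biases omitted and weight sizes invented for a toy example):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 2, 8
Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)              # forget gate: do I keep memory of the past step?
    i = sigmoid(Wi @ z)              # input gate: do I update the current cell?
    c_tilde = np.tanh(Wc @ z)        # candidate cell update
    c = f * c_prev + i * c_tilde     # new cell state
    o = sigmoid(Wo @ z)              # output gate
    h = o * np.tanh(c)               # new hidden state
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):   # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c)
print(h)
```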
even if you want to predict a single time series, you need many examples
split the time series into chunks
LSTM: how to actually run it
batch size: how many sequences you pass at once
timeseries: how many time steps in a sequence
features: how many measurements in the time series
even if you want to predict a single time series, you need many examples
split the time series into chunks
LSTM: how to actually run it
batch size: N
timeseries: 1000
features: 2
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
model = Sequential()
model.add(LSTM(32, input_shape=(50, 2)))  # 50 time steps per sequence, 2 features
model.add(Dense(2))                       # predict 2 values
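A hedged usage sketch: the model above expects input of shape (batch, 50 time steps, 2 features) and predicts 2 values; the data below is random, just to show the shapes and the compile/fit calls:

```python
import numpy as np

X = np.random.normal(size=(128, 50, 2))   # 128 example sequences, 50 steps, 2 features each
y = np.random.normal(size=(128, 2))       # 2 target values per sequence
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32)
```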
even if you want to predict a single time series, you need many examples
split the time series into chunks
LSTM: how to actually run it
To be or not to be? this is the question. Whether 'tis nobler in the mind
sequences of 12 characters
batch size: N
timeseries: 12
features: 1
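A sketch of the chunking described above: overlapping windows of 12 characters, each used to predict the character that follows (integer indices stand in for one-hot or embedded characters):

```python
text = "To be or not to be? this is the question. Whether 'tis nobler in the mind"
seq_len = 12
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

X, y = [], []
for i in range(len(text) - seq_len):
    X.append([char_to_idx[c] for c in text[i:i + seq_len]])   # 12-character window
    y.append(char_to_idx[text[i + seq_len]])                  # the character that follows
print(len(X), len(X[0]))   # number of sequences, 12
```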
LSTM: how to actually run it
There is no homework on this because we are at the end of the semester, but if you want to learn more I will upload an exercise over the weekend where you will train an RNN to generate physics paper titles!
attention
Encoder + Decoder architecture
Attention mechanism
Multithreaded attention
Attention is all you need: transformer model
transformer generalized architecture elements
attention mechanism:
 | v1 | v2 | v3 | v4 |
---|---|---|---|---|
k1 | 0.1 | 0.2 | 0.0 | 0.1 |
k2 | 0.6 | 0.3 | 0.1 | 0.3 |
k3 | 0.2 | 0.1 | 0.2 | 0.1 |
k4 | 0.6 | 0.9 | 0.1 | 0.8 |
attention mechanism:
a way to relate elements of the time series with each other
 | v1 | v2 | v3 | v4 |
---|---|---|---|---|
k1 | 0.1 | 0.2 | 0.0 | 0.1 |
k2 | 0.6 | 0.3 | 0.1 | 0.3 |
k3 | 0.2 | 0.1 | 0.2 | 0.1 |
k4 | 0.6 | 0.9 | 0.1 | 0.8 |
 | v1 | v2 | v3 | v4 |
---|---|---|---|---|
k1 | 0.1 | 0.2 | 0.0 | 0.1 |
k2 | 0.6 | 0.3 | 0.1 | 0.3 |
k3 | 0.2 | 0.1 | 0.2 | 0.1 |
k4 | 0.6 | 0.9 | 0.1 | 0.8 |
The cat that ate
was full and happy
was full and happy
attention mechanism:
a way to relate elements of the time series with each other
 | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 |
---|---|---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 | 0.1 | 0.4 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 | 0.1 | 0.5 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 | 0.1 | 0.3 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 | 0.1 | 0.9 |
The cat that ate
was
full
The cat that ate was full and happy
attention mechanism:
a way to relate elements of the time series with each other
 | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 |
---|---|---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 | 0.1 | 0.4 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 | 0.1 | 0.5 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 | 0.1 | 0.3 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 | 0.1 | 0.9 |
The cat that ate
was
full
The cat that ate was full and happy
attention mechanism:
a way to relate elements of the time series with each other
Attention is all you need (2017)
Encoder + Decoder architecture
different elements of the sentence relate to input elements in multiple ways
Multi-headed attention:
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
39
5
903
1238 913 12
W1
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
1238 913 12
W2
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
1238 913 12
W3
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
39
5
903
39
5
903
Attention is all you need (2017)
Encoder + Decoder architecture
Key - Value - Query
attention:
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
1238 913 12
W
39
5
903
The key/value/query concept is analogous to retrieval systems.
key: input
query: output
value... input as well
Attention is all you need (2017)
Encoder + Decoder architecture
attention:
 | v1 | v2 | v3 |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
1238 913 12
W
39
5
903
The key/value/query concept is analogous to retrieval systems.
project embedding into Key - Value - Query
lower dimensional representations
Attention is all you need (2017)
Encoder + Decoder architecture
attention:
project embedding into Key - Value - Query
lower dimensional representations
Attention is all you need (2017)
Encoder + Decoder architecture
The key/value/query concept is analogous to retrieval systems.
Multi-headed Self attention:
Attention is all you need (2017)
Encoder + Decoder architecture
The key/value/query concept is analogous to retrieval systems.
Multi-headed Self attention:
Attention is all you need (2017)
Dot Product Attention
Attention is all you need (2017)
Scaled Dot Product Attention
the dot product can produce very large magnitudes when the vector dimension d is large, which results in very small gradients when passed into the softmax function; to avoid this, we scale the values beforehand (scale = 1 / √d).
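A numpy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / √d) V (toy shapes; in the paper each head uses d = 64):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1: how much a token attends to the others
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))   # 6 tokens, dimension 8
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.shape, out.shape)   # (6, 6) attention matrix like the k/v tables above, (6, 8) output
```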
Attention is all you need (2017)
Encoder + Decoder architecture
attention:
project embedding into Key - Value - Query
lower dimensional representations
Encoder Attention
Attention is all you need (2017)
Encoder + Decoder architecture
attention:
project embedding into Key - Value - Query
lower dimensional representations
Decoder Attention
Attention is all you need (2017)
Encoder + Decoder architecture
attention:
project embedding into Key - Value - Query
lower dimensional representations
Encoder-Decoder Attention
transformer model
Encoder + Decoder architecture
Encodes the past
Turns out attention is not really _all_ you need...
so far we are working with a "bag of words": the order of words is not known to the model
Attention is all you need (2017)
Encoder + Decoder architecture
positional encoding
Attention is all you need (2017)
Encoder + Decoder architecture
positional encoding
Attention is all you need (2017)
Encoder + Decoder architecture
positional encoding
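A sketch of the sinusoidal positional encoding used in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); it is added to the token embeddings so that word order reaches the model:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

print(positional_encoding(n_positions=50, d_model=768).shape)   # (50, 768)
```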
Attention is all you need (2017)
Attention is all you need
Encoder + Decoder architecture
Encodes the past
Encoder + Decoder architecture
decodes the past and predicts the future
MHA acting on encoder (1)
Attention is all you need (2017)
 | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 |
---|---|---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 | 0.1 | 0.4 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 | 0.1 | 0.5 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 | 0.1 | 0.3 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 | 0.1 | 0.9 |
The cat that ate
was
full
The cat that ate was full and happy
Encoder + Decoder architecture
decodes the past and predicts the future
a stack of N = 6 identical layers each with
(1) a multi-head self-attention mechanism acting on the previous decoder output,
(2) a multi-head self-attention mechanism acting on the encoder output,
(3) a positionwise fully connected feed-forward NN
MHA acting on encoder (1)
Attention is all you need (2017)
 | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 |
---|---|---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 | 0.1 | 0.4 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 | 0.1 | 0.5 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 | 0.1 | 0.3 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 | 0.1 | 0.9 |
The cat that ate
was
full
The cat that ate was full and happy
masking dependence on the future
Encoder + Decoder architecture
decodes the past and predicts the future
a stack of N = 6 identical layers each with
(1) a multi-head self-attention mechanism acting on the previous decoder output,
(2) a multi-head self-attention mechanism acting on the encoder output,
(3) a positionwise fully connected feed-forward NN
MHA acting on decoder (2)
Attention is all you need (2017)
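A sketch of the causal mask behind "masking dependence on the future": positions after the current one are set to -inf before the softmax, so the decoder cannot attend to tokens it has not produced yet (toy scores standing in for Q K^T / √d):

```python
import numpy as np

n_tokens = 6
scores = np.random.default_rng(0).normal(size=(n_tokens, n_tokens))   # stand-in attention scores
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)        # True above the diagonal = future
scores = np.where(mask, -np.inf, scores)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is 0: no attention to future tokens
```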
Attention is all you need (2017)
Encoder + Decoder architecture
Input
Embedding
Positional encoding
Encoder attention
Output
Embedding
Positional encoding
Decoder attention
Encoder-Decoder attention
Feed Forward
Linear
Softmax
Attention is all you need (2017)
GPT3 and society
Vinay Prabhu exposes racist bias in GPT-3
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
Last week, Gebru said she was fired by Google after objecting to a manager’s request to retract or remove her name from the paper. Google’s head of AI said the work “didn’t meet our bar for publication.” Since then, more than 2,200 Google employees have signed a letter demanding more transparency into the company’s handling of the draft. Saturday, Gebru’s manager, Google AI researcher Samy Bengio, wrote on Facebook that he was “stunned,” declaring “I stand by you, Timnit.” AI researchers outside Google have publicly castigated the company’s treatment of Gebru.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
We have identified a wide variety of costs and risks associated with the rush for ever larger LMs, including:
environmental costs (borne typically by those not benefiting from the resulting technology);
financial costs, which in turn erect barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques;
opportunity cost, as researchers pour effort away from directions requiring less resources; and the
risk of substantial harms, including stereotyping, denigration, increases in extremist ideology, and wrongful arrest, should humans encounter seemingly coherent LM output and take it for the words of some person or organization who has accountability for what is said.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
When we perform risk/benefit analyses of language technology, we must keep in mind how the risks and benefits are distributed, because they do not accrue to the same people. On the one hand, it is well documented in the literature on environmental racism that the negative effects of climate change are reaching and impacting the world’s most marginalized communities first [1, 27].
Is it fair or just to ask, for example, that the residents of the Maldives (likely to be underwater by 2100 [6]) or the 800,000 people in Sudan affected by drastic floods pay the environmental price of training and deploying ever larger English LMs, when similar large-scale models aren’t being produced for Dhivehi or Sudanese Arabic?
While the average human is responsible for an estimated 5t CO2 per year, the authors trained a Transformer (big) model [136] with neural architecture search and estimated that the training procedure emitted 284t of CO2.
[...]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
4.1 Size Doesn’t Guarantee Diversity The Internet is a large and diverse virtual space, and accordingly, it is easy to imagine that very large datasets, such as Common Crawl (“petabytes of data collected over 8 years of web crawling”, a filtered version of which is included in the GPT-3 training data) must therefore be broadly representative of the ways in which different people view the world. However, on closer examination, we find that there are several factors which narrow Internet participation [...]
Starting with who is contributing to these Internet text collections, we see that Internet access itself is not evenly distributed, resulting in Internet data overrepresenting younger users and those from developed countries [100, 143]. However, it’s not just the Internet as a whole that is in question, but rather specific subsamples of it. For instance, GPT-2’s training data is sourced by scraping outbound links from Reddit, and Pew Internet Research’s 2016 survey reveals 67% of Reddit users in the United States are men, and 64% between ages 18 and 29. Similarly, recent surveys of Wikipedians find that only 8.8–15% are women or girls [9].
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
4.3 Encoding Bias It is well established by now that large LMs exhibit various kinds of bias, including stereotypical associations [11, 12, 69, 119, 156, 157], or negative sentiment towards specific groups [61]. Furthermore, we see the effects of intersectionality [34], where BERT, ELMo, GPT and GPT-2 encode more bias against identities marginalized along more than one dimension than would be expected based on just the combination of the bias along each of the axes [54, 132].
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said.
LAB:
show that the Keras example of time series analysis with TensorFlow... is wrong!!
- visualize and familiarize yourself with the data (which the authors of the notebook had not done)
- create a model like the one that they created (it takes a while to train, so I saved a pretrained version for you)
- look at the loss, which they did not do
- remove the attention block and compare the models
https://github.com/fedhere/MLPNS_FBianco/blob/main/transformers/assess_TS_classification_w_tensorflow.ipynb
The FordA dataset
This data was originally used in a competition in the IEEE World Congress on Computational Intelligence, 2008. The classification problem is to diagnose whether a certain symptom exists or does not exist in an automotive subsystem. Each case consists of 500 measurements of engine noise and a classification. There are two separate problems: for FordA, the train and test data sets were collected in typical operating conditions, with minimal noise contamination.
A video on transformer which I think is really good!
https://www.youtube.com/watch?v=4Bdc55j80l8
A video on attention (with a different accent than the one I subjected you to all this time!)
https://www.youtube.com/watch?v=-9vVhYEXeyQ
Tutorial
By federica bianco
transformers