IST-Unbabel 2021 Submission
for the Explainable QE Shared Task
November 2021
Marcos V. Treviso*
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
DeepSPIN
Explainable QE Shared Task
🎯 Goal: identify translation errors via explainability methods

Example (ET-EN):
(source) Pronksiajal võeti kasutusele pronksist tööriistad , ent käepidemed valmistati ikka puidust .
(source gloss: In the Bronze Age, bronze tools were taken into use, but the handles were still made of wood.)
(translation) Bronking tools were introduced during the long term, but handholds were still made up of wood .

[Figure: a sentence-level QE model predicts a quality score for the pair (here 0.58); an explainer then assigns a relevance score to every source token (source scores) and every translation token (translation scores).]

Constrained track:
without word-level supervision
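To make the constrained pipeline concrete, here is a minimal sketch of the interface being evaluated; the function and attribute names (e.g. `model.predict`) are illustrative placeholders, not the actual submission code.

```
# Hedged sketch of the constrained setting: the sentence-level model scores the
# pair, and a post-hoc explainer returns one relevance score per token.
from typing import Callable, List, Tuple

def explain_pair(
    model,                # sentence-level QE model with a .predict() method (assumed)
    explainer: Callable,  # maps (model, src, mt) -> (src_scores, mt_scores)
    src_tokens: List[str],
    mt_tokens: List[str],
) -> Tuple[float, List[float], List[float]]:
    sentence_score = model.predict(src_tokens, mt_tokens)          # e.g., 0.58
    src_scores, mt_scores = explainer(model, src_tokens, mt_tokens)
    # one score per token; no word-level labels are used anywhere
    assert len(src_scores) == len(src_tokens) and len(mt_scores) == len(mt_tokens)
    return sentence_score, src_scores, mt_scores
```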
Explainable QE Shared Task

[Figure: in the unconstrained setting, a word-level QE model is trained on word-level tags (OK/BAD) together with a sentence-level loss; its predicted probabilities are used directly as token scores, replacing the post-hoc explainer of the constrained setting.]

Unconstrained track:
with word-level supervision
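For the unconstrained track, this is a hedged sketch of the kind of joint objective the slide refers to ("word-level QE + sentence loss"); the weighting `alpha` and the specific loss functions are assumptions, not the submission's exact hyperparameters.

```
import torch
import torch.nn.functional as F

def joint_loss(word_logits, word_tags, sent_pred, sent_gold, alpha=1.0):
    """word_logits: [num_tokens, 2] OK/BAD logits; word_tags: [num_tokens] with 1 = BAD;
    sent_pred, sent_gold: scalar tensors with predicted and gold sentence quality."""
    word_loss = F.cross_entropy(word_logits, word_tags)  # word-level supervision
    sent_loss = F.mse_loss(sent_pred, sent_gold)         # "+ sentence loss"
    return word_loss + alpha * sent_loss
```

The predicted BAD probabilities of the word-level head then serve directly as the token-level explanation scores.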
Sentence-level models
• encoder (e.g., multilingual BERT) produces hidden states for the source and translation tokens
• weighted sum (convex combination) of the hidden states of each layer of the encoder
• independently average the source (\(h_{src}\)) and translation (\(h_{hyp}\)) token representations
• 2-layered MLP → \(\hat{y} \in \mathbb{R}\)
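A minimal PyTorch sketch of the pooling head described above, assuming the encoder exposes the hidden states of all layers and that source/translation masks are 0/1 floats; this is an approximation for illustration, not the submitted implementation.

```
import torch
import torch.nn as nn

class SentenceQEHead(nn.Module):
    def __init__(self, num_layers: int, hidden_size: int):
        super().__init__()
        # softmaxed logits -> convex combination over encoder layers
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size), nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, all_layers, src_mask, hyp_mask):
        # all_layers: [num_layers, batch, seq_len, hidden]; masks: [batch, seq_len] floats
        weights = torch.softmax(self.layer_logits, dim=0)
        mixed = (weights[:, None, None, None] * all_layers).sum(dim=0)
        # average source and translation token representations independently
        h_src = (mixed * src_mask.unsqueeze(-1)).sum(1) / src_mask.sum(1, keepdim=True)
        h_hyp = (mixed * hyp_mask.unsqueeze(-1)).sum(1) / hyp_mask.sum(1, keepdim=True)
        return self.mlp(torch.cat([h_src, h_hyp], dim=-1)).squeeze(-1)  # \hat{y}
```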
Sentence-level models
• XLM-RoBERTa (XLM-R)
  fine-tuned on all 7 language pairs from the MLQE-PE dataset
• RemBERT
  fine-tuned on all 7 language pairs from the MLQE-PE dataset
• XLM-RoBERTa-Metrics (XLM-R-M)
  fine-tuned on 30 language pairs from the Metrics shared task
  (DE-ZH and RU-DE not included)
Explainability methods
• Attention-based
  attention weights
  cross-attention weights
  attention weights × L2 norm of value vectors [1] (sketched after this list)
• Gradient-based
  gradient × hidden state vector
  gradient × attention output
  integrated gradients [2]
• Perturbation-based
  LIME [3]
  erasure
• Rationalizers
  Relaxed Bernoulli (reparameterization trick)
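A hedged sketch of the attention × norm explainer [1] for a single head (the tensor layout is an assumption): each token's relevance is the attention it receives, scaled by the L2 norm of its value vector, aggregated over query positions.

```
import torch

def attention_times_norm(attn, values):
    """attn: [seq_len, seq_len] attention weights of one head (rows = queries);
    values: [seq_len, head_dim] value vectors of the same head.
    Returns one relevance score per (key) token."""
    value_norms = values.norm(p=2, dim=-1)     # ||v_j|| for every token j
    scores = attn * value_norms.unsqueeze(0)   # alpha_ij * ||v_j||
    return scores.sum(dim=0)                   # aggregate over query positions
```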
Results on the validation set
• Attention heads are better alone
Results on the validation set
• Attention × Norm outperforms all
[Figure: Source AUC and Target AUC on RO-EN for each explainer — attention, cross-attention, gradient × attention, gradient × hidden, integrated gradients, LIME, erasure, Bernoulli rationalizer, and attention × norm.]
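The AUC values in these plots compare token-level explanation scores against the gold word-level tags (BAD = 1, OK = 0), separately for source and translation tokens. A minimal sketch of that evaluation, with made-up numbers:

```
from sklearn.metrics import roc_auc_score

gold_tags    = [0, 0, 1, 0, 1]              # OK/BAD tags for one sentence (BAD = 1)
token_scores = [0.1, 0.2, 0.9, 0.3, 0.7]    # explainer scores for the same tokens
print(roc_auc_score(gold_tags, token_scores))  # 1.0 -> errors perfectly ranked first
```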
Results on the validation set
• Attention × Norm outperforms all
[Figure: the same comparison on ET-EN — Source AUC and Target AUC for each explainer.]
Very similar findings for ET-EN
Results on the validation set
• Attention heads with high source/target AUC scores
  layer 18, head 0: high source AUC
  layer 18, head 3: high target AUC
[Figure: attention maps of these heads over Romanian (source) and English (translation) tokens.]
Results on the validation set
• Ensembling attention × norm explainers usually helps
[Figure: Source and Target AUC on RO-EN and ET-EN for XLM-R, XLM-R-M, RemBERT, and their ensemble.]
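A hedged sketch of how explanations from several models can be ensembled; the min-max normalization is an assumption about one reasonable recipe, not necessarily the submission's exact choice.

```
import numpy as np

def ensemble_scores(per_model_scores):
    """per_model_scores: list of arrays, one array of token scores per model."""
    normed = []
    for s in per_model_scores:
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        # min-max normalize so scores from different models are comparable
        normed.append((s - s.min()) / rng if rng > 0 else np.zeros_like(s))
    return np.mean(normed, axis=0)

print(ensemble_scores([[0.9, 0.2, 0.1], [0.7, 0.4, 0.3]]))  # -> [1.0, 0.1875, 0.0]
```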
Official test set results (AUC)
[Figure: Source and Target AUC on the test set for RO-EN, ET-EN, RU-DE, and DE-ZH, comparing the constrained and unconstrained submissions.]
Final remarks
• Attention-based methods seem stronger than other methods for producing plausible explanations for QE models
  - attention heads perform better alone
  - the norm of value vectors provides valuable information
• Ensembled explanations usually achieve better results
• QE as a rationale extraction task is a promising direction
  - especially for language pairs with a limited amount of data
Thank you!