IST-Unbabel 2021 Submission
for the Explainable QE Shared Task
November 2021
Marcos V. Treviso*
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
DeepSPIN
🎯 identify translation errors via explainability methods
Example (ET-EN):
• Source (ET): Pronksiajal võeti kasutusele pronksist tööriistad , ent käepidemed valmistati ikka puidust .
• Translation (EN): Bronking tools were introduced during the long term, but handholds were still made up of wood .
• Sentence-level QE score: 0.58
• The explainer assigns word-level scores to the source tokens (0.8 0.5 0.6 0.7 0.4 0.2 0.3 0.6 0.1 0.2 0.2) and to the translation tokens (0.9 0.6 0.6 0.8 0.5 0.5 0.6 0.7 0.2 0.1 0.9 0.2 0.1 0.3 0.5 0.6 0.1 0.5), highlighting likely translation errors.
• Constrained track: without word-level supervision
  - a sentence-level QE model is trained, and an explainer extracts word-level scores (e.g., 0.9 0.6 0.1 ...) from it
• Unconstrained track: with word-level supervision
  - a word-level QE model is trained together with a sentence-level loss; its predicted probabilities (e.g., 0.9 0.6 0.1 ...) yield word-level OK/BAD tags (see the loss sketch below)
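For the unconstrained track, here is a minimal PyTorch sketch of how a word-level loss over OK/BAD tags could be combined with a sentence-level loss. The specific losses (cross-entropy and MSE), the `alpha` weight, and the tensor layout are assumptions for illustration, not the exact objective used in the submission.

```python
import torch
import torch.nn.functional as F

def joint_qe_loss(word_logits, word_tags, sent_pred, sent_gold, alpha=1.0):
    """Word-level QE loss plus a sentence-level term (hypothetical weighting).

    word_logits: (batch, seq_len, 2) logits over {OK, BAD} for each token
    word_tags:   (batch, seq_len) gold tags (0 = OK, 1 = BAD, -100 = padding)
    sent_pred:   (batch,) predicted sentence-level quality scores
    sent_gold:   (batch,) gold sentence-level quality scores
    """
    # token-level cross-entropy over OK/BAD tags; padded positions are ignored
    word_loss = F.cross_entropy(
        word_logits.reshape(-1, 2), word_tags.reshape(-1), ignore_index=-100
    )
    # sentence-level regression loss (MSE assumed here)
    sent_loss = F.mse_loss(sent_pred, sent_gold)
    return word_loss + alpha * sent_loss
```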
• Model architecture: encoder (e.g., multilingual BERT) → convex combination (weighted sum) of the hidden states of each encoder layer → independently average the source (\(h_{src}\)) and translation (\(h_{hyp}\)) representations → 2-layered MLP → \(\hat{y} \in \mathbb{R}\) (sketched below)
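A minimal PyTorch sketch of this sentence-level head, assuming the encoder exposes all layer hidden states and that the MLP is applied to the concatenation of the pooled source and translation vectors; the concatenation, the MLP sizes, and the masking scheme are assumptions.

```python
import torch
import torch.nn as nn

class SentenceQEHead(nn.Module):
    """Convex layer combination + independent source/translation pooling + 2-layer MLP."""

    def __init__(self, hidden_size, num_layers, mlp_size=512):
        super().__init__()
        # scalar mixing weights over encoder layers; softmax makes them a convex combination
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        # 2-layered MLP mapping the pooled representation to a single quality score
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, mlp_size),
            nn.Tanh(),
            nn.Linear(mlp_size, 1),
        )

    def forward(self, hidden_states, src_mask, hyp_mask):
        # hidden_states: (num_layers, batch, seq_len, hidden_size)
        # src_mask / hyp_mask: (batch, seq_len) booleans marking source / translation tokens
        weights = torch.softmax(self.layer_logits, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)

        def masked_mean(h, mask):
            mask = mask.unsqueeze(-1).float()
            return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

        h_src = masked_mean(mixed, src_mask)   # average over source tokens
        h_hyp = masked_mean(mixed, hyp_mask)   # average over translation tokens
        return self.mlp(torch.cat([h_src, h_hyp], dim=-1)).squeeze(-1)  # ŷ ∈ ℝ
```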
• XLM-RoBERTa (XLM-R)
  - fine-tuned on all 7 language pairs from the MLQE-PE dataset
• RemBERT
  - fine-tuned on all 7 language pairs from the MLQE-PE dataset
• XLM-RoBERTa-Metrics (XLM-R-M)
  - fine-tuned on 30 language pairs from the Metrics shared task (DE-ZH and RU-DE not included)
• Attention-based
  - attention weights
  - cross-attention weights
  - attention weights × L2 norm of value vectors [1] (sketched below)
• Gradient-based
  - gradient × hidden state vector
  - gradient × attention output
  - integrated gradients [2]
• Perturbation-based
  - LIME [3]
  - erasure
• Rationalizers
  - Relaxed-Bernoulli (reparam. trick)
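A minimal sketch of the attention × norm explainer [1] for a single attention layer; aggregating over heads and query positions with a mean, and which layer to use, are simplifications rather than the exact configuration of the submission.

```python
import torch

def attention_times_norm(attn, values):
    """Relevance of each token j as attention weight times the L2 norm of its value vector.

    attn:   (num_heads, seq_len, seq_len) attention weights of one layer
    values: (num_heads, seq_len, head_dim) value vectors of the same layer
    Returns a (seq_len,) relevance score per token.
    """
    v_norm = values.norm(dim=-1)            # (num_heads, seq_len): ||v_j|| for each head
    scores = attn * v_norm.unsqueeze(1)     # element [h, i, j] = alpha_{h,i,j} * ||v_{h,j}||
    # aggregate over query positions (dim=1) and heads (dim=0); the mean is an assumption
    return scores.mean(dim=1).mean(dim=0)
```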
• Attention heads are better alone: individual heads yield more plausible explanations than combining all heads together
• Attention × Norm outperforms all
[Figure: Source AUC and Target AUC (RO-EN) for each explainer — Attention, Cross-attention, Gradient × Attention, Gradient × Hidden, Integrated Gradients, LIME, Erasure, Bernoulli Rationalizer, Attention × Norm.]
[Figure: Source AUC and Target AUC (ET-EN) for the same set of explainers.]
• Very similar findings for ET-EN
• Attention heads with high source/target AUC scores (see the selection sketch below)
  - layer 18 - head 0: high source AUC
  - layer 18 - head 3: high target AUC
[Figure: attention maps of these heads between EN and RO tokens.]
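A minimal sketch of how such heads could be identified on a dev set with word-level error annotations: score every (layer, head) pair by the AUC between its token-level scores and the gold OK/BAD labels. Concatenating scores across sentences instead of averaging per-sentence AUC is a simplification here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_heads_by_auc(head_scores, gold_tags):
    """Rank attention heads by how well their token scores detect translation errors.

    head_scores: dict mapping (layer, head) -> list of per-sentence score arrays
    gold_tags:   list of per-sentence arrays of gold labels (1 = BAD, 0 = OK)
    Returns a list of ((layer, head), auc) pairs sorted from best to worst.
    """
    gold = np.concatenate(gold_tags)
    ranked = [
        ((layer, head), roc_auc_score(gold, np.concatenate(scores)))
        for (layer, head), scores in head_scores.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```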
• Ensembling attention × norm explainers usually helps (see the sketch below)
[Figure: Source and Target AUC on RO-EN and ET-EN for the Attention × Norm explainers of XLM-R, XLM-R-M, RemBERT, and their Ensemble.]
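A minimal sketch of one way to ensemble explainers, assuming each model yields one relevance score per token of the same sentence: min-max normalize each model's scores and average them. Both the normalization and the equal weighting are assumptions.

```python
import numpy as np

def ensemble_explanations(per_model_scores):
    """Average token-level explanation scores coming from several models.

    per_model_scores: list of 1-D arrays, one per model, aligned to the same tokens
    Returns a single array of ensembled scores.
    """
    normalized = []
    for scores in per_model_scores:
        scores = np.asarray(scores, dtype=float)
        lo, hi = scores.min(), scores.max()
        # min-max normalize so models with different score scales contribute equally
        normalized.append((scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores))
    return np.mean(normalized, axis=0)
```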
[Figure: source and target AUC for the constrained and unconstrained submissions on RO-EN, ET-EN, RU-DE, and DE-ZH.]
• Attention-based methods seem stronger than other methods for producing plausible explanations for QE models
  - attention heads perform better alone
  - the norm of value vectors provides valuable information
• Ensembled explanations usually achieve better results
• QE as a rationale extraction task is a promising direction
  - especially for language pairs with a limited amount of data