IST-Unbabel 2021 Submission
for the Explainable QE Shared Task
November 2021
Marcos V. Treviso*
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
DeepSPIN
🎯 identify translation errors via explainability methods
Example (ET-EN):
• Source (ET): Pronksiajal võeti kasutusele pronksist tööriistad , ent käepidemed valmistati ikka puidust .
• Translation (EN): Bronking tools were introduced during the long term, but handholds were still made up of wood .
• Sentence-level QE score: 0.58
• The explainer assigns word-level scores to the source tokens (0.8 0.5 0.6 0.7 0.4 0.2 0.3 0.6 0.1 0.2 0.2) and to the translation tokens (0.9 0.6 0.6 0.8 0.5 0.5 0.6 0.7 0.2 0.1 0.9 0.2 0.1 0.3 0.5 0.6 0.1 0.5), highlighting likely translation errors.
• Constrained track: without word-level supervision
  - a sentence-level QE model is trained, and an explainer extracts word-level scores (e.g., 0.9 0.6 0.1 ...) from it
• Unconstrained track: with word-level supervision
  - a word-level QE model is trained together with a sentence-level loss; its predicted probabilities (e.g., 0.9 0.6 0.1 ...) yield word-level OK/BAD tags (see the loss sketch below)
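For the unconstrained track, here is a minimal PyTorch sketch of how a word-level loss over OK/BAD tags could be combined with a sentence-level loss. The specific losses (cross-entropy and MSE), the `alpha` weight, and the tensor layout are assumptions for illustration, not the exact objective used in the submission.

```python
import torch
import torch.nn.functional as F

def joint_qe_loss(word_logits, word_tags, sent_pred, sent_gold, alpha=1.0):
    """Word-level QE loss plus a sentence-level term (hypothetical weighting).

    word_logits: (batch, seq_len, 2) logits over {OK, BAD} for each token
    word_tags:   (batch, seq_len) gold tags (0 = OK, 1 = BAD, -100 = padding)
    sent_pred:   (batch,) predicted sentence-level quality scores
    sent_gold:   (batch,) gold sentence-level quality scores
    """
    # token-level cross-entropy over OK/BAD tags; padded positions are ignored
    word_loss = F.cross_entropy(
        word_logits.reshape(-1, 2), word_tags.reshape(-1), ignore_index=-100
    )
    # sentence-level regression loss (MSE assumed here)
    sent_loss = F.mse_loss(sent_pred, sent_gold)
    return word_loss + alpha * sent_loss
```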
• Model architecture: encoder (e.g., multilingual BERT) → convex combination (weighted sum) of the hidden states of each encoder layer → independently average the source (\(h_{src}\)) and translation (\(h_{hyp}\)) representations → 2-layered MLP → \(\hat{y} \in \mathbb{R}\) (sketched below)
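A minimal PyTorch sketch of this sentence-level head, assuming the encoder exposes all layer hidden states and that the MLP is applied to the concatenation of the pooled source and translation vectors; the concatenation, the MLP sizes, and the masking scheme are assumptions.

```python
import torch
import torch.nn as nn

class SentenceQEHead(nn.Module):
    """Convex layer combination + independent source/translation pooling + 2-layer MLP."""

    def __init__(self, hidden_size, num_layers, mlp_size=512):
        super().__init__()
        # scalar mixing weights over encoder layers; softmax makes them a convex combination
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        # 2-layered MLP mapping the pooled representation to a single quality score
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, mlp_size),
            nn.Tanh(),
            nn.Linear(mlp_size, 1),
        )

    def forward(self, hidden_states, src_mask, hyp_mask):
        # hidden_states: (num_layers, batch, seq_len, hidden_size)
        # src_mask / hyp_mask: (batch, seq_len) booleans marking source / translation tokens
        weights = torch.softmax(self.layer_logits, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)

        def masked_mean(h, mask):
            mask = mask.unsqueeze(-1).float()
            return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

        h_src = masked_mean(mixed, src_mask)   # average over source tokens
        h_hyp = masked_mean(mixed, hyp_mask)   # average over translation tokens
        return self.mlp(torch.cat([h_src, h_hyp], dim=-1)).squeeze(-1)  # ŷ ∈ ℝ
```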
• XLM-RoBERTa (XLM-R)
  - fine-tuned on all 7 language pairs from the MLQE-PE dataset
• RemBERT
  - fine-tuned on all 7 language pairs from the MLQE-PE dataset
• XLM-RoBERTa-Metrics (XLM-R-M)
  - fine-tuned on 30 language pairs from the Metrics shared task (DE-ZH and RU-DE not included)
• Attention-based
  - attention weights
  - cross-attention weights
  - attention weights × L2 norm of value vectors [1] (sketched below)
• Gradient-based
  - gradient × hidden state vector
  - gradient × attention output
  - integrated gradients [2]
• Perturbation-based
  - LIME [3]
  - erasure
• Rationalizers
  - Relaxed-Bernoulli (reparam. trick)
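A minimal sketch of the attention × norm explainer [1] for a single attention layer; aggregating over heads and query positions with a mean, and which layer to use, are simplifications rather than the exact configuration of the submission.

```python
import torch

def attention_times_norm(attn, values):
    """Relevance of each token j as attention weight times the L2 norm of its value vector.

    attn:   (num_heads, seq_len, seq_len) attention weights of one layer
    values: (num_heads, seq_len, head_dim) value vectors of the same layer
    Returns a (seq_len,) relevance score per token.
    """
    v_norm = values.norm(dim=-1)            # (num_heads, seq_len): ||v_j|| for each head
    scores = attn * v_norm.unsqueeze(1)     # element [h, i, j] = alpha_{h,i,j} * ||v_{h,j}||
    # aggregate over query positions (dim=1) and heads (dim=0); the mean is an assumption
    return scores.mean(dim=1).mean(dim=0)
```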
• Attention heads are better alone: individual heads yield more plausible explanations than combining all heads together
• Attention × Norm outperforms all
[Figure: Source AUC and Target AUC (RO-EN) for each explainer — Attention, Cross-attention, Gradient × Attention, Gradient × Hidden, Integrated Gradients, LIME, Erasure, Bernoulli Rationalizer, Attention × Norm.]
[Figure: Source AUC and Target AUC (ET-EN) for the same set of explainers.]
• Very similar findings for ET-EN
• Attention heads with high source/target AUC scores (see the selection sketch below)
  - layer 18 - head 0: high source AUC
  - layer 18 - head 3: high target AUC
[Figure: attention maps of these heads between EN and RO tokens.]
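A minimal sketch of how such heads could be identified on a dev set with word-level error annotations: score every (layer, head) pair by the AUC between its token-level scores and the gold OK/BAD labels. Concatenating scores across sentences instead of averaging per-sentence AUC is a simplification here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_heads_by_auc(head_scores, gold_tags):
    """Rank attention heads by how well their token scores detect translation errors.

    head_scores: dict mapping (layer, head) -> list of per-sentence score arrays
    gold_tags:   list of per-sentence arrays of gold labels (1 = BAD, 0 = OK)
    Returns a list of ((layer, head), auc) pairs sorted from best to worst.
    """
    gold = np.concatenate(gold_tags)
    ranked = [
        ((layer, head), roc_auc_score(gold, np.concatenate(scores)))
        for (layer, head), scores in head_scores.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```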
• Ensembling attention × norm explainers usually helps (see the sketch below)
[Figure: Source and Target AUC on RO-EN and ET-EN for the Attention × Norm explainers of XLM-R, XLM-R-M, RemBERT, and their Ensemble.]
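A minimal sketch of one way to ensemble explainers, assuming each model yields one relevance score per token of the same sentence: min-max normalize each model's scores and average them. Both the normalization and the equal weighting are assumptions.

```python
import numpy as np

def ensemble_explanations(per_model_scores):
    """Average token-level explanation scores coming from several models.

    per_model_scores: list of 1-D arrays, one per model, aligned to the same tokens
    Returns a single array of ensembled scores.
    """
    normalized = []
    for scores in per_model_scores:
        scores = np.asarray(scores, dtype=float)
        lo, hi = scores.min(), scores.max()
        # min-max normalize so models with different score scales contribute equally
        normalized.append((scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores))
    return np.mean(normalized, axis=0)
```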
[Figure: source and target AUC for the constrained and unconstrained submissions on RO-EN, ET-EN, RU-DE, and DE-ZH.]
• Attention-based methods seem stronger than other methods for producing plausible explanations for QE models
  - attention heads perform better alone
  - the norm of value vectors provides valuable information
• Ensembled explanations usually achieve better results
• QE as a rationale extraction task is a promising direction
  - especially for language pairs with a limited amount of data