On the weak link between importance and prunability of attention heads

Aakriti Budhraja, Madhura Pande, Preksha Nema,

Pratyush Kumar, Mitesh M. Khapra

 

Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI)

IIT Madras, India

Interpretation of the roles of attention heads

  • Position: Component/Layer
  • Behaviour: Syntactic/Local

Model size reduction for better efficiency

  • Pruning
  • Distillation
  • Parameter sharing

Our work

Can we randomly prune attention heads, thus completely ignoring all the notions of ‘importance of heads’?

Apparently, Yes!

  • A large fraction of attention heads can be randomly pruned with negligible effect on accuracy in both Transformer and BERT models.

  • Transformer - No advantage from pruning strategies informed by existing studies that mark certain heads as important based on their position in the network.

  • BERT - No preference for the top or bottom layers, even though the latter are reported to have higher importance. However, strategies that avoid pruning the middle layers and consecutive layers perform better.

  • During fine-tuning, the compensation for pruned attention heads is distributed roughly uniformly across the unpruned heads.

Our Findings

Models and Datasets

We perform multiple experiments on the following models and tasks, and compare with results published in recent literature:

 

  • Transformer
    • Task: Machine Translation
    • Datasets: WMT’14 (EN-RU, EN-DE)
  • BERT
    • Tasks (GLUE): Sentence entailment, Question similarity, Question Answering, Movie Review
    • Datasets: MNLI-m, QQP, QNLI, SST-2

Experimental Process

  • We perform random pruning: a subset of attention heads, chosen by random sampling, is masked (zeroed) out.
  • Each attention head is multiplied by a mask \xi, which is set to 0 if that particular head is to be pruned and 1 otherwise (a code sketch follows this list).

  • Thus, the output of an attention head is given by:
\xi \cdot \textrm{softmax}\left(\frac{(XW^q)(XW^k)^T}{\sqrt{d_k}}\right) XW^v,
where \xi \in \{0, 1\} is the mask described above.
  • We fine-tune both the Transformer and BERT models, after pruning the attention heads.
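The masking step can be summarized in code. The following is a minimal PyTorch sketch, not the authors' implementation: the module name `MaskedMultiHeadSelfAttention`, the `prune_random_heads` helper, and the 60% pruning fraction in the usage example are illustrative assumptions; only the idea of multiplying each head's output by a fixed 0/1 mask \xi and then fine-tuning the remaining weights comes from the description above.

```python
# Illustrative sketch of random head pruning via 0/1 masks (not the paper's code).
import math
import torch
import torch.nn as nn


class MaskedMultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        # One xi per head; 1 = keep, 0 = pruned. A buffer, so it is not trained.
        self.register_buffer("head_mask", torch.ones(n_heads))

    def prune_random_heads(self, fraction: float, generator=None):
        """Randomly sample a subset of heads and set their masks to zero."""
        n_prune = int(round(fraction * self.n_heads))
        pruned = torch.randperm(self.n_heads, generator=generator)[:n_prune]
        mask = torch.ones(self.n_heads)
        mask[pruned] = 0.0
        self.head_mask.copy_(mask)
        return pruned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, _ = x.shape

        def split(t):  # (B, T, d_model) -> (B, n_heads, T, d_k)
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        heads = attn @ v                                   # (B, n_heads, T, d_k)
        heads = heads * self.head_mask.view(1, -1, 1, 1)   # xi * head output
        out = heads.transpose(1, 2).reshape(B, T, -1)
        return self.W_o(out)


if __name__ == "__main__":
    torch.manual_seed(0)
    mha = MaskedMultiHeadSelfAttention(d_model=512, n_heads=8)
    print("pruned heads:", mha.prune_random_heads(fraction=0.6).tolist())
    y = mha(torch.randn(2, 10, 512))
    print(y.shape)  # torch.Size([2, 10, 512])
```

Because the mask is registered as a buffer rather than a parameter, it stays fixed during fine-tuning while the weights of the surviving heads adapt.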

Effect of Random Pruning

  • For the Transformer, across the EN-RU and EN-DE tasks, 60% of the attention heads can be pruned with a maximum drop of just 0.15 BLEU points.

  • For BERT, half of the attention heads can be pruned with an average accuracy drop under 1% for the GLUE tasks.

Transformer

Pruning Based on Layer Numbers

One recent work on Transformers [1] identified the attention heads in the following layers as important:

  • Lower layers of Encoder Self-attention

  • Higher layers of Encoder-Decoder attention

  • Lower layers of Decoder Self-attention

What happens when we prune the attention heads of these important layers?
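One way to run this test is to restrict the random sampling to the layers in question. The helper below is a small illustrative sketch, not code from the paper; the function name `layer_targeted_head_masks` and its arguments are hypothetical. It builds a 0/1 head mask per layer and prunes heads only inside a chosen set of layers (e.g., the lower encoder self-attention layers listed above).

```python
# Illustrative helper for layer-targeted pruning masks (hypothetical names).
import random
from typing import Dict, List, Set


def layer_targeted_head_masks(n_layers: int,
                              n_heads: int,
                              target_layers: Set[int],
                              heads_to_prune_per_layer: int,
                              seed: int = 0) -> Dict[int, List[int]]:
    """Return {layer: 0/1 mask over heads}; heads are pruned (0) only in target_layers."""
    rng = random.Random(seed)
    masks = {}
    for layer in range(n_layers):
        mask = [1] * n_heads
        if layer in target_layers:
            for h in rng.sample(range(n_heads), heads_to_prune_per_layer):
                mask[h] = 0
        masks[layer] = mask
    return masks


# Example: prune 4 of 8 heads in the two lowest layers of a 6-layer encoder.
print(layer_targeted_head_masks(n_layers=6, n_heads=8,
                                target_layers={0, 1},
                                heads_to_prune_per_layer=4))
```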
