Aakriti Budhraja, Madhura Pande, Preksha Nema,
Pratyush Kumar, Mitesh M. Khapra
Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI)
IIT Madras, India
Interpretation of the roles of attention heads
Model size reduction for better efficiency
Our work
Can we randomly prune attention heads, thereby completely ignoring all notions of 'head importance'?
Apparently, Yes!
A large fraction of attention heads can be randomly pruned with negligible effect on accuracy in both Transformer and BERT models.
Transformer - No advantage in pruning attention heads identified to be important by existing studies, which relate the importance of a head to its position in the network.
BERT - No preference for the top or bottom layers, though the latter are reported to have higher importance. However, strategies that avoid pruning middle layers and consecutive layers perform better (see the masking sketch after this list).
During fine-tuning, the compensation for pruned attention heads is roughly uniformly distributed across the un-pruned heads.
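As an illustration only (assumed helper and parameter names, not the code used in our experiments), the sketch below shows one way such random pruning masks could be generated: a fixed fraction of heads is chosen uniformly at random, with a hypothetical skip_layers option for strategies that leave, e.g., the middle layers untouched.

```python
import random
import torch

def random_head_mask(num_layers, num_heads, prune_fraction, skip_layers=()):
    """Return a (num_layers, num_heads) 0/1 mask; 0 marks a pruned head."""
    mask = torch.ones(num_layers, num_heads)
    # Heads eligible for pruning: every head outside the skipped layers.
    eligible = [(l, h) for l in range(num_layers) if l not in skip_layers
                for h in range(num_heads)]
    num_to_prune = min(int(prune_fraction * num_layers * num_heads), len(eligible))
    for l, h in random.sample(eligible, num_to_prune):
        mask[l, h] = 0.0
    return mask

# Example: randomly prune ~50% of heads in a 12-layer, 12-head BERT-base-like
# model while leaving the middle layers (indices 4-7) untouched.
mask = random_head_mask(num_layers=12, num_heads=12,
                        prune_fraction=0.5, skip_layers=range(4, 8))
```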
We perform multiple experiments on the following models and tasks and compare with results published in recent literature:
BERT
GLUE tasks: Sentence entailment, Question similarity, Question answering, Movie review
Datasets: MNLI-m, QQP, QNLI, SST-2
We multiply each attention head by a mask, which is set to 0 if that particular head is to be pruned and 1 otherwise.
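A minimal sketch of how this masking could be applied inside multi-head self-attention is given below (assumed module and parameter names, not our actual implementation): each head's output is multiplied by its mask entry before the heads are concatenated and projected.

```python
import torch
import torch.nn as nn

class MaskedMultiHeadAttention(nn.Module):
    """Self-attention with a per-head 0/1 mask; 0 prunes a head."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # head_mask[h] = 0 if head h is pruned, 1 otherwise (not trained).
        self.register_buffer("head_mask", torch.ones(num_heads))

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head).
        shape = (b, t, self.num_heads, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v                          # (batch, heads, seq, d_head)
        # Multiply each head's output by its mask entry (0 = pruned head).
        ctx = ctx * self.head_mask.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx)
```

Because the mask simply zeroes a head's contribution, the remaining heads and the output projection can adjust during fine-tuning, which is where the roughly uniform compensation noted above is observed.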