On the weak link between importance and prunability of attention heads
Aakriti Budhraja, Madhura Pande, Preksha Nema,
Pratyush Kumar, Mitesh M. Khapra
Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI)
IIT Madras, India
Interpretation of the roles of attention heads
- Position: Component/Layer
- Behaviour: Syntactic/Local
Model size reduction for better efficiency
- Pruning
- Distillation
- Parameter sharing
Our work
Can we randomly prune attention heads, thus completely ignoring all the notions of ‘importance of heads’?
Apparently, Yes!
Our Findings
- A large fraction of attention heads can be randomly pruned with negligible effect on accuracy in both Transformer and BERT models.
- Transformer: No advantage in pruning attention heads identified as important by existing studies that relate a head's importance to its position in the network.
- BERT: No preference for the top or bottom layers, though the latter are reported to have higher importance. However, strategies that avoid pruning middle layers and consecutive layers perform better.
- During fine-tuning, the compensation for pruned attention heads is roughly uniformly distributed across the un-pruned heads.
Models and Datasets
We perform multiple experiments on the following models and tasks and compare with results published in recent literature:
Transformer
- Task: Machine Translation
- Dataset: WMT’14 (EN-RU, EN-DE)
BERT
- GLUE tasks: Sentence entailment, Question similarity, Question answering, Movie review
- Datasets: MNLI-m, QQP, QNLI, SST-2
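For concreteness, a hedged sketch of one way to load these four GLUE datasets with the Hugging Face `datasets` library (the tooling is our assumption; the poster does not specify it):

```python
from datasets import load_dataset

# MNLI-m, QQP, QNLI, SST-2 as listed above; GLUE config names are lowercase.
glue_tasks = ["mnli", "qqp", "qnli", "sst2"]
data = {task: load_dataset("glue", task) for task in glue_tasks}

# MNLI-m corresponds to the "matched" validation split.
print(data["mnli"]["validation_matched"][0])
print(data["sst2"]["train"][0])
```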
Experimental Process
- We perform random pruning, where a randomly sampled subset of attention heads is masked (zeroed) out.
- We multiply each attention head by a mask, which is set to 0 if that particular head is to be pruned and 1 otherwise. Thus, the output of attention head h is given by ξ_h · Att_h(x), where ξ_h ∈ {0, 1} is its mask (see the sketch below).
- We fine-tune both the Transformer and BERT models after pruning the attention heads.
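A minimal PyTorch sketch of this masking procedure (our own illustration under assumed layer/head counts and tensor shapes, not the authors' released code):

```python
import torch

def sample_head_mask(num_layers, num_heads, prune_fraction, generator=None):
    """Return a (num_layers, num_heads) mask: 0 for randomly pruned heads, 1 otherwise."""
    total = num_layers * num_heads
    num_pruned = int(round(prune_fraction * total))
    mask = torch.ones(total)
    pruned = torch.randperm(total, generator=generator)[:num_pruned]
    mask[pruned] = 0.0
    return mask.view(num_layers, num_heads)

# Example: randomly prune 50% of the heads of a 12-layer, 12-head model.
head_mask = sample_head_mask(num_layers=12, num_heads=12, prune_fraction=0.5)

# Inside attention at layer l, the stacked head outputs Att_h(x), of shape
# (batch, num_heads, seq_len, head_dim), are multiplied by their mask entries:
#   context = context * head_mask[l].view(1, -1, 1, 1)
# before the output projection, so pruned heads contribute zeros.
```

For BERT in the Hugging Face `transformers` library, a mask of this (num_layers, num_heads) shape can also be passed directly through the model's `head_mask` forward argument.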
Effect of Random Pruning
- For the Transformer, across the EN-RU and EN-DE tasks, 60% of the attention heads can be pruned with a maximum drop in BLEU score of just 0.15 points.
- For BERT, half of the attention heads can be pruned with an average accuracy drop of under 1% on the GLUE tasks (see the sketch below).
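As a concrete illustration of "half of the attention heads", the following hedged sketch uses the Hugging Face `transformers` `prune_heads` API to structurally remove a random 50% of BERT-base's heads before fine-tuning; this stands in for the mask-based procedure above and is not the authors' code:

```python
import random
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
num_layers = model.config.num_hidden_layers    # 12 for BERT-base
num_heads = model.config.num_attention_heads   # 12 for BERT-base

all_heads = [(l, h) for l in range(num_layers) for h in range(num_heads)]
pruned = random.sample(all_heads, len(all_heads) // 2)  # prune 50% at random

heads_to_prune = {}
for layer, head in pruned:
    heads_to_prune.setdefault(layer, []).append(head)

# `prune_heads` expects {layer_index: [head indices to remove]}.
model.prune_heads(heads_to_prune)
# ...then fine-tune on the GLUE task as usual.
```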
Transformer
Pruning Based on Layer Numbers
One of the recent works on Transformers [1] identified the attention heads of the following layers as important:
- Lower layers of Encoder Self-attention
- Higher layers of Encoder-Decoder attention
- Lower layers of Decoder Self-attention
What happens when we prune the attention heads of these important layers?
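To make the question concrete, here is a hedged sketch (our illustration, assuming a 6-layer encoder-decoder Transformer and reading "lower"/"higher" as the first/last two layers) of masks that zero out every head in the layers listed above:

```python
import torch

num_layers, num_heads = 6, 8  # assumed Transformer-base-like configuration

# One (num_layers, num_heads) mask per attention type, all heads kept by default.
masks = {
    "enc_self": torch.ones(num_layers, num_heads),  # encoder self-attention
    "enc_dec": torch.ones(num_layers, num_heads),   # encoder-decoder (cross) attention
    "dec_self": torch.ones(num_layers, num_heads),  # decoder self-attention
}

# Prune every head in the layers identified as important by [1]:
masks["enc_self"][:2] = 0.0   # lower encoder self-attention layers
masks["enc_dec"][-2:] = 0.0   # higher encoder-decoder attention layers
masks["dec_self"][:2] = 0.0   # lower decoder self-attention layers

# Each mask row is multiplied into the corresponding layer's head outputs,
# exactly as in the masking step under "Experimental Process".
```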