Aakriti Budhraja, Madhura Pande, Preksha Nema,
Pratyush Kumar, Mitesh M. Khapra
Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI)
IIT Madras, India
Interpretation of the roles of attention heads
Model size reduction for better efficiency
Our work
Can we randomly prune attention heads, thereby completely ignoring all notions of 'head importance'?
Apparently, Yes!
A large fraction of attention heads can be randomly pruned with negligible effect on accuracy in both Transformer and BERT models.
Transformer - No advantage in pruning attention heads identified to be important by existing studies, which relate the importance of a head to its position in the network.
BERT - No preference for the top or bottom layers, though the latter are reported to have higher importance. However, strategies that avoid pruning middle layers and consecutive layers perform better (see the masking sketch after this list).
During fine-tuning, the compensation for pruned attention heads is roughly uniformly distributed across the un-pruned heads.
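As an illustration only (assumed helper and parameter names, not the code used in our experiments), the sketch below shows one way such random pruning masks could be generated: a fixed fraction of heads is chosen uniformly at random, with a hypothetical skip_layers option for strategies that leave, e.g., the middle layers untouched.

```python
import random
import torch

def random_head_mask(num_layers, num_heads, prune_fraction, skip_layers=()):
    """Return a (num_layers, num_heads) 0/1 mask; 0 marks a pruned head."""
    mask = torch.ones(num_layers, num_heads)
    # Heads eligible for pruning: every head outside the skipped layers.
    eligible = [(l, h) for l in range(num_layers) if l not in skip_layers
                for h in range(num_heads)]
    num_to_prune = min(int(prune_fraction * num_layers * num_heads), len(eligible))
    for l, h in random.sample(eligible, num_to_prune):
        mask[l, h] = 0.0
    return mask

# Example: randomly prune ~50% of heads in a 12-layer, 12-head BERT-base-like
# model while leaving the middle layers (indices 4-7) untouched.
mask = random_head_mask(num_layers=12, num_heads=12,
                        prune_fraction=0.5, skip_layers=range(4, 8))
```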
We perform multiple experiments on the following models and tasks and compare with results published in recent literature:
BERT
GLUE tasks: Sentence entailment, Question similarity, Question answering, Movie review
Datasets: MNLI-m, QQP, QNLI, SST-2
We multiply each attention head by a mask, which is set to 0 if that particular head is to be pruned and 1 otherwise.
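A minimal sketch of how this masking could be applied inside multi-head self-attention is given below (assumed module and parameter names, not our actual implementation): each head's output is multiplied by its mask entry before the heads are concatenated and projected.

```python
import torch
import torch.nn as nn

class MaskedMultiHeadAttention(nn.Module):
    """Self-attention with a per-head 0/1 mask; 0 prunes a head."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # head_mask[h] = 0 if head h is pruned, 1 otherwise (not trained).
        self.register_buffer("head_mask", torch.ones(num_heads))

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head).
        shape = (b, t, self.num_heads, self.d_head)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v                          # (batch, heads, seq, d_head)
        # Multiply each head's output by its mask entry (0 = pruned head).
        ctx = ctx * self.head_mask.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx)
```

Because the mask simply zeroes a head's contribution, the remaining heads and the output projection can adjust during fine-tuning, which is where the roughly uniform compensation noted above is observed.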