How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

  • Michael Hassid
  • Hao Peng
  • Daniel Rotem
  • Jungo Kasai
  • Ivan Montero
  • Noah A. Smith
  • Roy Schwartz

Research output: Contribution to conference › Paper › peer-review

16 Scopus citations

Abstract

The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones: the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance, with an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
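
The core idea in the abstract, replacing an input-dependent attention matrix with the average of attention matrices computed over several inputs, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head, a fixed sequence length, and random queries, keys, and values standing in for real model activations. All function and variable names here are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_matrix(q, k):
    """Input-dependent attention: row-wise softmax(Q K^T / sqrt(d))."""
    d = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def average_attention(matrices):
    """Element-wise mean of several same-shaped attention matrices."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)]
            for i in range(rows)]


def random_matrix(rng, rows, cols):
    """Gaussian random matrix (assumption: a stand-in for real activations)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, dim, num_inputs = 4, 8, 16  # toy sizes chosen for illustration

# Collect one attention matrix per (hypothetical) input, then average them.
attn_samples = [
    attention_matrix(random_matrix(rng, seq_len, dim),
                     random_matrix(rng, seq_len, dim))
    for _ in range(num_inputs)
]
constant_attn = average_attention(attn_samples)

# At probing time the constant matrix is used in place of the softmax term;
# only the value vectors still depend on the input.
values = random_matrix(rng, seq_len, dim)
output = matmul(constant_attn, values)
```

Because each sampled matrix has rows that sum to 1, their average does too, so the constant matrix is still a valid set of attention weights. How variable sequence lengths, multiple heads, and multiple layers are handled is not stated in the abstract and is left out of this sketch.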

Original language: English
Pages: 1403-1416
Number of pages: 14
State: Published - 2022
Event: 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 to 11 Dec 2022

Conference

Conference: 2022 Findings of the Association for Computational Linguistics: EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 7/12/22 to 11/12/22

Bibliographical note

Publisher Copyright:
© 2022 Association for Computational Linguistics.
