TY  - JOUR
AU  - Burger, Martin
AU  - Kabri, Samira
AU  - Korolev, Yury
AU  - Roith, Tim
AU  - Weigand, Lukas
TI  - Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization
JO  - Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
VL  - 383
IS  - 2298
SN  - 1364-503X
CY  - London
PB  - Royal Society
M1  - PUBDB-2025-01273
SP  - 20240233
PY  - 2025
AB  - The aim of this paper is to provide a mathematical analysis of transformer architectures using a self-attention mechanism with layer normalization. In particular, observed patterns in such architectures resembling either clusters or uniform distributions pose a number of challenging mathematical questions. We focus on a special case that admits a gradient flow formulation in the spaces of probability measures on the unit sphere under a special metric, which allows us to give at least partial answers in a rigorous way. The arising mathematical problems resemble those recently studied in aggregation equations, but with additional challenges emerging from restricting the dynamics to the sphere and the particular form of the interaction energy. We provide a rigorous framework for studying the gradient flow, which also suggests a possible metric geometry to study the general case (i.e. one that is not described by a gradient flow). We further analyze the stationary points of the induced self-attention dynamics. The latter are related to stationary points of the interaction energy in the Wasserstein geometry, and we further discuss energy minimizers and maximizers in different parameter settings.
LB  - PUB:(DE-HGF)16
DO  - 10.1098/rsta.2024.0233
UR  - https://bib-pubdb1.desy.de/record/626051
ER  -