[Summary] ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

TL;DR Oversmoothing is a common phenomenon in Transformers, where performance degrades due to dimensional collapse of representations: the representations come to lie in a narrow cone in feature space. The authors analyze the contrastive loss and extract the term that prevents this collapse. By taking a gradient descent step of this term with respect to the features, they derive the ContraNorm layer, which leads to a more uniform distribution of representations and prevents dimensional collapse.
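To make the idea concrete, here is a minimal PyTorch sketch of the kind of layer the TL;DR describes: a gradient step on the uniformity term of the contrastive loss, applied to the features and followed by layer normalization. The exact update in the paper may differ in details; the class name, the `scale` step size, and the `tau` temperature are assumed placeholders here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContraNorm(nn.Module):
    """Sketch of a ContraNorm-style layer (simplified, under assumptions above).

    Roughly implements: H <- LayerNorm(H - s * softmax(H H^T / tau) H),
    where the softmax term corresponds to the gradient of the uniformity
    part of the contrastive loss, which pushes representations apart.
    """

    def __init__(self, dim, scale=0.1, tau=1.0):
        super().__init__()
        self.scale = scale            # step size s of the gradient step (assumed hyperparameter)
        self.tau = tau                # softmax temperature (assumed hyperparameter)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, seq_len, dim) token representations
        h = F.normalize(x, dim=-1)                    # unit-norm features for cosine similarity
        sim = h @ h.transpose(-2, -1) / self.tau      # (batch, seq, seq) similarity matrix
        attn = sim.softmax(dim=-1)                    # attention-like weighting over similar tokens
        # Subtracting the weighted average of a token's most similar
        # neighbors pushes it away from them, counteracting the
        # narrow-cone (dimensional collapse) geometry.
        x = x - self.scale * attn @ x
        return self.norm(x)
```

Used this way, the layer can simply replace or follow the usual LayerNorm in a Transformer block, e.g. `x = ContraNorm(dim=768)(x)` after attention.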

February 1, 2025 · 2 min · 400 words