[Summary] Perception Encoder: The best visual embeddings are not at the output of the network

TL;DR Conventional wisdom holds that a single vision model cannot achieve SOTA performance across both language-centric and spatial tasks. Perception Encoder (PE) is a vision encoder demonstrating that contrastive vision-language pretraining provides versatile features suitable for both multimodal language modeling and dense spatial prediction. These diverse capabilities, however, reside in intermediate layers rather than at the model's output. The authors fine-tune two PE variants to migrate these hidden representations to the final layer: one optimized for language tasks and another for spatial tasks....
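
As a rough illustration of what "the best features live in intermediate layers" means in practice, here is a generic PyTorch sketch that records every block's hidden states with forward hooks so they can be probed instead of the final output. The stand-in nn.TransformerEncoder, the tensor shapes, and the chosen layer index are placeholders, not PE's actual architecture or the authors' alignment procedure.

```python
import torch
import torch.nn as nn

# Stand-in for a vision encoder; PE itself and its layer layout are not shown here.
block = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=12)

# Record every block's hidden states with forward hooks so intermediate
# embeddings can be compared against the final output.
features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

for i, layer in enumerate(encoder.layers):
    layer.register_forward_hook(make_hook(f"layer_{i}"))

patch_tokens = torch.randn(1, 196, 256)   # dummy patch embeddings
final_output = encoder(patch_tokens)

# e.g. feed a mid-depth layer (index chosen arbitrarily) to a dense/spatial head
mid_features = features["layer_6"]
```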

December 28, 2025 · 4 min · 749 words

[Summary] ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

TL;DR Oversmoothing is a common phenomenon in Transformers where performance degrades due to dimensional collapse: representations come to lie in a narrow cone of the feature space. The authors analyze the contrastive loss and extract the term that prevents this collapse. By taking a gradient descent step on this term with respect to the features, they derive the ContraNorm layer, which leads to a more uniform feature distribution and prevents dimensional collapse....
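
A minimal sketch of that idea, assuming the resulting update amounts to "subtract a similarity-weighted average of the features, then normalize"; the step size, temperature, and trailing LayerNorm are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContraNormSketch(nn.Module):
    # Illustrative layer: take one gradient-descent-like step that pushes
    # each token away from the tokens it is most similar to, then normalize.
    # `scale` (step size) and `tau` (temperature) are assumed hyperparameters.
    def __init__(self, dim, scale=0.1, tau=1.0):
        super().__init__()
        self.scale = scale
        self.tau = tau
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, tokens, dim)
        h = F.normalize(x, dim=-1)
        # pairwise token similarities, softmax-normalized per token
        sim = torch.softmax(h @ h.transpose(-1, -2) / self.tau, dim=-1)
        # subtract the similarity-weighted neighbourhood average, which
        # spreads tokens out and counteracts collapse into a narrow cone
        x = x - self.scale * (sim @ x)
        return self.norm(x)

layer = ContraNormSketch(dim=64)
tokens = torch.randn(2, 16, 64)   # (batch, tokens, dim)
out = layer(tokens)               # same shape, features spread more uniformly
```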

February 1, 2025 · 2 min · 400 words