[Summary] VGGT: Visual Geometry Grounded Transformer

TL;DR Traditional 3D reconstruction relied on iterative visual-geometry optimization (e.g., Bundle Adjustment). Recent work explored integrating machine learning via differentiable Bundle Adjustment, but remained slow and limited. VGGT (Visual Geometry Grounded Transformer) is a large feed-forward transformer that predicts all key 3D scene attributes—camera parameters, depth maps, point maps, and 3D point tracks—directly from one or many images in a single forward pass. It removes the need for post-hoc geometry processing, achieves state-of-the-art results on multiple benchmarks, and runs in under a second....
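
The feed-forward design is easy to picture in code. Below is a minimal sketch of the single-pass, multi-head idea, not the released model: the backbone size, patch embedding, and 9-dimensional camera encoding are all assumptions for illustration.

```python
# Hypothetical sketch of a VGGT-style interface (not the official model):
# a shared transformer backbone encodes image tokens from each frame, and
# separate heads predict cameras, depth, and point maps in one forward pass,
# with no iterative geometry optimization anywhere in the loop.
import torch
import torch.nn as nn

class VGGTSketch(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_layers=4, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.camera_head = nn.Linear(dim, 9)   # assumed encoding: 6D rotation + translation
        self.depth_head = nn.Linear(dim, 1)    # per-token depth
        self.point_head = nn.Linear(dim, 3)    # per-token 3D point

    def forward(self, images):                 # images: (N, 3, H, W), one row per frame
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        feats = self.backbone(tokens)          # single feed-forward pass
        cameras = self.camera_head(feats.mean(dim=1))  # one pose per frame
        return cameras, self.depth_head(feats), self.point_head(feats)

preds = VGGTSketch()(torch.randn(2, 3, 224, 224))
```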

April 5, 2025 · 3 min · 479 words

[Summary] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

TL;DR Transformer models often map multiple concepts to the same neuron, making it unclear what features they learn. This work makes inner representations interpretable by using a sparse autoencoder layer to map neurons to concepts. This method extracts relatively monosemantic concepts, can steer transformer generation, and shows that 512 neurons can represent tens of thousands of features. Method A major challenge in reverse engineering neural networks is the curse of dimensionality: as models grow, the latent space volume increases exponentially....
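
The core mechanism is a simple dictionary-learning setup. A minimal sketch, assuming a ReLU encoder, an overcomplete feature dimension, and an L1 sparsity penalty; the coefficient and sizes are illustrative, not the paper's exact values:

```python
# Sparse autoencoder sketch: activations from a 512-neuron layer are encoded
# into a much wider, sparsity-penalized feature space, so each learned feature
# tends toward a single concept (relative monosemanticity).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons=512, n_features=32768):  # tens of thousands of features
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)
        self.decoder = nn.Linear(n_features, n_neurons)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # sparse, overcomplete feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                      # a batch of transformer activations
recon, feats = sae(acts)
l1_coeff = 1e-3                                  # sparsity strength (assumed value)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
```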

March 15, 2025 · 2 min · 333 words

[Summary] ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

TL;DR Oversmoothing is a common phenomenon in Transformers, where performance worsens due to dimensional collapse: representations come to lie in a narrow cone in the feature space. The authors analyze the contrastive loss and extract the term that prevents this collapse. By taking a gradient descent step with respect to the features, they derive the ContraNorm layer, which leads to a more uniform distribution and prevents dimensional collapse....
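
A hedged sketch of such a layer, following the excerpt's derivation; the scale s, temperature tau, and trailing LayerNorm are assumptions about the exact formulation:

```python
# ContraNorm-style layer: take one gradient step of the uniformity term of the
# contrastive loss with respect to the features, which pushes token
# representations away from their softmax-weighted neighbors and so counteracts
# dimensional collapse.
import torch
import torch.nn as nn

class ContraNorm(nn.Module):
    def __init__(self, dim, s=0.1, tau=1.0):   # s and tau are assumed defaults
        super().__init__()
        self.s, self.tau = s, tau
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        sim = torch.softmax(x @ x.transpose(1, 2) / self.tau, dim=-1)
        x = (1 + self.s) * x - self.s * sim @ x   # gradient step away from neighbors
        return self.norm(x)

out = ContraNorm(64)(torch.randn(2, 16, 64))
```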

February 1, 2025 · 2 min · 400 words

[Summary] Fine-Grained Fashion Similarity Prediction by Attribute-Specific Embedding Learning

TL;DR In the fashion domain, visually distinct products may share fine-grained attributes like sleeve length or collar shape. Traditional methods of finding similar products often overlook these details, leading to irrelevant results for the user. To address this, the authors propose a model with two branches: a global branch that processes the entire image and a local branch that takes a Region of Interest (ROI) for specific attributes, identified through a spatial attention layer in the global branch....
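
An illustrative sketch of the two-branch layout, with attention-weighted pooling standing in for the paper's ROI extraction; all module names and sizes are assumptions:

```python
# Two-branch embedding sketch: a shared backbone encodes the image, a spatial
# attention map highlights the attribute-relevant region, and the local branch
# embeds the attended features; in the paper the attention instead selects an
# explicit ROI for the local branch.
import torch
import torch.nn as nn

class TwoBranchEmbedding(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU())
        self.attn = nn.Conv2d(dim, 1, kernel_size=1)       # spatial attention logits
        self.global_head = nn.Linear(dim, dim)
        self.local_head = nn.Linear(dim, dim)

    def forward(self, img):                                # img: (B, 3, H, W)
        fmap = self.backbone(img)                          # (B, dim, h, w)
        g = self.global_head(fmap.mean(dim=(2, 3)))        # global embedding
        w = torch.softmax(self.attn(fmap).flatten(2), -1)  # (B, 1, h*w) attention
        roi = (w * fmap.flatten(2)).sum(-1)                # attention-pooled "ROI" features
        return g, self.local_head(roi)

g, l = TwoBranchEmbedding()(torch.randn(4, 3, 224, 224))
```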

October 4, 2024 · 3 min · 457 words