Conference

Last week I attended the CVPR conference, a gathering of computer vision researchers and professionals showcasing the latest advancements in the field. Some interesting recent trends:

- Multimodal models and datasets
  - Large Language Models (LLMs) are being used to train vision models
  - Images are used to ground LLMs, reducing their hallucinations
  - Models are being fed both images and videos to achieve better results
- Foundation models are a commodity
  - These models are becoming more accessible and less expensive to create
  - They are trained on multiple modalities and tasks (even very niche tasks like hand pose estimation)
- Transformers are everywhere
  - While not a new trend, it’s still notable that attention mechanisms are incorporated into almost every model....