[Summary] LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

TL;DR State-of-the-art language models are primarily decoder-only and are trained for next-token prediction rather than for producing rich contextualized embeddings for downstream tasks. LLM2Vec introduces an unsupervised method to transform decoder-only models into strong text encoders. The approach involves three steps: (i) enabling bidirectional attention, (ii) training with masked next token prediction, and (iii) applying unsupervised contrastive learning. The resulting converted models outperform traditional encoder-only models. Background: Until recently, large language models (LLMs) were predominantly based on bidirectional encoders or encoder-decoder frameworks like BERT and T5....
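
Step (i) is the simplest to picture in code: a decoder-only model normally applies a causal attention mask, and LLM2Vec replaces it with a full mask so every token can attend to every other token. Below is a minimal PyTorch sketch of that change; the helper function and tensor shapes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, bidirectional: bool):
    # q, k, v: (batch, heads, seq_len, head_dim)
    seq_len = q.size(-2)
    if bidirectional:
        # Step (i): full mask, every token attends to every other token.
        mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    else:
        # Standard decoder-only (causal) mask: attend only to past tokens.
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = k = v = torch.randn(1, 8, 16, 64)            # (batch, heads, seq_len, head_dim)
causal_out = attention(q, k, v, bidirectional=False)
bidir_out = attention(q, k, v, bidirectional=True)
```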

October 18, 2024 · 2 min · 335 words

[Summary] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

TL;DR Generative Large Language Models (LLMs) can only generate text based on their training data, so extending them to additional knowledge sources normally requires additional training. Retrieval-Augmented Generation (RAG) combines an external document database with an LLM, which makes it possible to update the model's knowledge and make it more precise for specific applications. Method: Building blocks. The method consists of three building blocks. Document index: a pre-trained model was used to encode documents into embeddings to create the index....
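
To make the building blocks concrete, here is a minimal retrieve-then-generate sketch. `embed` and `generate` are hypothetical stand-ins for a pre-trained encoder and a generative LLM, and the cosine-similarity retriever is a simplification, not the paper's exact retriever.

```python
import numpy as np

def build_index(documents, embed):
    # Document index: encode each document into an embedding once, offline.
    return np.stack([embed(d) for d in documents])

def retrieve(query, documents, index, embed, k=3):
    # Retriever: rank documents by cosine similarity to the query embedding.
    q = embed(query)
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-scores)[:k]
    return [documents[i] for i in top]

def rag_answer(query, documents, index, embed, generate):
    # Generator: condition the LLM on the query plus the retrieved passages.
    context = "\n".join(retrieve(query, documents, index, embed))
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```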

April 29, 2024 · 2 min · 335 words

[Summary] Direct Preference Optimization (DPO)

TL;DR Direct Preference Optimization is a method for fine-tuning Large Language Models (LLMs) to better align their outputs with human preferences. It is a simpler alternative to RLHF, since it can be applied directly to the model without needing an explicit reward model or reinforcement learning optimization. Method: The authors propose to re-parameterize the reward model of RLHF so that the optimal policy can be obtained in closed form. This makes it possible to solve the standard RLHF problem with a simple classification loss....
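
That classification loss can be sketched in a few lines. The snippet below (assumed tensor shapes; β=0.1 as a typical value) computes the DPO objective from the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary-classification-style objective: prefer the chosen response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```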

December 23, 2023 · 2 min · 236 words

[Concept] Reinforcement learning from human feedback (RLHF)

TL;DR Machine learning models require a loss function to tune their parameters. Designing a loss function that reflects ambiguous human values is challenging; for example, it is not clear how to formulate a loss that captures what is funny or ethical. To this end, a reward model is trained from human feedback: it takes the model's output and predicts a reward score, which the model then uses to optimize its parameters....
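
In practice the reward model is often trained on pairwise comparisons: given two outputs for the same prompt, it should assign a higher score to the one the human labeler preferred. A minimal sketch of that Bradley-Terry-style objective (assumed shapes, not necessarily the exact formulation of any specific paper):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(preferred_scores, rejected_scores):
    # preferred_scores / rejected_scores: reward-model outputs for the response
    # the human labeler preferred vs. rejected, each of shape (batch,).
    return -F.logsigmoid(preferred_scores - rejected_scores).mean()
```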

December 9, 2023 · 2 min · 350 words