[Summary] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
TL;DR

Machine learning model evaluations commonly report the "highest number", usually without any measure of statistical significance. This creates misleading comparisons, especially when differences between models are small. The paper surveys methods for adding statistical error bars to evals, covering independent and clustered questions, paired model comparisons, and power analysis. These tools help quantify uncertainty and avoid overconfident claims about which model is better.

Motivation

LLM evals often treat the top score as definitive....
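As a minimal sketch of the basic idea behind the error bars mentioned in the TL;DR (not code from the paper), the snippet below computes a mean eval score with a CLT-based 95% confidence interval from per-question 0/1 scores. The function name `eval_ci` and the two score lists are hypothetical, chosen only to illustrate how overlapping intervals can undercut a "highest number" comparison.

```python
import math

def eval_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean eval score with a CLT-based 95% confidence interval.

    `scores` holds one per-question score (e.g., 0/1 correctness).
    Assumes questions are independent draws; clustered questions
    would need a cluster-robust standard error instead.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance; the standard error of the mean shrinks as 1/sqrt(n).
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(var / n)
    return mean, mean - z * sem, mean + z * sem

# Hypothetical data: model A scores 80% and model B 70% on ten questions,
# yet their intervals overlap heavily -- the raw ranking alone is misleading.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
for name, scores in [("A", model_a), ("B", model_b)]:
    mean, lo, hi = eval_ci(scores)
    print(f"model {name}: {mean:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

On ten questions the intervals span tens of percentage points, which is why the paired comparisons and power analysis covered later matter: they tell you how many questions you need before a gap of a given size becomes distinguishable from noise.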