[Summary] Why Less is More (Sometimes): A Theory of Data Curation
TL;DR: Training LLMs often requires hundreds of billions of tokens, yet not all data points contribute equally to learning: some accelerate progress, while others are redundant or even detrimental. The paper builds a theory for when a smaller, curated dataset can outperform using all available data in high-dimensional learning. It models a label generator and a pruning oracle, then derives test-error scaling laws showing that "less is more" holds only in a specific regime where data is abundant and the label generator is strong, while "more is more" remains optimal in most other regimes...
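As a rough illustration of the setup (not the paper's exact model), the sketch below builds a toy high-dimensional ridge regression where a noisy label generator produces the training labels and a pruning oracle keeps the examples with the least label noise. Comparing test error on the full versus the pruned set mimics the "less is more" versus "more is more" comparison; the dimension, sample sizes, noise level, and keep fraction are all assumed values chosen for illustration.

```python
# Toy sketch of oracle-based data pruning (illustrative assumptions only;
# d, n, noise scale, and keep_frac are not values from the paper).
import numpy as np

rng = np.random.default_rng(0)

d, n, n_test = 200, 2000, 1000             # dimension, train size, test size
w_star = rng.normal(size=d) / np.sqrt(d)   # true signal

X = rng.normal(size=(n, d))
clean = X @ w_star                         # clean targets
y = clean + 0.5 * rng.normal(size=n)       # noisy "label generator"

X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_star

def ridge_fit(X, y, lam=1e-2):
    """Closed-form ridge regression estimator."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def test_err(w):
    """Mean squared error on the held-out test set."""
    return np.mean((X_test @ w - y_test) ** 2)

# Pruning "oracle": keep the fraction of examples whose observed label
# is closest to the clean signal, i.e. the least-noisy labels.
keep_frac = 0.5
keep = np.argsort(np.abs(y - clean))[: int(keep_frac * n)]

print("full data test error:  ", test_err(ridge_fit(X, y)))
print("pruned data test error:", test_err(ridge_fit(X[keep], y[keep])))
```

Varying `keep_frac`, the label-noise scale, and the ratio `n/d` makes pruning help in some configurations and hurt in others, which is consistent with the paper's regime-dependent picture, though this toy does not reproduce its actual scaling laws.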