**TL;DR** AI coding agents typically produce code that a human reviews or a test suite verifies. AutoResearch is a different paradigm: an agent iterates on an ML model to improve a continuous metric. Its success comes from constraining the agent to a fixed time budget and defining an evaluation that is invariant to the code changes it is allowed to make. The implications go beyond ML models: the same approach could apply to latency, capacity, or any system with a measurable objective.
## Motivation
The bottleneck in ML research isn’t ideas — it’s the cost of running experiments. Each iteration requires implementation, a full training run, and manual evaluation. AutoResearch addresses this by making experiments cheap, time-bounded, and automatically reversible, turning the iteration loop into something an agent can run overnight.
## Method
AutoResearch succeeds because of a tightly constrained setup with one specific job:
- Let an agent modify a training script
- Run a bounded experiment (by time)
- Measure the result using a fixed metric
- Keep the change if it helps, discard it if it does not
Out of scope for the agent: browsing papers or coming up with novel theories.
Karpathy’s repository contains only three files:

- `prepare.py` — fixed harness: dataset, tokenizer, dataloader, evaluation. The agent cannot touch this.
- `train.py` — the only file the agent is allowed to change.
- `Program.md` — instructions for the agent.
What makes the constraints work:
- Metric: eval uses bits per byte (`val_bpb`) rather than raw loss, making results tokenizer-invariant.
- Simplicity bias: a small improvement that adds complexity is not worth it; equal performance with less code is a win.
- Frontier tracking: all experiments are logged to `results.tsv`, which serves as the agent’s file-based memory:
```
commit   val_bpb  memory_gb  status   description
a1b2c3d  0.9979   44.0       keep     baseline
b2c3d4e  0.9932   44.2       keep     increase LR to 0.04
c3d4e5f  1.0050   44.0       discard  switch to GeLU activation
```
- Single long-running session: each iteration starts from the current best commit. Output is redirected to `run.log` to avoid flooding the context window.
- Time-bounded experiments: ~5 min target; runs are killed at 10 min and treated as failures.
- Never stop: the agent is explicitly instructed not to pause for human confirmation once the loop starts.
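The bits-per-byte metric deserves a note, since it is what makes the evaluation invariant to tokenizer changes: per-token loss rewards a tokenizer that packs more bytes into each token, while per-byte bits do not. A minimal sketch of the conversion (the function name and call pattern are mine, not from the repo):

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte of raw text.

    Dividing by ln(2) converts nats to bits; scaling by tokens/bytes removes
    the dependence on how aggressively the tokenizer compresses the text.
    """
    return (mean_loss_nats / math.log(2)) * (num_tokens / num_bytes)

# A tokenizer averaging 4 bytes/token with a per-token loss of 2.77 nats
# scores about 1.0 bpb, directly comparable to any other tokenizer.
print(bits_per_byte(2.77, num_tokens=1000, num_bytes=4000))
```

This is why the agent can safely experiment with the training pipeline without the metric drifting under its feet.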
## Limitations
- Single continuous session: context degrades over many iterations.
- Finds local optima, not novel approaches — creativity is fully out of scope.
- Metric quality is still a human judgment call; a bad metric means wasted compute.
- Only works when experiments are fast and cheap to run.
## Related Approaches
Evolver — a more sophisticated take: the user can propose hypotheses, and multiple agents run experiments in parallel. More powerful, but less hands-off.
Ralph Wiggum — similar loop paradigm but with binary criteria: the agent iterates until a fixed test passes rather than optimizing a continuous metric. Same principle, different termination logic.
Claude Autoresearch — a Claude Code skill that generalizes AutoResearch beyond ML to any domain with a measurable metric. Adds a Guard mechanism: a secondary command that must pass to prevent regressions while the primary metric improves.
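A Guard-style acceptance rule can be sketched as follows. This is my own illustration of the idea, not the skill's actual implementation: a change is kept only if the primary metric improves and a secondary command (e.g. a regression test suite) still exits cleanly.

```python
import subprocess

def accept_change(new_metric: float, best_metric: float, guard_cmd: list[str]) -> bool:
    """Accept only if the primary metric improved AND the guard command exits 0.

    The guard blocks changes that game the metric at the expense of some
    other property the experimenter cares about.
    """
    if new_metric >= best_metric:  # lower is better, as with val_bpb
        return False
    guard = subprocess.run(guard_cmd)
    return guard.returncode == 0
```

Splitting "did it improve?" from "did it break anything?" keeps the optimizer honest without complicating the primary metric.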