Karpathy's Autoresearch — Lesson
Study notes from exploring karpathy/autoresearch. The goal: extract the general pattern behind autonomous LLM-driven research and map it onto other problem domains (including radiology ML).
1. TL;DR — The formula distilled
Section titled “1. TL;DR — The formula distilled”Karpathy’s autoresearch is not really about “LLMs training LLMs.” It is a more general recipe:
Autonomous optimization = (Frozen evaluator) + (Narrow mutation surface) + (Bounded trial cost) + (Reversible state via version control) + (Persistent log / memory) + (Autonomy directive — no human-in-the-loop)The LLM is just the search operator. The cleverness is in the environment. Strip out the “train a GPT” domain, and the same scaffold drives prompt optimization, query tuning, medical model search, build-time reduction, etc.
2. How the Karpathy mechanism actually works
Section titled “2. How the Karpathy mechanism actually works”2.1 Three files, three roles
Section titled “2.1 Three files, three roles”┌────────────────────┬────────────────────────────┬─────────────────────────┐│ File │ Mutable by │ Role in the loop │├────────────────────┼────────────────────────────┼─────────────────────────┤│ prepare.py │ Nobody (read-only) │ Frozen evaluator ││ train.py │ LLM agent │ Mutation surface ││ program.md │ Human researcher │ Agent policy / "org" │└────────────────────┴────────────────────────────┴─────────────────────────┘prepare.pyencodes the ground-truth fitness function (evaluate_bpb) and runtime constants (seq len, time budget). The agent cannot modify it → it cannot game the metric.train.pyis the agent’s entire playground: architecture, optimizer, hyperparameters, training loop.program.mdis where the human iterates — not on the model, but on how the agent researches. This is meta-programming: you’re shaping the search policy, not the search target.
2.2 The experiment loop (from program.md)
Section titled “2.2 The experiment loop (from program.md)”LOOP FOREVER: 1. Read git state (current branch/commit) 2. Hack train.py with a new idea 3. git commit 4. uv run train.py > run.log 2>&1 ← 5-min wall clock 5. grep "^val_bpb:" run.log ← extract scalar fitness 6. If crashed → tail run.log, maybe fix, else log "crash" and move on 7. Record row in results.tsv 8. If val_bpb improved → keep branch advanced 9. Else → git reset back to last kept state2.3 What each design choice buys you
Section titled “2.3 What each design choice buys you”| Choice | Why it matters |
|---|---|
| Fixed 5-min time budget | Experiments become apples-to-apples regardless of what the agent changes (depth, batch, architecture). Also caps blast radius per trial — ~12 trials/hour. |
| Single writable file | Shrinks the diff surface → reviewable, revertable, less room for the agent to wander off-task. |
Single scalar metric (val_bpb) | No subjective judgment needed. Lower is better. Vocab-size-independent so architectural changes are fair. |
| Git as state machine | commit = accept mutation, reset --hard = reject. Version control is the evolutionary memory. No custom infra needed. |
results.tsv log (untracked) | Long-horizon memory across iterations so the agent doesn’t repeat dead ends. |
| ”NEVER STOP” directive | Removes the default LLM behavior of checking in for permission. This is the single most important line in program.md for overnight autonomy. |
| Simplicity criterion | program.md explicitly rewards deletions that maintain performance. Prevents code bloat / complexity creep over 100+ iterations. |
| Fast-fail (NaN or loss > 100) | The script self-aborts on obviously broken runs — saves time budget for real ideas. |
3. The generalization
Section titled “3. The generalization”Any problem where you can answer all four of these questions can be driven by the same mechanism:
- Can you define a scalar fitness function that you trust? (Lower-is-better or higher-is-better, unambiguous, reproducible.)
- Can you freeze the evaluator so the agent can’t modify it? (Read-only files, pinned dependencies, pinned data split.)
- Can you bound each trial in time/cost? (Wall-clock budget, max tokens, max API calls, max compute $.)
- Can you cheaply rollback a failed trial? (Git, DB snapshots, config diffs, container image tags.)
If yes → the Karpathy loop applies, with the LLM as the mutation operator.
The abstract skeleton
Section titled “The abstract skeleton”┌────────────────────────────────────────────────────────────────────┐│ ││ ┌────────────┐ ┌─────────────┐ ┌─────────────────┐ ││ │ Frozen │ │ Mutation │ │ Agent policy │ ││ │ evaluator │◄─────│ surface │◄──────│ (program.md- │ ││ │ (metric) │ │ (the file │ │ equivalent) │ ││ └─────┬──────┘ │ agent edits)│ └─────────────────┘ ││ │ └──────┬──────┘ ││ │ │ ││ ▼ ▼ ││ ┌──────────────┐ ┌──────────────┐ ││ │ Scalar score │◄────│ Bounded trial│ ││ └──────┬───────┘ └──────────────┘ ││ │ ││ ▼ ││ Improve? ── yes ──► commit (advance) ││ │ ││ └── no ──► reset (discard) ││ ││ Log every trial → persistent memory across iterations │└────────────────────────────────────────────────────────────────────┘4. Use cases this pattern fits
Section titled “4. Use cases this pattern fits”4.1 Prompt & agent optimization (closest analogue)
Section titled “4.1 Prompt & agent optimization (closest analogue)”- Fitness: task accuracy on a held-out eval set (BFCL-style, or a private rubric).
- Frozen: the eval harness + dataset.
- Mutation surface: a single
prompt.mdortools.json. - Trial: one evaluation sweep, capped at N examples or wall-clock.
- Rollback: git.
- Relation to Karpathy’s setup: nearly identical — just swap
train.pyforprompt.md.
4.2 Medical / radiology ML model tuning (directly relevant to your work)
Section titled “4.2 Medical / radiology ML model tuning (directly relevant to your work)”- Fitness: AUC / sensitivity at fixed specificity / Dice score on a frozen validation split of a de-identified in-house dataset.
- Frozen: the split, the pre-processing, the metric code — critically important for regulatory credibility.
- Mutation surface: a single
model.py(architecture + augmentation + loss). - Trial: one training run capped at, say, 15 min on the AI unit’s GPU.
- Rollback: git.
- Caveats unique to this domain:
- You must prevent test-set leakage — the agent must never see test labels. This maps directly onto
prepare.pybeing read-only. - Clinical safety means some “wins” on val may still be rejected on clinical grounds — consider a human gate on
keepdecisions for anything that will touch production. - Log every experiment with commit + metric + model weights’ hash for a defensible audit trail (QMS / IEC 62304 friendly).
- You must prevent test-set leakage — the agent must never see test labels. This maps directly onto
4.3 Database query / index tuning
Section titled “4.3 Database query / index tuning”- Fitness: p95 query latency on a fixed workload.
- Frozen: the workload generator + schema.
- Mutation surface: an
indexes.sqlorplan_hints.yml. - Trial: replay the workload for N seconds.
4.4 Frontend performance budget
Section titled “4.4 Frontend performance budget”- Fitness: Lighthouse score, or TTFB + bundle size as a weighted scalar.
- Frozen: the e2e test + measurement harness.
- Mutation surface: webpack/vite config + critical components.
- Trial: one build + benchmark.
4.5 CI / build-time reduction
Section titled “4.5 CI / build-time reduction”- Fitness: wall-clock time for
make teston a fixed revision. - Frozen: the test suite and the revision under test.
- Mutation surface: build configs, cache hints, Dockerfile layers.
- Trial: one clean build.
4.6 Domain-specific prompt libraries for clinical documentation
Section titled “4.6 Domain-specific prompt libraries for clinical documentation”- Fitness: structured-extraction F1 against a frozen gold set of radiology reports.
- Frozen: gold-annotated reports + scoring script.
- Mutation surface: a single prompt/template file.
5. When this pattern does not work
Section titled “5. When this pattern does not work”Worth being explicit — this is not a universal hammer:
- No trustworthy scalar metric. If the “improvement” requires human taste (UI aesthetics, prose quality in some dimensions), the loop will over-fit to whatever proxy you define. Example: auto-”improving” a piece of writing by a readability score usually produces worse writing.
- Evaluator is slow or expensive. If one trial costs $50 or takes 4 hours, you cannot get the statistical density that makes evolutionary search work. Karpathy’s 5-min budget is what makes 100 trials/night feasible.
- State is not cheaply reversible. Production DB migrations, sent emails, published artifacts — do not put these inside the loop.
- Overfitting to validation. With 100+ trials all optimizing the same eval split, you will overfit it. Keep a separate, agent-inaccessible test split and evaluate the final “kept” result on it manually.
- Reward hacking. The agent may find degenerate shortcuts (e.g., reducing vocab to shrink the metric’s denominator). The
val_bpbmetric is specifically designed to be shortcut-resistant — note that vocab-size-independence was not an accident. Design your metric with the same suspicion.
6. Design lessons (what to learn from)
Section titled “6. Design lessons (what to learn from)”-
Separate the frozen, the mutable, and the policy into three files. This is the architectural insight.
prepare.py/train.py/program.mdis a template worth copying verbatim. -
Program the policy, not the solution. The human’s job is to improve
program.md— the research org’s code — nottrain.py. This is the meta-level where leverage lives. -
Scalar metric + wall-clock budget → throughput. Aim for >10 trials/hour. Below that, the LLM’s variance in idea quality dominates and you may as well write the code yourself.
-
Git is the evolutionary state machine. Don’t build a custom experiment tracker. A branch + commits +
results.tsvis enough. Complexity here is pure cost. -
“NEVER STOP” is a feature, not a personality trait. Explicitly instruct the agent not to seek permission mid-loop. This is counter to default alignment behavior and must be stated.
-
Log every trial, including the failures. The
results.tsvis how the agent avoids re-trying a known-dead idea at trial #73. Without the log, exploration regresses to the mean. -
Keep the simplicity criterion explicit. Without it, 100 iterations of “tiny improvements at any cost” produces unreadable code. Karpathy’s trick: reward deletions that preserve score.
7. A radiology-specific sketch (for later discussion)
Section titled “7. A radiology-specific sketch (for later discussion)”If you wanted to build an “autoresearch for radiology model tuning” in your AI unit:
┌──────────────────────┬──────────────────────────────────────────────┐│ prepare.py analogue │ Loads de-identified dataset, computes ││ (frozen) │ AUC/Sensitivity@Spec/DiceScore on a pinned ││ │ val split. Read-only. │├──────────────────────┼──────────────────────────────────────────────┤│ train.py analogue │ A single file containing model, augmentations,││ (agent-mutable) │ optimizer, loss. ~500 lines, one GPU. │├──────────────────────┼──────────────────────────────────────────────┤│ program.md analogue │ Domain guardrails: "do not disable ││ (human-mutable) │ augmentation of lesion masks", "keep spatial ││ │ resolution ≥ X", "flag any change that ││ │ reduces sensitivity below baseline for ││ │ clinical review before keep". │├──────────────────────┼──────────────────────────────────────────────┤│ Trial budget │ 10–20 min / run on local H100. │├──────────────────────┼──────────────────────────────────────────────┤│ Keep/discard rule │ Primary: AUC on val. Gate: sensitivity ││ │ floor; human review for clinical-risk wins. │└──────────────────────┴──────────────────────────────────────────────┘The open question is the human gate — medical AI is one of the domains where fully autonomous keep/discard is probably inappropriate. A reasonable compromise: autonomous within a sandbox of architecture/hyperparameter changes, human-gated for anything that changes preprocessing, loss function, or data handling.
8. The mental model to remember
Section titled “8. The mental model to remember”Karpathy didn’t build an autonomous researcher. He built an environment in which autonomous research emerges from a standard LLM + git + a scalar metric.
The LLM is the cheap part. The environment design is the expensive, transferable insight — and it ports to almost any optimization-shaped problem where you can name a number and freeze a rule.