Autoresearch: AI That Does ML Research While You Sleep
Give an AI agent a real LLM training setup. Let it experiment autonomously overnight. Wake up to a better model and a log of everything it tried.
The Autonomous Researcher
What if you could hand an AI a real machine learning codebase, go to sleep, and wake up to find it had run 100 experiments - keeping the wins, discarding the losses, and steadily pushing toward a better model?
That is exactly what Andrej Karpathy's autoresearch does. The project gives a coding agent (like Claude or Codex) a small but real LLM training setup - a simplified, single-GPU implementation of nanochat - and sets it loose to iterate autonomously.
Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began.
- @karpathy, March 2026

The setup is deliberately minimal. The agent modifies a single Python file, runs training for a fixed 5-minute budget, checks whether the validation metric improved, and either keeps or reverts the change. Then it does it again. And again. Roughly 12 experiments per hour, about 100 over a night's sleep.
The core insight is a shift in what you, the human, actually do. You are not writing Python anymore. You are writing program.md - a Markdown file that instructs the AI agent on how to be a researcher. You are programming the researcher, not the research.
Three Files, One Idea
The repo is deliberately kept small. Only three files matter, each with a clearly defined role and ownership boundary:
- prepare.py - data preparation and the evaluation harness. Owned by the human; the agent may not touch it.
- train.py - the model and training loop. The agent's playground; anything here is fair game.
- program.md - the research protocol. Written by the human; it tells the agent how to behave as a researcher.
This separation is what makes the experiment trustworthy. The evaluation function in prepare.py is the immovable ground truth. The agent cannot game the metric by changing how it is calculated. It can only improve the model.
The metric itself is val_bpb (validation bits per byte) - lower is better. Because it is calculated in bits per byte rather than per token, it remains comparable even if the agent changes the vocabulary size or tokenizer configuration.
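To see why dividing by bytes rather than tokens keeps the metric comparable, here is a minimal sketch of how such a bits-per-byte value can be computed from a summed validation loss. The function name and arguments are illustrative, not the actual code in prepare.py.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over the
    validation set into bits per byte of raw text.

    Dividing by the byte count rather than the token count keeps the
    metric comparable across tokenizers: a coarser tokenizer emits
    fewer tokens, but the bytes it has to explain stay the same.
    """
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Example: 1.2e6 nats of summed loss over 500,000 bytes of validation text
# gives roughly 3.46 bits per byte (lower is better).
```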
The Experiment Loop
Once set up, the agent enters an infinite loop. Each iteration follows the same disciplined protocol (a minimal code sketch follows the list):
- Hypothesize. The agent examines the current state of train.py and the experiment history, then comes up with an idea - change the learning rate, try a different activation function, adjust the model depth, etc.
- Edit and commit. It modifies train.py and creates a git commit. This gives a clean checkpoint to revert to if needed.
- Train for 5 minutes. It runs uv run train.py. Training always runs for exactly 5 minutes of wall-clock time, regardless of what the agent changed.
- Evaluate. It reads the output metric: val_bpb. Lower is better.
- Keep or discard. If val_bpb improved, the agent advances the branch and keeps the commit. If it got worse or stayed the same, it reverts with git reset.
- Log and repeat. Results go into results.tsv, and the loop starts again. The agent never stops to ask if it should continue.
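The six steps above amount to a small control loop. Here is a minimal Python sketch of that loop; the sh helper, the read_val_bpb placeholder, and the file it parses are illustrative stand-ins rather than code from the repository - only uv run train.py, git reset, and results.tsv come from the actual protocol.

```python
import csv
import subprocess

def sh(cmd: str) -> str:
    """Run a shell command and return its stdout (raises on failure)."""
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout

def read_val_bpb() -> float:
    """Placeholder: parse the val_bpb that train.py reports.
    The real agent reads it from the training run's output."""
    return float(open("val_bpb.txt").read().strip())  # hypothetical path

best_bpb = float("inf")
while True:                                    # the agent never stops on its own
    # 1-2. Hypothesize and edit: the agent modifies train.py, then commits
    #      a checkpoint it can cleanly revert to.
    sh("git add train.py && git commit -m 'experiment'")
    # 3. Train for the fixed 5-minute wall-clock budget.
    sh("uv run train.py")
    # 4. Evaluate: read the validation bits per byte (lower is better).
    val_bpb = read_val_bpb()
    # 5. Keep the commit if the metric improved, otherwise revert it.
    if val_bpb < best_bpb:
        best_bpb = val_bpb
    else:
        sh("git reset --hard HEAD~1")
    # 6. Log the result and go again.
    with open("results.tsv", "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([val_bpb, best_bpb])
```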
A critical detail: the agent is instructed to never stop. Once the loop begins, it does not pause to ask the human anything. The human might be asleep. The agent is fully autonomous - if it runs out of ideas, it is expected to think harder, reread the code for new angles, try combining near-misses, or attempt more radical changes.
The Fixed Time Budget
Every experiment runs for exactly 5 minutes of wall-clock training time (excluding startup and compilation overhead). This is the single most important design decision in the project, and it has two key consequences:
Experiments are directly comparable
Because the time budget is fixed, every change the agent makes is evaluated on equal footing. It does not matter if the agent doubles the model size, halves the batch size, or swaps the architecture entirely - the question is always the same: in 5 minutes of training, how good is this model?
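One way to make "equal footing" concrete is a training loop that stops on wall-clock time rather than step count. A minimal sketch, assuming the real train.py may handle startup and compilation overhead differently:

```python
import time

TRAIN_SECONDS = 5 * 60  # the fixed wall-clock budget shared by every experiment

def train_with_budget(model, batches, step_fn, budget_s: float = TRAIN_SECONDS) -> int:
    """Run optimizer steps until the wall-clock budget runs out.

    Because every experiment gets the same budget, the only question is:
    how good is the model after 5 minutes of training? A smaller model
    that fits more steps into the window can beat a larger one.
    """
    start = time.monotonic()       # start the clock after setup/compilation
    steps = 0
    for batch in batches:
        if time.monotonic() - start >= budget_s:
            break
        step_fn(model, batch)      # one forward/backward/update step
        steps += 1
    return steps
```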
Hardware-specific optimization
The flip side is that results are not comparable across different hardware. An H100 will complete far more training steps in 5 minutes than an RTX 3060. This means autoresearch naturally finds the best model for your specific hardware. A smaller, faster-training model might beat a larger one on weaker hardware simply by fitting more optimizer steps into the budget.
This is not a bug - it is a feature. The MLX fork's results demonstrate this beautifully: on Apple Silicon, reducing depth from 8 to 4 dropped val_bpb from 2.533 to 1.808 because the smaller model completed many more training steps in the same time window.
Programming the Program
The most philosophically interesting file in the repository is program.md. It is a Markdown document - not code - that defines the agent's entire research protocol. Karpathy describes it as a "super lightweight skill."
The default program.md is intentionally bare-bones. It tells the agent how to set up, what files to read, how to run experiments, and how to log results. But the real power is in how you would iterate on it:
- Research strategy. You could guide the agent toward specific research directions - "focus on optimizer improvements this run" or "explore architectural changes systematically."
- Multi-agent setups. You could define multiple agents with different program.md files, each exploring a different axis of the search space (see the sketch after this list).
- Meta-optimization. Over time, you iterate on the program itself, finding the "research org code" that achieves the fastest research progress. You are optimizing the optimizer.
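As a toy illustration of the multi-agent idea, the sketch below launches several agent processes, each handed a different program file. The AGENT_CMD invocation and the program filenames are hypothetical; in practice each agent would also need its own working copy of the repo and its own GPU.

```python
import subprocess

# Hypothetical command that starts a coding agent and points it at a program
# file; substitute whatever invocation your agent of choice actually uses.
AGENT_CMD = ["my-coding-agent", "--instructions"]

# One research program per agent, each exploring a different axis.
PROGRAMS = [
    "program_optimizers.md",    # "focus on optimizer improvements this run"
    "program_architecture.md",  # "explore architectural changes systematically"
    "program_schedule.md",      # "experiment with batch size and learning-rate schedules"
]

# Each agent runs its own keep-or-revert loop until stopped.
procs = [subprocess.Popen(AGENT_CMD + [program]) for program in PROGRAMS]
for proc in procs:
    proc.wait()
```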
What the Agent Can and Cannot Do
The boundaries are deliberately tight, making the setup both safe and focused:
Can Do
- Modify train.py (architecture, optimizer, hyperparameters, anything)
- Change model size, batch size, depth, width
- Swap activation functions, attention patterns
- Restructure the training loop
Cannot Do
- Modify prepare.py (evaluation is sacred)
- Install new packages
- Change the evaluation harness
- Exceed the 5-minute time budget
This constraint set is the key to why autoresearch works as an autonomous system. The agent has real freedom to explore - anything about the model and training is fair game - but it cannot break the rules of the experiment itself.
Notable Forks
The original autoresearch targets a single NVIDIA GPU (tested on H100). But the community quickly ported it to other platforms, each fork solving different hardware challenges. These forks are one of the most interesting outcomes of the project - they show how the same autonomous research loop adapts to wildly different compute environments.
What the Forks Reveal
The MLX fork's results are particularly illuminating. Overnight runs on different Apple Silicon chips converged on different optimal configurations:
- M4 Max machines settled on AdamW-only (no Muon), low matrix learning rate, 3x MLP ratio, and no logit cap.
- Mac Mini (longer run) favored Muon optimizer, sharper attention, smaller MLP, and lower scalar learning rate - a meaningfully different recipe.
Some Mac Mini findings did not transfer cleanly to the M4 Max baseline. This is exactly the kind of hardware-specific behavior the fixed-time-budget design is useful for uncovering. Each platform finds its own optimal point in the architecture-hyperparameter space.
Advice for Smaller Hardware
Karpathy provides explicit guidance for running on smaller compute. The key adjustments for consumer hardware (collected into a hypothetical config sketch after the list):
- Use a lower-entropy dataset like TinyStories (GPT-4 generated short stories) for reasonable results with small models
- Reduce vocab_size (down to 4096, 2048, or even 256 for byte-level)
- Lower MAX_SEQ_LEN (even to 256) and compensate with a larger DEVICE_BATCH_SIZE
- Reduce DEPTH (the primary complexity knob) from 8 to 4
- Use WINDOW_PATTERN = "L" instead of "SSSL" (alternating banded attention is inefficient on weaker hardware)
- Lower TOTAL_BATCH_SIZE (keep it a power of 2, down to ~16K)
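Collected into one place, those adjustments might look like the hypothetical override block below. The names mirror the knobs mentioned above, but the exact variables, defaults, and values should be checked against train.py rather than taken from here.

```python
# Hypothetical consumer-hardware overrides for the knobs discussed above;
# verify the actual variable names and defaults in train.py.
vocab_size        = 4096      # smaller vocab (2048, or even 256 for byte-level)
MAX_SEQ_LEN       = 256       # shorter sequences ...
DEVICE_BATCH_SIZE = 32        # ... offset by a larger per-device batch (illustrative value)
DEPTH             = 4         # the primary complexity knob, down from 8
WINDOW_PATTERN    = "L"       # "SSSL" alternating banded attention is inefficient here
TOTAL_BATCH_SIZE  = 16_384    # keep it a power of 2, down to ~16K
```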
Why This Matters
Autoresearch is a small project - three files, one GPU, one metric. But the idea it demonstrates is significant.
Traditional ML research is a human-in-the-loop process: think of an idea, implement it, run the experiment, analyze results, think of the next idea. Each cycle might take hours or days of human attention. Autoresearch collapses the entire cycle into a 5-minute automated loop. The human's role shifts from "doing experiments" to "designing the process that does experiments."
The layering is what makes it work:
- prepare.py provides immutable ground truth (you cannot game the metric)
- train.py gives the agent real creative freedom (anything about the model is fair game)
- program.md lets the human steer without micromanaging (define the process, not the steps)
- Git provides the safety net (every experiment is a commit, every failure is a revert)
The result is something like a very fast, very focused research assistant that never gets tired, never forgets to log results, and will keep running experiments as long as you let it. It is not AGI. It is not replacing researchers. But it is a compelling demonstration of what happens when you give a capable coding agent a well-structured problem and get out of the way.