Karpathy / March 2026

Autoresearch: AI That Does ML Research While You Sleep

Give an AI agent a real LLM training setup. Let it experiment autonomously overnight. Wake up to a better model and a log of everything it tried.

The Autonomous Researcher

What if you could hand an AI a real machine learning codebase, go to sleep, and wake up to find it had run 100 experiments - keeping the wins, discarding the losses, and steadily pushing toward a better model?

That is exactly what Andrej Karpathy's autoresearch does. The project gives a coding agent (like Claude or Codex) a small but real LLM training setup - a simplified, single-GPU implementation of nanochat - and sets it loose to iterate autonomously.

Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began.

- @karpathy, March 2026

The setup is deliberately minimal. The agent modifies a single Python file, runs training for a fixed 5-minute budget, checks if the validation metric improved, and either keeps or reverts the change. Then it does it again. And again. Approximately 12 experiments per hour, roughly 100 over a night's sleep.

The core insight is a shift in what you, the human, actually do. You are not writing Python anymore. You are writing program.md - a Markdown file that instructs the AI agent on how to be a researcher. You are programming the researcher, not the research.

Three Files, One Idea

The repo is deliberately kept small. Only three files matter, each with a clearly defined role and ownership boundary:

  • prepare.py (fixed). Data prep, tokenizer training, dataloader, and the evaluation function. This is the ground truth - nobody touches it.
  • train.py (agent edits). The full GPT model, optimizer (Muon + AdamW), and training loop. Architecture, hyperparameters, batch size - everything is fair game.
  • program.md (human edits). Instructions for the agent: how to set up experiments, what to try, when to keep or discard. The "research org code."

Figure 1 - Repository Architecture
The human writes program.md to instruct the agent. The agent edits only train.py. prepare.py provides fixed evaluation - the ground truth that neither party can touch.

This separation is what makes the experiment trustworthy. The evaluation function in prepare.py is the immovable ground truth. The agent cannot game the metric by changing how it is calculated. It can only improve the model.

The metric itself is val_bpb (validation bits per byte) - lower is better. Because it is calculated in bits per byte rather than per token, it remains comparable even if the agent changes the vocabulary size or tokenizer configuration.
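
The conversion behind this is straightforward. A minimal sketch, assuming the evaluation loop sums cross-entropy loss in nats over all validation tokens (this is illustrative, not the project's actual evaluation code in prepare.py):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats, over all evaluated
    tokens) into bits per byte of the underlying raw text.

    Dividing by the byte count rather than the token count is what makes
    the metric tokenizer-invariant: a coarser tokenizer produces fewer
    tokens with higher loss each, but the same total bits are spread over
    the same number of bytes.
    """
    total_bits = total_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Toy example: 100 tokens at 2.0 nats each, covering 400 bytes of text.
val_bpb = bits_per_byte(100 * 2.0, 400)
```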

The Experiment Loop

Once set up, the agent enters an infinite loop. Each iteration follows the same disciplined protocol:

  1. Hypothesize. The agent examines the current state of train.py and the experiment history, then comes up with an idea - change the learning rate, try a different activation function, adjust the model depth, and so on.
  2. Edit and commit. It modifies train.py and creates a git commit. This gives a clean checkpoint to revert to if needed.
  3. Train for 5 minutes. It runs uv run train.py. The training always runs for exactly 5 minutes of wall-clock time, regardless of what the agent changed.
  4. Evaluate. It reads the output metric: val_bpb. Lower is better.
  5. Keep or discard. If val_bpb improved, the agent advances the branch and keeps the commit. If it got worse or stayed the same, it reverts with git reset.
  6. Log and repeat. Results go into results.tsv, and the loop starts again. The agent never stops to ask if it should continue.
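
Steps 4-6 boil down to a comparison against the running best. A minimal Python sketch of that control flow, with the training run and git side effects stubbed out (these function names are illustrative, not the repo's actual interface):

```python
def keep_or_discard(best_bpb: float, new_bpb: float) -> bool:
    """Step 5: keep only strict improvements; ties and regressions revert."""
    return new_bpb < best_bpb

def run_session(experiment_results, best_bpb):
    """Replay a sequence of experiment outcomes (val_bpb readings) through
    the keep-or-discard protocol. In the real loop each reading comes from
    a 5-minute training run; a win advances the git branch and a loss
    triggers a git reset - here those side effects are only noted.
    Returns the final best metric and a (val_bpb, kept) log."""
    log = []
    for new_bpb in experiment_results:
        kept = keep_or_discard(best_bpb, new_bpb)
        if kept:
            best_bpb = new_bpb   # advance the branch, keep the commit
        # else: revert the working tree (git reset) and move on
        log.append((new_bpb, kept))  # one row of results.tsv
    return best_bpb, log
```

Note the strict inequality: a run that merely ties the current best is reverted, matching step 5 ("if it got worse or stayed the same").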
Figure 2 - The Experiment Loop
Each experiment follows the same cycle - hypothesize, edit, train, evaluate, keep or discard. The branch advances on wins and reverts on losses.

A critical detail: the agent is instructed to never stop. Once the loop begins, it does not pause to ask the human anything. The human might be asleep. The agent is fully autonomous - if it runs out of ideas, it is expected to think harder, reread the code for new angles, try combining near-misses, or attempt more radical changes.

The simplicity criterion. All else being equal, simpler is better. A tiny improvement that adds ugly complexity is not worth it. But a tiny improvement from deleting code? Definitely keep. The agent is encouraged to value elegance alongside raw metric improvement.

The Fixed Time Budget

Every experiment runs for exactly 5 minutes of wall-clock training time (excluding startup and compilation overhead). This is the single most important design decision in the project, and it has two key consequences:

Experiments are directly comparable

Because the time budget is fixed, every change the agent makes is evaluated on equal footing. It does not matter if the agent doubles the model size, halves the batch size, or swaps the architecture entirely - the question is always the same: in 5 minutes of training, how good is this model?
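
In code, this means the training loop terminates on wall-clock time, not on a step count. A minimal sketch of the idea (not the repo's actual timing code; step_fn is a stand-in for one forward/backward/update):

```python
import time

TRAIN_SECONDS = 300  # the fixed 5-minute budget, in wall-clock seconds

def train_for_budget(step_fn, budget_s: float = TRAIN_SECONDS) -> int:
    """Run optimizer steps until the wall-clock budget is spent.
    Changing the model only changes how many steps fit in the window,
    never how long training runs. Returns the number of completed steps."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()   # one training step; cheaper steps => more of them
        steps += 1
    return steps
```

Under this scheme a smaller model with a cheaper step_fn simply completes more steps in the same window, which is exactly why a depth-4 model can beat a depth-8 model on slow hardware.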

Hardware-specific optimization

The flip side is that results are not comparable across different hardware. An H100 will complete far more training steps in 5 minutes than an RTX 3060. This means autoresearch naturally finds the best model for your specific hardware. A smaller, faster-training model might beat a larger one on weaker hardware simply by fitting more optimizer steps into the budget.

This is not a bug - it is a feature. The MLX fork's results demonstrate this beautifully: on Apple Silicon, reducing depth from 8 to 4 dropped val_bpb from 2.533 to 1.808 because the smaller model completed many more training steps in the same time window.

Figure 3 - val_bpb Progress Over Experiments
A simulated overnight run from a baseline of 2.667 val_bpb. Each dot is one 5-minute experiment: green dots are improvements (kept), red are regressions (discarded), gray are crashes. The blue line tracks the current best.

Programming the Program

The most philosophically interesting file in the repository is program.md. It is a Markdown document - not code - that defines the agent's entire research protocol. Karpathy describes it as a "super lightweight skill."

The default program.md is intentionally bare-bones. It tells the agent how to set up, what files to read, how to run experiments, and how to log results. But the real power is in how you would iterate on it:

  • Research strategy. You could guide the agent toward specific research directions - "focus on optimizer improvements this run" or "explore architectural changes systematically."
  • Multi-agent setups. You could define multiple agents with different program.md files, each exploring a different axis of the search space.
  • Meta-optimization. Over time, you iterate on the program itself, finding the "research org code" that achieves the fastest research progress. You are optimizing the optimizer.
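
To make this concrete, here is a purely hypothetical excerpt of what an iterated program.md might look like - this is not the contents of the repo's actual file, just an illustration of steering the agent via prose:

```markdown
## This run's focus
- Prioritize optimizer experiments: learning-rate schedules, the Muon vs
  AdamW split, weight decay. Defer architectural changes unless stuck.

## Protocol reminders
- One change per experiment; commit before every run.
- Keep only strict val_bpb improvements; prefer the simpler diff on ties.
- Never stop to ask questions. Log every run to results.tsv.
```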

What the Agent Can and Cannot Do

The boundaries are deliberately tight, making the setup both safe and focused:

Can Do

  • Modify train.py (architecture, optimizer, hyperparameters, anything)
  • Change model size, batch size, depth, width
  • Swap activation functions, attention patterns
  • Restructure the training loop

Cannot Do

  • Modify prepare.py (evaluation is sacred)
  • Install new packages
  • Change the evaluation harness
  • Exceed the 5-minute time budget

This constraint set is the key to why autoresearch works as an autonomous system. The agent has real freedom to explore - anything about the model and training is fair game - but it cannot break the rules of the experiment itself.

Notable Forks

The original autoresearch targets a single NVIDIA GPU (tested on H100). But the community quickly ported it to other platforms, each fork solving different hardware challenges. These forks are one of the most interesting outcomes of the project - they show how the same autonomous research loop adapts to wildly different compute environments.

What the Forks Reveal

The MLX fork's results are particularly illuminating. Overnight runs on different Apple Silicon chips converged on different optimal configurations:

  • M4 Max machines settled on AdamW-only (no Muon), low matrix learning rate, 3x MLP ratio, and no logit cap.
  • Mac Mini (longer run) favored Muon optimizer, sharper attention, smaller MLP, and lower scalar learning rate - a meaningfully different recipe.

Some Mac Mini findings did not transfer cleanly to the M4 Max baseline. This is exactly the kind of hardware-specific behavior the fixed-time-budget design is useful for uncovering. Each platform finds its own optimal point in the architecture-hyperparameter space.

Advice for Smaller Hardware

Karpathy provides explicit guidance for running on smaller compute. The key adjustments for consumer hardware:

  • Use a lower-entropy dataset like TinyStories (GPT-4 generated short stories) for reasonable results with small models
  • Reduce vocab_size (down to 4096, 2048, or even 256 for byte-level)
  • Lower MAX_SEQ_LEN (even to 256) and compensate with larger DEVICE_BATCH_SIZE
  • Reduce DEPTH (the primary complexity knob) from 8 to 4
  • Use WINDOW_PATTERN = "L" instead of "SSSL" (alternating banded attention is inefficient on weaker hardware)
  • Lower TOTAL_BATCH_SIZE (keep it a power of 2, down to ~16K)
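
The adjustments above amount to a handful of constants near the top of train.py. A hedged sketch of what that configuration might look like - the names follow the article, but exact spellings and the DEVICE_BATCH_SIZE value are assumptions, not values from the repo:

```python
# Consumer-hardware configuration sketch for train.py.
# Values follow the article's guidance; DEVICE_BATCH_SIZE is a guess.
VOCAB_SIZE        = 4096    # down from the default; 256 would be byte-level
MAX_SEQ_LEN       = 256     # shorter context...
DEVICE_BATCH_SIZE = 32      # ...compensated by a larger per-device batch
DEPTH             = 4       # the primary complexity knob, halved from 8
WINDOW_PATTERN    = "L"     # full attention only; "SSSL" banding is slow here
TOTAL_BATCH_SIZE  = 16_384  # keep it a power of 2, down to ~16K

# Sanity check: the total batch size should be a power of two.
assert TOTAL_BATCH_SIZE & (TOTAL_BATCH_SIZE - 1) == 0
```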

Why This Matters

Autoresearch is a small project - three files, one GPU, one metric. But the idea it demonstrates is significant.

Traditional ML research is a human-in-the-loop process: think of an idea, implement it, run the experiment, analyze results, think of the next idea. Each cycle might take hours or days of human attention. Autoresearch collapses the entire cycle into a 5-minute automated loop. The human's role shifts from "doing experiments" to "designing the process that does experiments."

The layering is what makes it work:

  • prepare.py provides immutable ground truth (you cannot game the metric)
  • train.py gives the agent real creative freedom (anything about the model is fair game)
  • program.md lets the human steer without micromanaging (define the process, not the steps)
  • Git provides the safety net (every experiment is a commit, every failure is a revert)

The result is something like a very fast, very focused research assistant that never gets tired, never forgets to log results, and will keep running experiments as long as you let it. It is not AGI. It is not replacing researchers. But it is a compelling demonstration of what happens when you give a capable coding agent a well-structured problem and get out of the way.