We trained SmolLM2-360M-Instruct on UltraFeedback preference
pairs using Karpathy's autoresearch
— a Claude agent that edits a training script, runs experiments, and iterates
against a held-out preference-accuracy metric (val_pref_acc).
The agent ran 50 experiments and rejected most of the resulting checkpoints
because they didn't beat the starting recipe. We kept 2 of its best, which sit
in a narrow val_pref_acc range (0.464 to 0.492) because the agent didn't push
the metric far. A researcher then asked the agent to inspect what it had tried
and look for gaps. The agent's response (LoRA adapters and high-margin data
filtering, categories it hadn't explored on its own) produced 2 more recipes.
Prolific annotators ran all 10 pairwise head-to-heads on 50 general-audience
prompts (1507 ratings).
The metric and humans disagreed even at the bottom of the
leaderboard. By val_pref_acc, the agent's autonomous
recipes sit below the untrained reference's chance level of 0.500 —
Recipe A (val=0.464) and Recipe B
(val=0.492, the agent's committed best).
The metric says they made the model slightly worse than no training.
Humans saw it the other way: Recipe A won 52.7% of head-to-heads
against the untrained base, Recipe B won 52.1% — both
statistically indistinguishable from chance, but pointing the opposite
way from the metric. The Bradley-Terry global ranking puts the untrained
base last, below all four trained recipes.
The researcher-prompted recipes won decisively. Once
the researcher asked the agent to look for gaps, the agent proposed LoRA
adapters + high-margin data filtering. The resulting recipes (C and D)
won over the untrained base at 66.4% and 59.7% respectively —
the only DPO recipes in the study with confidence intervals clear of
chance. Spearman ρ between val_pref_acc and the human-preference
rank across the four trained recipes is +0.80.
But at the top the metric and humans disagree: the highest-scoring recipe
(agent + researcher Recipe D · deeper LoRA + filtered data, val=0.648) is
not the one humans prefer most (agent + researcher Recipe C · LoRA + filtered data,
val=0.628).
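The rank correlation quoted above can be checked by hand. The sketch below is a minimal Spearman ρ for untied data, applied to the four trained recipes; it assumes the human-preference rank is the Bradley-Terry ordering reported in the detail tables, and the function names are illustrative rather than taken from the study's code.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for two equal-length lists without ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Recipes A, B, C, D in that order.
val_pref_acc = [0.464, 0.492, 0.628, 0.648]
bt_strength = [0.862, 1.043, 1.173, 1.113]  # human-preference strength

rho = spearman_rho(val_pref_acc, bt_strength)
print(f"Spearman rho = {rho:+.2f}")
```

Only C and D swap places between the two rankings, so the sum of squared rank differences is 2 and ρ = 1 − 12/60.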
Human-in-the-loop (HITL) input moved the result at two stages. Research-time: the researcher's meta-prompt was the difference between regression and improvement. Evaluation-time: Prolific annotators reversed the metric's ordering of the top two recipes, preferring C where val_pref_acc picked D. Without either step, we'd have shipped a noticeably worse model.
Three stages, three contributors. Each tried to make the model better; each had a blind spot the next stage filled in.
The autoresearch agent edited the training script, launched DPO runs, and iterated across 50 experiments (45 produced valid output; 5 failed). It rejected most checkpoints because they didn't beat the starting recipe. We kept 2 of its best (0.464–0.492). Both stayed below the untrained reference's chance level (0.500) — the agent's autonomous search did not lift the metric over no training at all.
A researcher prompted the agent to inspect its own trajectory and identify what it hadn't tried. The agent's response added LoRA adapters and score-margin-filtered UltraFeedback — categories of change it hadn't explored on its own. The same DPO loop with those additions produced Recipes C and D, lifting val_pref_acc to 0.628–0.648.
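As a rough illustration of score-margin filtering, the sketch below keeps only preference pairs whose chosen response outscores the rejected one by at least a threshold. The field names `score_chosen` and `score_rejected`, the example rows, and the threshold value are assumptions for illustration; the source does not state the schema or the margin the agent actually used.

```python
def filter_high_margin(pairs, min_margin):
    """Keep pairs where the chosen score beats the rejected score by min_margin or more."""
    return [
        pair
        for pair in pairs
        if pair["score_chosen"] - pair["score_rejected"] >= min_margin
    ]


# Hypothetical preference pairs; the scores are made up for the example.
pairs = [
    {"prompt": "p1", "score_chosen": 9.0, "score_rejected": 4.0},
    {"prompt": "p2", "score_chosen": 7.5, "score_rejected": 7.0},
    {"prompt": "p3", "score_chosen": 8.0, "score_rejected": 6.0},
]

# min_margin=2.0 is a placeholder, not the value used in the study.
high_margin = filter_high_margin(pairs, min_margin=2.0)
print([pair["prompt"] for pair in high_margin])
```

The idea is that pairs with a small score gap carry a weaker preference signal, so dropping them trades dataset size for cleaner labels.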
Five models give 10 unique pairs. 10 pairs × 50 prompts × 3 annotators is a nominal 1,500 ratings; we collected 1507 pairwise judgements from 305 participants. Annotators saw responses blind (no labels, no provenance) and picked the one they preferred or called it a tie.
Five models. The label before each name shows which stage produced it.
| Recipe | val_pref_acc | What it is |
|---|---|---|
| untrained Original model (base) | 0.500 | SmolLM2-360M-Instruct as published, with no extra training. |
| agent Recipe A · default DPO (prod-baseline) | 0.464 | DPO at default knobs (β=0.1, lr=5e-7, 1 epoch over a sample of UltraFeedback). One of the agent's earlier checkpoints, kept as a reference. |
| agent Recipe B · agent's pick (prod-adambeta2) | 0.492 | The recipe the autoresearch agent settled on: constant_with_warmup LR + NEFTune + adam_β₂=0.95. The best of the agent's many checkpoints by val_pref_acc. |
| agent + researcher Recipe C · LoRA + filtered data (prod-bestbet) | 0.628 | Came from a researcher-prompted re-run: the researcher asked the agent to inspect its own trajectory and find what it was missing. The agent proposed LoRA (rank 32) + UltraFeedback filtered to high-margin preference pairs. |
| agent + researcher Recipe D · deeper LoRA + filtered data (prod-bestbet-v2) | 0.648 | Same as Recipe C with LoRA rank 64 (a larger adapter, with more capacity to adapt to UltraFeedback's chosen style). |
If val_pref_acc were a perfect predictor of human preference,
the four trained recipes would line up monotonically: the higher the score,
the more often a recipe is preferred over the untrained base. Here's
what 1507 pairwise ratings actually show:
Points colored by provenance: agent recipes (A, B) · agent + researcher recipes (C, D). Dotted lines mark chance: horizontal = 50% human preference vs base, vertical = val_pref_acc of the untrained reference.
Three things stand out:
So the headline isn't "DPO works." It's that the metric and humans disagree at both ends of the leaderboard: at the bottom (where the metric undervalues the agent's autonomous runs) and at the top (where the metric picks D but humans pick C).
agent + researcher Recipe C · LoRA + filtered data and agent + researcher Recipe D · deeper LoRA + filtered data are the same recipe with one knob changed: LoRA rank 32 vs 64 (D has more adapter capacity). By the metric, D is the better model — val_pref_acc rises from 0.628 to 0.648. By humans, D is the worse model — Bradley-Terry strength drops from 1.17 to 1.11.
Tracking three steps along the val_pref_acc axis shows how the metric's signal changes character as it climbs:
| From → To | Δ val_pref_acc | Human pref vs base | Interpretation |
|---|---|---|---|
| agent Recipe A · default DPO → agent Recipe B · agent's pick | +0.028 | 53% → 52% | Small metric gain, no clear effect on humans. |
| agent Recipe B · agent's pick → agent + researcher Recipe C · LoRA + filtered data | +0.136 | 52% → 66% | Big metric gain, clearly visible to humans. |
| agent + researcher Recipe C · LoRA + filtered data → agent + researcher Recipe D · deeper LoRA + filtered data | +0.020 | 66% → 60% | Small metric gain, humans reverse. |
The pattern: the metric is useful for finding the right neighborhood and even tracks human preference within it. At the very top it starts pointing at the wrong door.
UltraFeedback's "chosen" labels — what val_pref_acc is measured
against — come from GPT-4 acting as a judge. So the metric
measures one thing precisely: how often does our policy agree with GPT-4
about which response is better?
GPT-4-as-judge has well-documented systematic biases. It prefers longer responses, more structured ones (bullets, headers), and ones that look thorough and helpful (preambles, comprehensive lists, hedging). Within the sweet spot, optimizing toward GPT-4 is useful — humans agree GPT-4-style structure is often clearer. Past the sweet spot, the model becomes too GPT-4-like: too long, too bulleted, too hedged. That's where humans tap out.
agent + researcher Recipe D · deeper LoRA + filtered data has more capacity to mirror UltraFeedback's chosen style — the larger LoRA adapter is what tips it past the sweet spot. Same recipe, more parameters, higher metric score, worse model by humans.
How often humans preferred the lower-val_pref_acc model, by prompt category. Higher rate = the metric is less reliable in this category (humans more often went the other way).
| Category | Lower-val wins / total | Disagreement rate |
|---|---|---|
| emotional tone | 24 / 40 | 60% |
| pedagogical clarity | 33 / 66 | 50% |
| creative writing | 31 / 66 | 47% |
| sensory descriptive | 18 / 41 | 44% |
| persuasive opinion | 26 / 63 | 41% |
| common factual | 18 / 46 | 39% |
| light planning | 12 / 33 | 36% |
| personal advice | 21 / 72 | 29% |
The metric works reasonably on instructional / advice prompts — the closest match to UltraFeedback's training distribution. It fails hardest on emotional tone and pedagogical clarity, where humans want warm, concise, naturally-flowing language and the metric pushes toward structured-looking thoroughness.
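The disagreement rates in the category table are plain ratios: lower-val wins divided by the total for that category. A minimal sketch that recomputes them from the counts in the table (the variable and function names are illustrative):

```python
# (lower-val wins, total) per prompt category, copied from the table above.
counts = {
    "emotional tone": (24, 40),
    "pedagogical clarity": (33, 66),
    "creative writing": (31, 66),
    "sensory descriptive": (18, 41),
    "persuasive opinion": (26, 63),
    "common factual": (18, 46),
    "light planning": (12, 33),
    "personal advice": (21, 72),
}


def disagreement_rate(lower_val_wins, total):
    """Share of comparisons in which humans preferred the lower-val_pref_acc model."""
    return lower_val_wins / total


rates = {category: disagreement_rate(*pair) for category, pair in counts.items()}

# Print categories from least to most reliable metric signal.
for category, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {rate:.0%}")
```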
Length isn't the whole story either. Higher-val responses are 1.39× longer than lower-val ones in scenarios where humans agreed with the metric, and 1.23× longer in scenarios where humans disagreed. Humans aren't blanket-rejecting longer responses — they reject extra length on prompts where it doesn't earn its keep.
The metric outputs a single number. Humans wrote comments and showed taste at the per-scenario level. Examples below: scenarios where humans preferred the lower-val_pref_acc response.
Annotators left 856 non-empty comments, which were clustered thematically by an LLM (claude-sonnet-4-6). Each theme groups comments by the quality dimension annotators reacted to, not by surface keywords.
The picture isn't "the metric is broken." val_pref_acc
points in the right direction (Spearman ρ = +0.80), so it's a fine
signal to search with. The failure mode is specific: near the top
its ordering stops matching the one humans produce (D above C by the metric,
C above D by Bradley-Terry strength), because at that scale it's measuring
"GPT-4-likeness" more than "humans like this."
The agent's 50-experiment run stayed within standard full-fine-tune DPO and never proposed LoRA or data filtering on its own. The unlock wasn't more compute or a different agent — it was a researcher asking the agent a different question: "look at what you've tried and find what's missing." The agent's response to that prompt was LoRA + data filtering. The search space an agent explores is shaped as much by how a researcher frames it as by the agent's loop itself; periodic meta-prompts — "what aren't you trying?" — are how you widen it.
agent + researcher Recipe C · LoRA + filtered data and agent + researcher Recipe D · deeper LoRA + filtered data are ranked one way by humans and the other way by the metric: C leads on Bradley-Terry strength (1.173 vs 1.113), although the direct C-vs-D head-to-head is within noise (52.1% [43.2, 60.9]). 1507 Prolific pairwise ratings, a few hundred dollars of annotation, flipped the production decision. That's the standard preference-eval setup: humans don't have to agree individually for the aggregate to be informative, and the aggregate is what you ship from.
Not just one stage of the loop:
The combination is what produced the actual best recipe: agent for fast search within a defined space, researcher to widen that space, Prolific to pick the winner at the top, researcher again to understand what won.
The detail behind the summary — per-pair stats, ranking, agreement, category breakdown, methodology.
All 10 unique pairs, nominally 50 prompts × 3 annotators = 150 ratings each (actual counts in the table range from 147 to 153). Win rates exclude ties. "tracks" = higher-val recipe won at > 50% with Wilson 95% CI excluding chance. "flips" = lower-val recipe won. "noisy" = CI overlaps 50%.
| Pair (X vs Y) | Δ val_pref_acc | X wins | Ties | Y wins | Higher-val win rate (95% CI) | Signal |
|---|---|---|---|---|---|---|
| untrained Original vs agent Recipe B | -0.008 | 58 | 31 | 63 | 47.9% [39.2, 56.8] | noisy |
| untrained Original vs agent Recipe A | -0.036 | 62 | 22 | 69 | 47.3% [39.0, 55.8] | noisy |
| untrained Original vs agent + researcher Recipe C | +0.128 | 41 | 30 | 81 | 66.4% [57.6, 74.2] | tracks |
| untrained Original vs agent + researcher Recipe D | +0.148 | 50 | 27 | 74 | 59.7% [50.9, 67.9] | tracks |
| agent Recipe B vs agent + researcher Recipe C | +0.136 | 62 | 22 | 65 | 51.2% [42.6, 59.7] | noisy |
| agent Recipe B vs agent + researcher Recipe D | +0.156 | 67 | 26 | 58 | 46.4% [37.9, 55.1] | noisy |
| agent Recipe A vs agent Recipe B | +0.028 | 55 | 32 | 64 | 53.8% [44.8, 62.5] | noisy |
| agent Recipe A vs agent + researcher Recipe C | +0.164 | 48 | 26 | 73 | 60.3% [51.4, 68.6] | tracks |
| agent Recipe A vs agent + researcher Recipe D | +0.184 | 53 | 21 | 77 | 59.2% [50.6, 67.3] | tracks |
| agent + researcher Recipe C vs agent + researcher Recipe D | +0.020 | 57 | 31 | 62 | 52.1% [43.2, 60.9] | noisy |
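For readers who want to reproduce the intervals, here is a minimal sketch of the Wilson 95% score interval and the signal labels. It assumes ties are excluded from the denominator, which matches the win rates in the table, and it reads "flips" as "the interval lies entirely below 50%", which is an interpretation rather than something the source spells out. The function names are illustrative.

```python
import math


def wilson_ci(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = wins / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


def signal(higher_val_wins, lower_val_wins):
    """Label a pair by where the interval sits relative to chance (ties excluded)."""
    n = higher_val_wins + lower_val_wins
    low, high = wilson_ci(higher_val_wins, n)
    if low > 0.5:
        return "tracks"
    if high < 0.5:
        return "flips"  # assumed reading of the source's "flips"
    return "noisy"


# untrained Original vs Recipe C: C (higher val) won 81, Original won 41.
low, high = wilson_ci(81, 81 + 41)
print(f"{81 / 122:.1%} [{low:.1%}, {high:.1%}] -> {signal(81, 41)}")

# Recipe C vs Recipe D: D (higher val) won 62, C won 57.
print(signal(62, 57))
```

The Wilson interval is a common choice at these sample sizes because, unlike the plain normal approximation, it stays inside [0, 1] and behaves sensibly near the extremes.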
Combines all pairwise outcomes into a single ranking. Higher BT strength = humans preferred this recipe more often across the whole study.
| Rank | Recipe | BT strength | val_pref_acc |
|---|---|---|---|
| 1 | agent + researcher Recipe C | 1.173 | 0.628 |
| 2 | agent + researcher Recipe D | 1.113 | 0.648 |
| 3 | agent Recipe B | 1.043 | 0.492 |
| 4 | agent Recipe A | 0.862 | 0.464 |
| 5 | untrained Original | 0.809 | 0.500 |
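A Bradley-Terry fit over the pairwise table can be sketched with the standard minorization-maximization update. The version below drops ties and rescales strengths to a mean of 1; both choices are assumptions, since the source does not describe its tie handling or normalization, so the fitted values may not match the table exactly. The short keys (O, A, B, C, D) are illustrative labels for the five models.

```python
def bradley_terry(wins, n_iter=500):
    """Fit Bradley-Terry strengths with MM updates.

    wins[(i, j)] is the number of times i beat j. Ties are ignored (assumption).
    Returns a dict of strengths rescaled to have mean 1.
    """
    items = sorted({name for pair in wins for name in pair})
    strength = {name: 1.0 for name in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        mean = sum(updated.values()) / len(updated)
        strength = {name: value / mean for name, value in updated.items()}
    return strength


# Decisive votes from the pairwise table: O = untrained Original, A-D = recipes.
wins = {
    ("O", "B"): 58, ("B", "O"): 63,
    ("O", "A"): 62, ("A", "O"): 69,
    ("O", "C"): 41, ("C", "O"): 81,
    ("O", "D"): 50, ("D", "O"): 74,
    ("B", "C"): 62, ("C", "B"): 65,
    ("B", "D"): 67, ("D", "B"): 58,
    ("A", "B"): 55, ("B", "A"): 64,
    ("A", "C"): 48, ("C", "A"): 73,
    ("A", "D"): 53, ("D", "A"): 77,
    ("C", "D"): 57, ("D", "C"): 62,
}

strengths = bradley_terry(wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
for name in ranking:
    print(name, round(strengths[name], 3))
```

Because the model pools every head-to-head, a recipe can rank first overall without winning each individual matchup.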
Annotators within a single pair frequently disagreed — taste is heterogeneous. Low κ tells you taste varies at the per-scenario level; it doesn't invalidate the aggregate signal, which is averaged over hundreds of ratings.
| Pair | Fleiss' κ | Agreement | n items |
|---|---|---|---|
| untrained Original vs agent Recipe B | κ = +0.152 | poor | 46 |
| untrained Original vs agent Recipe A | κ = +0.242 | fair | 47 |
| untrained Original vs agent + researcher Recipe C | κ = +0.167 | poor | 48 |
| untrained Original vs agent + researcher Recipe D | κ = +0.078 | poor | 49 |
| agent Recipe B vs agent + researcher Recipe C | κ = +0.050 | poor | 47 |
| agent Recipe B vs agent + researcher Recipe D | κ = +0.067 | poor | 49 |
| agent Recipe A vs agent Recipe B | κ = +0.081 | poor | 49 |
| agent Recipe A vs agent + researcher Recipe C | κ = +0.182 | poor | 47 |
| agent Recipe A vs agent + researcher Recipe D | κ = +0.244 | fair | 49 |
| agent + researcher Recipe C vs agent + researcher Recipe D | κ = +0.138 | poor | 50 |
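Fleiss' κ compares observed per-item agreement with the agreement expected by chance from the overall category frequencies. The sketch below assumes each item has the same number of raters and three categories (X preferred, tie, Y preferred); the example counts are made up for illustration and are not study data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    counts[i][j] is how many raters put item i in category j.
    Every item must have the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement: mean share of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Hypothetical items, 3 raters each: columns are [X preferred, tie, Y preferred].
toy_counts = [
    [3, 0, 0],  # unanimous for X
    [0, 3, 0],  # unanimous tie
    [1, 1, 1],  # complete disagreement
]
kappa = fleiss_kappa(toy_counts)
print(f"kappa = {kappa:+.3f}")
```

With only three raters per item, a single dissenting vote moves per-item agreement from 1 to 1/3, which is one reason κ values in this kind of study tend to be low.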
How the preference pattern shifts by prompt type. Higher-val win rate < 50% = humans systematically picked the lower-scoring recipe in that category.
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 3/2/10 | 23% |
| untrained Original vs agent Recipe A | 12/1/2 | 86% |
| untrained Original vs agent + researcher Recipe C | 6/2/8 | 57% |
| untrained Original vs agent + researcher Recipe D | 3/3/9 | 75% |
| agent Recipe B vs agent + researcher Recipe C | 10/1/4 | 29% |
| agent Recipe B vs agent + researcher Recipe D | 11/1/4 | 27% |
| agent Recipe A vs agent Recipe B | 1/3/11 | 92% |
| agent Recipe A vs agent + researcher Recipe C | 3/1/11 | 79% |
| agent Recipe A vs agent + researcher Recipe D | 3/1/11 | 79% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 7/3/5 | 42% |
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 10/3/11 | 48% |
| untrained Original vs agent Recipe A | 13/6/5 | 72% |
| untrained Original vs agent + researcher Recipe C | 12/4/9 | 43% |
| untrained Original vs agent + researcher Recipe D | 10/6/8 | 44% |
| agent Recipe B vs agent + researcher Recipe C | 15/3/6 | 29% |
| agent Recipe B vs agent + researcher Recipe D | 13/3/8 | 38% |
| agent Recipe A vs agent Recipe B | 3/7/14 | 82% |
| agent Recipe A vs agent + researcher Recipe C | 7/6/11 | 61% |
| agent Recipe A vs agent + researcher Recipe D | 5/10/9 | 64% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 12/4/8 | 40% |
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 5/3/7 | 42% |
| untrained Original vs agent Recipe A | 3/2/10 | 23% |
| untrained Original vs agent + researcher Recipe C | 5/1/9 | 64% |
| untrained Original vs agent + researcher Recipe D | 6/2/7 | 54% |
| agent Recipe B vs agent + researcher Recipe C | 8/1/6 | 43% |
| agent Recipe B vs agent + researcher Recipe D | 4/6/5 | 56% |
| agent Recipe A vs agent Recipe B | 6/4/5 | 45% |
| agent Recipe A vs agent + researcher Recipe C | 12/0/2 | 14% |
| agent Recipe A vs agent + researcher Recipe D | 9/2/4 | 31% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 2/3/10 | 83% |
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 7/2/3 | 70% |
| untrained Original vs agent Recipe A | 4/1/7 | 36% |
| untrained Original vs agent + researcher Recipe C | 4/1/7 | 64% |
| untrained Original vs agent + researcher Recipe D | 5/3/4 | 44% |
| agent Recipe B vs agent + researcher Recipe C | 4/3/5 | 56% |
| agent Recipe B vs agent + researcher Recipe D | 6/3/3 | 33% |
| agent Recipe A vs agent Recipe B | 5/1/6 | 55% |
| agent Recipe A vs agent + researcher Recipe C | 4/3/5 | 56% |
| agent Recipe A vs agent + researcher Recipe D | 3/1/9 | 75% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 2/3/7 | 78% |
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 11/6/6 | 65% |
| untrained Original vs agent Recipe A | 6/5/14 | 30% |
| untrained Original vs agent + researcher Recipe C | 4/11/9 | 69% |
| untrained Original vs agent + researcher Recipe D | 12/5/8 | 40% |
| agent Recipe B vs agent + researcher Recipe C | 11/2/12 | 52% |
| agent Recipe B vs agent + researcher Recipe D | 11/3/10 | 48% |
| agent Recipe A vs agent Recipe B | 13/3/8 | 38% |
| agent Recipe A vs agent + researcher Recipe C | 6/8/10 | 62% |
| agent Recipe A vs agent + researcher Recipe D | 9/1/14 | 61% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 17/1/6 | 26% |
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 8/9/8 | 50% |
| untrained Original vs agent Recipe A | 15/4/6 | 71% |
| untrained Original vs agent + researcher Recipe C | 3/6/15 | 83% |
| untrained Original vs agent + researcher Recipe D | 4/5/15 | 79% |
| agent Recipe B vs agent + researcher Recipe C | 5/5/13 | 72% |
| agent Recipe B vs agent + researcher Recipe D | 7/7/10 | 59% |
| agent Recipe A vs agent Recipe B | 7/6/11 | 61% |
| agent Recipe A vs agent + researcher Recipe C | 5/4/15 | 75% |
| agent Recipe A vs agent + researcher Recipe D | 6/4/14 | 70% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 8/8/8 | 50% |
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 9/2/11 | 45% |
| untrained Original vs agent Recipe A | 5/3/14 | 26% |
| untrained Original vs agent + researcher Recipe C | 3/4/14 | 82% |
| untrained Original vs agent + researcher Recipe D | 5/2/14 | 74% |
| agent Recipe B vs agent + researcher Recipe C | 5/2/13 | 72% |
| agent Recipe B vs agent + researcher Recipe D | 4/3/14 | 78% |
| agent Recipe A vs agent Recipe B | 12/6/3 | 20% |
| agent Recipe A vs agent + researcher Recipe C | 10/3/7 | 41% |
| agent Recipe A vs agent + researcher Recipe D | 12/1/8 | 40% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 7/4/10 | 59% |
| Pair | X/tie/Y | Higher-val win rate |
|---|---|---|
| untrained Original vs agent Recipe B | 5/4/7 | 42% |
| untrained Original vs agent Recipe A | 4/0/11 | 27% |
| untrained Original vs agent + researcher Recipe C | 4/1/10 | 71% |
| untrained Original vs agent + researcher Recipe D | 5/1/9 | 64% |
| agent Recipe B vs agent + researcher Recipe C | 4/5/6 | 60% |
| agent Recipe B vs agent + researcher Recipe D | 11/0/4 | 27% |
| agent Recipe A vs agent Recipe B | 8/2/6 | 43% |
| agent Recipe A vs agent + researcher Recipe C | 1/1/12 | 92% |
| agent Recipe A vs agent + researcher Recipe D | 6/1/8 | 57% |
| agent + researcher Recipe C vs agent + researcher Recipe D | 2/5/8 | 80% |
Karpathy's autoresearch uses
val_bpb — held-out cross-entropy in bits per byte — because
it's pretraining. Our task is DPO fine-tuning, so the natural analog is
val_pref_acc: on a held-out preference set, what fraction
does the policy "agree" with — i.e., assign higher implicit reward to the
chosen response than the rejected one. Both metrics measure
fit-to-held-out-data. The question for both is the same: does fit to the
validation set track what people actually want?
| | Karpathy's autoresearch | This study |
|---|---|---|
| Task | Pretraining (LM) | DPO fine-tuning |
| Held-out data | Validation web text | UltraFeedback test_prefs |
| Metric | val_bpb ↓ | val_pref_acc ↑ |
| Interpretation | How well does it predict text? | How often does it agree with the preference labels? |
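To make the metric concrete: under DPO, the implicit reward of a response is β times the log-probability ratio between the policy and the reference model. A minimal sketch of val_pref_acc built on that definition follows. The field names and example numbers are hypothetical, and counting exact ties as half is an assumption made here so that a policy identical to the reference scores 0.500, in line with the chance level quoted for the untrained model.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def val_pref_acc(pairs, beta=0.1):
    """Fraction of held-out pairs where the chosen response gets the higher reward.

    Exact ties count as 0.5 (assumption, see lead-in).
    """
    score = 0.0
    for pair in pairs:
        r_chosen = implicit_reward(
            pair["policy_chosen_logp"], pair["ref_chosen_logp"], beta
        )
        r_rejected = implicit_reward(
            pair["policy_rejected_logp"], pair["ref_rejected_logp"], beta
        )
        if r_chosen > r_rejected:
            score += 1.0
        elif r_chosen == r_rejected:
            score += 0.5
    return score / len(pairs)


# Hypothetical summed log-probabilities for three held-out pairs.
held_out = [
    {"policy_chosen_logp": -40.0, "ref_chosen_logp": -45.0,
     "policy_rejected_logp": -50.0, "ref_rejected_logp": -48.0},  # agrees
    {"policy_chosen_logp": -60.0, "ref_chosen_logp": -55.0,
     "policy_rejected_logp": -30.0, "ref_rejected_logp": -33.0},  # disagrees
    {"policy_chosen_logp": -20.0, "ref_chosen_logp": -22.0,
     "policy_rejected_logp": -25.0, "ref_rejected_logp": -24.0},  # agrees
]
acc = val_pref_acc(held_out)
print(f"val_pref_acc = {acc:.3f}")
```

Note that the metric only checks the sign of the reward gap on each pair, so it rewards agreeing with the labels, not the size of the improvement a human would notice.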
Generation settings: temperature=0.7, top_p=0.9, repetition_penalty=1.05.
The data (… val_pref_acc joined per row) is published on Hugging Face; see the article for the link.