2026-06-11

FrontierCode: Changing the Eval Question from 'Is It Correct' to 'Would You Merge It'

Cognition's FrontierCode uses 'would the maintainer actually merge this' as its signal, folding readability, scope discipline, and codebase conventions into the score. Closer to human code review than pass rates, but it drags subjectivity in with it.

Summary

Cognition, the team behind Devin, has released a new eval called FrontierCode. It does not ask whether a model’s code runs. It asks a sharper question: as the maintainer of this repository, would you actually merge this code into your production branch? Cognition calls it the first benchmark to directly measure mergeability, scoring along correctness, test quality, scope discipline, code style, and adherence to a codebase’s own conventions.

What makes this worth attention is not another leaderboard. For the last couple of years the de facto standard for coding evals has been functional-correctness tests in the SWE-bench mold, where clearing hidden tests counts as a win. FrontierCode’s argument is that once models can routinely produce code that runs, “it runs” stops being a discriminating signal. What is actually scarce, and what actually decides whether AI code reaches production, is whether a human with taste and accountability is willing to own it. Moving the target from “is it right” to “would you take ownership of it” is the judgment in Cognition’s announcement that most deserves to be taken seriously, even though that target invites a good deal of subjectivity in through the front door. That subjectivity is examined under “Why it matters.”

What happened

FrontierCode’s tasks were not scraped programmatically from individual historical pull requests. They were hand-built by the maintainers of 36 flagship open-source repositories: more than 20 contributors, including core maintainers or leads of projects like Celery, Budibase, uppy, and Mattermost. By Cognition’s account each task absorbed more than 40 hours, with maintainers distilling the judgment they apply when reviewing pull requests into concrete grading criteria: any PR that meets these standards is one they would genuinely approve.

Scoring splits into two kinds. Blockers represent hard stops a maintainer would call during review: not only correctness, but non-correctness requirements like performance and scope limits. Non-blockers are quality signals such as code style, type safety, and readability that would not necessarily stop a merge. A solution counts as passing only if it clears every blocker, and its score is the weighted aggregate of the criteria it satisfies; miss any single blocker and the score drops to zero. That two-layer structure, a veto gate plus weighted scoring, mirrors how real review works, where fatal problems kill a PR outright and everything else is graded on quality.
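The veto-plus-weights structure described above can be sketched in a few lines. This is a minimal illustration, not Cognition's grader: the criterion names, the weights, and the choice to normalize the score by total weight are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float
    blocker: bool  # True = hard stop in review, False = quality signal


def score_solution(criteria, satisfied):
    """Return (passed, score) for a solution.

    `satisfied` is the set of criterion names the solution meets.
    Assumption: the score is the satisfied weight divided by total weight.
    """
    # Veto gate: a single missed blocker zeroes the score.
    for c in criteria:
        if c.blocker and c.name not in satisfied:
            return (False, 0.0)
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.name in satisfied)
    score = earned / total if total else 0.0
    return (True, score)


# Hypothetical rubric for one task (names and weights are invented).
RUBRIC = [
    Criterion("fixes_bug", 3.0, blocker=True),
    Criterion("stays_in_scope", 2.0, blocker=True),
    Criterion("type_annotations", 1.0, blocker=False),
    Criterion("readable_naming", 2.0, blocker=False),
]

clean_pr = score_solution(RUBRIC, {"fixes_bug", "stays_in_scope", "readable_naming"})
drive_by = score_solution(RUBRIC, {"fixes_bug", "type_annotations", "readable_naming"})
```

In this sketch the first solution passes with a partial score because it misses only a non-blocker, while the second scores zero despite fixing the bug, because it fails the scope blocker.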

The benchmark ships as three nested subsets of rising difficulty: Diamond is the hardest 50 tasks, Main the hardest 100 (Diamond included), and Extended the full 150. Each model runs 5 times at every available reasoning effort; the metric is averaged across the 5 trials per effort, and each model is reported at its best-performing reasoning level. The results: Diamond remains unsaturated, with the top performer, Claude Opus 4.8, reaching only 13.4%; GPT-5.5 scores 6.3%, Gemini 3.1 Pro 4.7%, and others lower. Cognition notes that GPT-5.5 uses up to 4x fewer tokens than Opus 4.8, hitting a better cost-intelligence tradeoff. On Main and Extended, Opus 4.8 keeps a clear lead at 34.3% and 51.8%. The gap to open source is large: Kimi K2.6, the best open model, manages 3.8% on Diamond, 16% on Main, 37% on Extended.
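The reporting protocol (five trials per reasoning effort, averaged, then the best effort level reported) amounts to the following. The effort labels and trial scores here are made up for illustration and are not FrontierCode data.

```python
from statistics import mean


def reported_score(trials_by_effort):
    """Average the trials at each reasoning effort, then report the best effort.

    `trials_by_effort` maps an effort label to that effort's list of trial scores.
    Returns (best_effort_label, mean_score_at_that_effort).
    """
    per_effort = {effort: mean(scores) for effort, scores in trials_by_effort.items()}
    best = max(per_effort, key=per_effort.get)
    return best, per_effort[best]


# Hypothetical model with two effort levels and five trials each (percent scores).
example_trials = {
    "low": [10, 12, 8, 10, 10],
    "high": [14, 12, 16, 12, 16],
}
best_effort, headline = reported_score(example_trials)
```

The headline number is therefore a maximum over per-effort averages, which is what the best-of-N criticism discussed later is aimed at.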

One comparison deserves to be pulled out on its own. By analyzing agent trajectories, Cognition reports FrontierCode has an 81% lower false-positive rate than SWE-Bench Pro, meaning cases where incomplete test coverage lets a wrong solution slip through. The opposite failure, false negatives (tests so rigid, checking exact error strings or function names, that they penalize a correct solution), is what the maintainer-designed rubric is meant to avoid. METR found earlier that models scoring high on older benchmarks often ship patches human maintainers would reject. If that 81% holds up, it is the real selling point: not “a harder set of problems” but “a score that tracks reality more closely.”

Why it matters

The judgment most worth keeping is this: mergeability is a compound property, and a pass rate is a scalar. Whether code reaches production was never one-dimensional. It has to run, not break existing behavior, clear lint and the build, ship tests that actually cover the intended behavior, stay inside its scope, and conform to the codebase’s design conventions and readability bar. Compressing all of that into a binary pass signal necessarily throws away information. FrontierCode instead breaks these axes apart, scores them separately, and uses the blocker mechanism to restore the reality that some flaws are disqualifying. That is more honest than simply making problems harder or patches larger, and Cognition is explicit that it scaled difficulty through quality rubrics rather than bigger diffs, so FrontierCode’s patches are smaller than DeepSWE’s yet harder to solve.

The second point is about inflated pass rates. When an eval only checks functional correctness, a model can learn to turn the tests green rather than write good code. Thin coverage makes this worse, because a hazardous solution that happens to fool the tests is genuinely rewarded. FrontierCode’s new mechanisms target exactly this. The reverse-classical criterion requires the agent’s own tests to fail when run against the original broken codebase: a deterministic check that the agent understood the problem rather than wrote an always-green no-op test. The scope check constrains “change only what you must” across file, size, and semantic dimensions, addressing the drive-by refactor that reviewers hate most. Adaptive grading uses a tool called mutagent to let an LLM surgically patch the test environment to match the agent’s implementation details, preserving deterministic rigor without failing a correct solution over surface differences like function names or error wording. Together these are a serious attempt at the old problem that green does not mean good.
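The reverse-classical idea is easy to show in miniature. The sketch below is not FrontierCode's implementation: it models a "codebase" as a single function and a "test" as a callable that raises AssertionError. It also assumes that at least one agent test must fail on the broken code and that all of them must pass on the fix; the source states only the fail-on-broken requirement.

```python
from typing import Callable, Iterable

Impl = Callable[[str], str]
Check = Callable[[Impl], None]  # raises AssertionError when the check fails


def passes(check: Check, impl: Impl) -> bool:
    """Run one check against one implementation."""
    try:
        check(impl)
    except AssertionError:
        return False
    return True


def reverse_classical_ok(checks: Iterable[Check], broken: Impl, fixed: Impl) -> bool:
    """Accept the agent's tests only if they detect the original bug."""
    checks = list(checks)
    if not checks:
        return False  # no tests at all proves nothing
    fails_on_broken = any(not passes(c, broken) for c in checks)
    passes_on_fixed = all(passes(c, fixed) for c in checks)  # assumed extra requirement
    return fails_on_broken and passes_on_fixed


# Hypothetical bug: the original slugify forgets to lowercase.
def broken_slugify(s: str) -> str:
    return s.replace(" ", "-")


def fixed_slugify(s: str) -> str:
    return s.lower().replace(" ", "-")


def meaningful_check(impl: Impl) -> None:
    assert impl("Hello World") == "hello-world"


def noop_check(impl: Impl) -> None:
    assert True  # always green, exercises nothing
```

Here `reverse_classical_ok([meaningful_check], broken_slugify, fixed_slugify)` accepts the test suite, while the same call with only `noop_check` rejects it, because a test that is green on the broken code cannot have captured the bug.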

The third point needs cold water. This setup formally invites a great deal of subjectivity into the eval. Cognition concedes that rubric design is inherently subjective and demands domain expertise: whether each criterion is a blocker or non-blocker, its weight, and its coverage all rest on maintainer judgment. They offset this with a heavy quality-control pipeline: the task author plays a lazy or adversarial programmer trying to pass with a deliberately wrong solution (guarding false positives), then writes a valid alternative solution to test whether the rubric is too rigid (guarding false negatives), with Devin enlisted to invent fresh ways to game the rubric; then come multi-round pod review, a final Cognition-researcher review, and random tasks the researchers solve themselves. The process is undeniably thorough. But thoroughness only moves the question of who defines good code from a single annotator up to a group of maintainers plus Cognition researchers; it does not dissolve it. It makes subjective judgment more consistent and traceable, not objective. That is real progress, but it should not be sold as “we finally measured code quality objectively.”

Builder impact

If you are choosing a coding agent for a team, FrontierCode’s scorecard is worth a look over SWE-bench pass rates, but look in the right place. The signal is not who ranks first (Opus 4.8 leads all three tiers); it is the absolute level. The strongest model reaches just 13.4% on Diamond, meaning that in genuinely high-standard production codebases, every frontier model today is still weak at producing PRs you can merge as-is. That is a different world from the high pass rates you see on lightweight tasks. It recalibrates expectations: agents writing runnable code is now table stakes, but writing code you would willingly own is far from solved, and the human-review gate is not going away soon.

Do not skip the cost-efficiency thread either. Cognition specifically flags that GPT-5.5 uses up to 4x fewer tokens than Opus 4.8, a better cost-intelligence tradeoff that comes at clearly lower quality (6.3% versus 13.4% on Diamond). For teams that will actually wire an agent into CI and pay per call across thousands of runs, quality-per-dollar often decides selection more than the top score. FrontierCode’s data explorer, which sits past the first figure, breaks out pass@5 and token, dollar, and step cost by reasoning level, which is more actionable than the headline number.
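A back-of-the-envelope version of that tradeoff, using only the numbers quoted in this piece, looks like this. It rests on strong assumptions: "up to 4x" is treated as exactly 4x, the token ratio is assumed to apply to the Diamond runs, and both models are assumed to cost the same per token, none of which the source confirms.

```python
def score_per_token_unit(score_pct: float, relative_tokens: float) -> float:
    """Diamond score earned per unit of relative token spend."""
    return score_pct / relative_tokens


# Diamond scores from the article; token spend relative to GPT-5.5 = 1.0.
# The 4.0 is the upper bound of "up to 4x" (an assumption, not a measurement).
opus_efficiency = score_per_token_unit(13.4, 4.0)
gpt_efficiency = score_per_token_unit(6.3, 1.0)

# Token ratio at which the two models would tie on score per token.
break_even_ratio = 13.4 / 6.3
```

Under these assumptions GPT-5.5 comes out ahead on score per token, and Opus 4.8 would need to spend less than roughly 2.1x the tokens to tie. Real per-token prices and the actual per-task token counts would change the picture, which is why the per-effort cost breakdown matters more than this rough ratio.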

Know the boundaries of this leaderboard, though. A point pressed repeatedly on Hacker News is harness sensitivity: the reported numbers run on “house” harnesses (codex with GPT, Claude Code with Opus), and a team member admits some models do better on non-house harnesses. The same model on a different scaffold can move a fair amount. That is not a flaw unique to FrontierCode, but it means you cannot lift its ranking straight onto your own toolchain. The safest use is to treat FrontierCode as a yardstick for how hard this problem really is, not as the final answer to which API to buy; that answer still has to be measured on your own codebase and your own scaffold.

What to ignore

Do not read “would you merge it” as an objective score. One Hacker News commenter argued bluntly that since we cannot agree on or measure code quality for humans, we should doubt measuring it for LLMs. That goes too far, since you do not need universal consensus to measure something and any chosen set of quality measures constitutes a benchmark, but its direction is right. FrontierCode measures whether this particular set of maintainers would merge into their particular repos, not some universal code quality. Swap the maintainers or the codebase and the blockers and weights change. Treating its score as a model’s general, absolute “code goodness” is a misuse.

Do not over-index on rank gaps or absentees either. Some on Hacker News questioned whether reporting each model at its best reasoning effort introduces a best-of-N multiple-comparisons bias, and others asked why there are no error bars (the team noted only 50 unique problems and pass@5 reporting, saying you would need 50+ runs for a credible interval). These are fair methodology critiques, but they barely touch your actual decision. They change rankings after the decimal point, not the big conclusion that even the strongest model only reaches the low teens on Diamond. Likewise the debates over why certain cheaper Chinese models were left out or what counts as a frontier model are coverage questions; let later versions settle them rather than calling it now.
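The best-of-N concern can be made concrete with a small simulation. This is a deliberately simplified model, not a reanalysis of FrontierCode: it assumes every effort level has the same true pass rate, treats each task-trial as an independent pass/fail coin flip (the real scores are weighted, and tasks differ in difficulty), and the 13% rate, three effort levels, and seed are illustrative choices.

```python
import random


def simulate_reported_scores(true_rate, n_tasks, n_trials, n_efforts, n_sims, seed=0):
    """Compare a single effort's estimate with the best-of-efforts estimate.

    Returns (mean single-effort estimate, mean best-effort estimate) over n_sims runs.
    """
    rng = random.Random(seed)
    n = n_tasks * n_trials  # task-trials behind each per-effort average
    single_total = 0.0
    best_total = 0.0
    for _ in range(n_sims):
        estimates = [
            sum(rng.random() < true_rate for _ in range(n)) / n
            for _ in range(n_efforts)
        ]
        single_total += estimates[0]   # what one fixed effort level would report
        best_total += max(estimates)   # what "best effort level" reports
    return single_total / n_sims, best_total / n_sims


# 50 tasks, 5 trials, 3 hypothetical effort levels, all with the same true rate.
single_mean, best_mean = simulate_reported_scores(
    true_rate=0.13, n_tasks=50, n_trials=5, n_efforts=3, n_sims=1000
)
```

In this toy setup the best-of-efforts average lands above the true rate even though no effort level is actually better, while the single-effort average stays close to it. The size of the gap depends on the assumed independence and number of effort levels, so it illustrates the direction of the bias rather than its magnitude for any real model.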

Research impact

For people who build evals, FrontierCode’s most useful contribution is its stance on subjectivity: not pretending to dodge it, but pinning subjective judgment down into reproducible criteria through adversarial testing, calibration, and multi-stage review. The reverse-classical trick in particular, proving a test is meaningful by requiring it to fail on the broken code, is a clean, deterministic, portable idea worth borrowing elsewhere. Also notable is its prompt design: task descriptions are deliberately written like a human request and run about a third the length of SWE-Bench Pro’s, forcing the agent to infer the maintainer’s intent rather than being spoon-fed an over-specified prompt. That reflects a judgment: frontier models no longer need so much hand-holding, and evals should stop overfeeding them.

But the design tradeoffs are where research caution belongs. Cognition says it will not publicly release the tasks for now to avoid contamination (a team member on Hacker News said they will release them later, slightly at odds with the blog, which leaves this point uncertain), opening the eval only to model creators. So in the near term, outside researchers cannot independently reproduce or audit its rubrics, and an eval that puts subjective judgment at its core is exactly the kind that most needs external audit. Treating it as one of the closest things we have to an engineering-realistic coding eval is reasonable; treating it as independently verified objective truth is premature.

Sources

No official primary source available; this analysis is based on reliable secondary reporting (named outlets, cross-confirmed).