What if Code Review Happened Before the Code Was Written?
Agents already write code faster than teams can review it. The way through is to move the critical decisions upstream, into specs that humans review before any code is written and acceptance criteria that agents can verify after.

AI agents can write code faster than any team can review it. If you haven’t hit that reality yet, you’re not pushing the models hard enough. Code review is the new rate limiter in the SDLC, and faster horses won’t fix it; code review needs its airplane.
We ran an experiment to test a different approach: what if the review happens before the code is written?
We implemented a medium-scoped software task with zero lines of manually written code, guided entirely by a spec. Then we measured what happened when the rubber met the road, that is, when that code hit the traditional review process.
But should we just blindly trust the agent to produce the right code?
There should be a trustworthy way of generating and reviewing code with agents, and we are betting on spec-driven verification. Its crucial ingredient is acceptance criteria: requirements developed and agreed upon by the team before any code is written. Meeting the acceptance criteria is a necessary and sufficient condition for merging the code with confidence.
This experiment was our first test of that loop: humans review the spec, an agent implements it, and a second agent verifies the output against the acceptance criteria.
The Scope of the Task
We decided to use what we considered a medium-scoped task: adding a hierarchical, per-repository configuration system for Runbooks.
It spanned the full stack, including a database model and migration for config history, a schema/validation layer, a resolution engine that merges repo and global configs, a GraphQL API (types + mutation) for reading and updating configs, integration of the resolved config into all runtime subsystems (sandbox provisioning, CI handling, code generation, chat processing, persona selection), and a frontend settings page with a config editor, change history view, and source indicators showing where each setting originates.
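The merge behavior at the heart of that resolution engine can be sketched in a few lines. This is an illustrative Python sketch, not the actual implementation; the names (`resolve_config`, `ResolvedSetting`) and the two-level global/repo hierarchy are assumptions drawn from the description above.

```python
from dataclasses import dataclass

@dataclass
class ResolvedSetting:
    value: object
    source: str  # "global" or "repo": feeds the UI's source indicators

def resolve_config(global_config: dict, repo_config: dict) -> dict:
    """Start from global defaults, then let repo-level settings override them,
    recording where each resolved value came from."""
    resolved = {key: ResolvedSetting(value, "global")
                for key, value in global_config.items()}
    for key, value in repo_config.items():
        resolved[key] = ResolvedSetting(value, "repo")
    return resolved

# Hypothetical settings, purely for illustration
global_config = {"timeout_s": 300, "persona": "default"}
repo_config = {"persona": "reviewer"}

config = resolve_config(global_config, repo_config)
# config["timeout_s"] -> ResolvedSetting(300, "global")
# config["persona"]   -> ResolvedSetting("reviewer", "repo")
```

Tracking the source alongside the value is what lets the settings page show, per field, whether the user is looking at a repo override or a global default.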
How We Guided the Agent
The only input we gave the agent was the spec. The spec was created by providing a scaffolding PRD to Claude Code and instructing it to ask clarifying questions, which the team answered synchronously.
The review of the generated spec went beyond the typical architectural back-and-forth. We engaged at two distinct levels.
The first was implementation-level detail: comments questioned specific UI component choices, flagged the need for input validation on string fields, and suggested performance strategies like Redis caching for session configuration lookups. These are the kinds of concerns that usually surface during code review, not spec review.
The second was scope and completeness: the team caught a missing UX requirement around displaying the current configuration state when users start or modify a session, debated whether the personas feature was well-defined enough to keep in scope, and flagged places where the spec text was out of sync with the actual design (e.g., referencing a table that didn’t exist).
Upstream Leverage
The Spec-Driven Development guides argue that an hour spent on planning saves ten hours of rework, and we agree. Every one of these spec comments was a code review comment that never had to be written.
An implementation detail like “validate this string field” is a one-line addition to a spec but a round-trip code review comment and patch after the fact. A scoping question like the personas discussion, if left unresolved, could lead to building a feature that gets ripped out or redesigned. Spec reviews front-load these decisions so that the agent writes the right code the first time.
We ended up with a spec that included detailed designs for the configuration schema, database design, and GraphQL API. The guidance on the front end consisted mostly of which elements should exist and what purpose they had. A wireframe illustration of how the UI should look was also created. The spec wasn’t just instructions for the agent, but also a contract that could be verified against: 14 acceptance criteria containing 65 checkable items were agreed upon.
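To make “checkable items” concrete, here is a hypothetical sketch of how a criterion decomposes into items a verifier agent can tick off one by one. The criterion text and structure are our invention for illustration; they are not taken from the actual spec.

```python
from dataclasses import dataclass, field

@dataclass
class AcceptanceCriterion:
    title: str
    items: list[str] = field(default_factory=list)  # individually checkable

# One invented criterion, broken into verifiable items
criteria = [
    AcceptanceCriterion(
        title="Repo config overrides global config",
        items=[
            "A repo-level setting takes precedence over the global default",
            "Unset repo settings fall back to the global value",
            "The settings page shows the source of each resolved value",
        ],
    ),
]

total_items = sum(len(c.items) for c in criteria)
print(total_items)  # 3
```

The granularity matters: a criterion like the one above is too coarse to verify in one pass, but each of its items maps to a yes/no check against the code.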
Where the Agent Hit Its Limits
We used Claude Code with Opus 4.5 to implement the spec. The agent decided to split the implementation into four phases: foundation, backend core, integration, and frontend. The total implementation took 5 hours.
The first two phases were clean: well-scoped tasks on a fresh context. By phase three, the agent ran out of context window space, auto-compacted, and instead of continuing where it left off, rewrote its own task list and restarted the phase. We believe this model limitation can be solved with bigger context windows and agent orchestration.
The whole implementation consisted of around 6k lines of code: 40% application code, 40% tests, and 20% auto-generated GraphQL files. Notably, the specification contained no explicit instruction to write tests, which suggests that writing tests was the agent’s own strategy for meeting the acceptance criteria.

Verification Phase
We used another agent to verify the 65 acceptance criteria items against the pull requests. It took six minutes and produced a structured report with file references and explanations for each item: 60 passed, 4 failed, and 1 was considered partial. A human doing the same verification would have needed hours. This is the scaling mechanism we’re looking for: if the spec is precise enough, verification can be automated.
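The structured report described above might look something like the following sketch. The `VerificationItem` fields and the example records are assumptions for illustration, not the verifier’s actual schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class VerificationItem:
    criterion: str     # the checkable item being verified
    status: str        # "pass" | "fail" | "partial"
    file_ref: str      # where in the PR the evidence lives
    explanation: str   # why the verifier reached this verdict

def tally(report: list[VerificationItem]) -> Counter:
    """Aggregate per-item verdicts into the headline pass/fail/partial counts."""
    return Counter(item.status for item in report)

# Invented example records
report = [
    VerificationItem("Config history is persisted", "pass",
                     "server/models/config_history.py", "Migration adds table"),
    VerificationItem("Invalid strings are rejected", "fail",
                     "server/schema/config.py", "No length validation found"),
]
print(tally(report))  # Counter({'pass': 1, 'fail': 1})
```

Because every item carries a file reference and an explanation, a human can spot-check the failures in minutes instead of re-verifying all 65 items by hand.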
The team then reviewed the generated pull requests; these are the most interesting findings:
- The team left an average of 10 comments per PR, which is reasonable for the scope of this project.
- Few bugs were found, with the major one being a stale editor state.
- Reviewers consistently pushed back on unnecessary abstractions, redundant helpers, and patterns that didn’t match codebase conventions.
The spec review caught design-level issues such as a missing UX requirement and an underspecified feature. These would have been expensive to fix post-implementation. Code review caught convention-level issues such as import placement, enum duplication, and naming patterns. These were cheap to fix but numerous; 13 of 38 reviewer comments were style issues.
OpenAI’s team observed a similar tendency in their AI development experiment: agents replicating patterns, even suboptimal ones. In our case, the problem went further. The agent didn’t just copy suboptimal patterns; it created new ones that contradicted existing conventions. A human developer would learn to avoid these errors soon after joining a team, but without explicit guidance, an agent keeps making them. Addressing this is a matter of engineering the harness, as Mitchell Hashimoto puts it.
What’s next?
This experiment worked well enough to change how we think about the problem. A single agent implemented a full-stack feature guided entirely by a spec, and a second agent verified the acceptance criteria in minutes.
We have yet to find out how far this scales. Our spec took two days of team time to write and review. That investment paid off for a medium-scoped, well-bounded feature, but we don’t know if the same approach works for ambiguous, exploratory work where the spec can’t carry the full intent.
We’re still learning how to write specs that are precise enough for agents to implement reliably but flexible enough that the team doesn’t spend more time specifying than they would have spent coding. The code review bottleneck won’t be solved by reading code faster.
Agents already write code faster than teams can review it, and that gap will only widen. The way through is to move the critical decisions upstream, into specs that humans review before any code is written and acceptance criteria that agents can verify after.









