
AI and Developer Productivity: Insights from a 100,000-Developer Stanford Study

Yegor shares insights from large-scale studies on developer output, why early AI productivity claims were overstated, and what engineering leaders should (and shouldn’t) measure when rolling out AI across the software development lifecycle.
January 15, 2026
AI
Hosted by: Ankit Jain, Co-founder at Aviator
Guest: Yegor Denisov-Blanch, Researcher

About Yegor Denisov-Blanch

Yegor helps software engineering teams make better decisions with data.

Currently, he is a researcher at Stanford University, where he studies the impact of AI on engineering productivity. Previously, Yegor led digital transformation at DHL and was a national champion in Olympic weightlifting.

Developer Output, not Developer Productivity

We usually avoid the word productivity and instead talk about developer output. The core hypothesis behind our work is that almost everything a software engineering team does eventually shows up in the source code in some form. Even activities that happen before or after coding—like design discussions or debugging—tend to leave traces in the codebase.

There are lots of existing metrics that sit upstream or downstream of code, such as surveys, DORA metrics, lines of code, commits, or pull requests. They’re useful in certain contexts, but they also leave a black box around what actually happens inside the code itself. Our goal was to open that box.

What we do differently is take actual code changes and have panels of senior engineers—people with real context on the repository—evaluate them. They assess things like how complex the problem was, how long it likely took to implement, and how maintainable or high-quality the change is. We then train machine-learning models to replicate those expert judgments at scale.
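To make that last step concrete, here is a minimal sketch of how panel ratings could be used to train a model that scores new code changes. The feature names, toy data, and model choice are illustrative assumptions for this page, not the study's actual pipeline.

```python
# Minimal sketch (hypothetical): train a model to replicate expert panel ratings
# of code changes. Features, data, and model choice are illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Each row is one code change: features derived from the diff, plus the consensus
# score a panel of senior engineers assigned to it (e.g., on a 1-10 output scale).
panel_ratings = pd.DataFrame({
    "lines_changed":      [12, 340, 55, 8, 120, 60],
    "files_touched":      [1, 9, 3, 1, 5, 2],
    "test_lines_changed": [4, 0, 20, 2, 30, 10],
    "expert_score":       [3.5, 6.0, 5.0, 2.0, 7.5, 4.5],
})

features = panel_ratings.drop(columns=["expert_score"])
labels = panel_ratings["expert_score"]

# Fit a simple regressor to approximate the panel's judgment...
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features, labels)

# ...then apply it to a code change the panel never reviewed.
new_change = pd.DataFrame(
    {"lines_changed": [75], "files_touched": [4], "test_lines_changed": [15]}
)
print("Predicted output score:", round(model.predict(new_change)[0], 2))
```

The point is only the shape of the approach: human experts provide the ground truth, and the model exists to extend their judgment across far more changes than a panel could ever review by hand.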

Measuring Before and After ChatGPT

We began the research just before the ChatGPT moment in late 2022, but for the first several months, AI tools didn’t show a clear productivity impact. Early versions of tools like GitHub Copilot were closer to advanced autocomplete than true assistants.

Because we integrate directly with Git history, we can run before-and-after analyses and also compare similar teams—one using AI and one not—using statistical techniques. Early on, the signal was drowned out by noise. That changed over time.
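The conversation doesn't name the specific statistical techniques, but a difference-in-differences comparison is one common way to frame this kind of paired before-and-after analysis. The sketch below uses made-up numbers purely to show the arithmetic.

```python
# Hypothetical difference-in-differences sketch for comparing similar teams.
# All numbers are made up to illustrate the calculation, not results from the study.
import pandas as pd

# Aggregate output scores for two comparable teams, before and after one adopts AI tooling.
observations = pd.DataFrame({
    "team":   ["ai", "ai", "control", "control"],
    "period": ["before", "after", "before", "after"],
    "output": [100.0, 118.0, 98.0, 103.0],
})

pivot = observations.pivot(index="team", columns="period", values="output")

# Change within each team, then the difference between those changes.
ai_change = pivot.loc["ai", "after"] - pivot.loc["ai", "before"]
control_change = pivot.loc["control", "after"] - pivot.loc["control", "before"]
did_estimate = ai_change - control_change

print(f"AI team change:      {ai_change:+.1f}")
print(f"Control team change: {control_change:+.1f}")
print(f"Difference-in-differences estimate: {did_estimate:+.1f}")
```

The control team absorbs the trends both teams share (hiring, deadlines, codebase churn), which is what separates this from a naive before-and-after comparison of a single team.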

Today, the median productivity lift we see from AI is around 10–15%, which is much lower than some widely cited numbers like the 60% gains reported in early industry studies.

But averages hide something important: teams that know how to use AI well often see 20–30% improvements, while teams that don’t may see almost nothing—or even negative effects.

AI Slows Down Developers in the Beginning

There’s a learning curve. For the first 30, 60, or even 100 hours of AI usage, many developers actually slow down.

It takes time to understand which tasks AI is good at and which ones it’s not. Once that mental model clicks, productivity starts to improve.

Another counterintuitive finding is where the gains show up. The biggest impact isn’t in writing code itself. It’s in the stages before and after coding—understanding the codebase, interpreting requirements, debugging, QA, and navigating complex systems. Those areas seem to offer much more low-hanging fruit for AI.

Measure Teams, not Individuals

If you’re a staff or principal engineer, much of your value comes from enabling others. That creates attribution problems, because Git typically assigns a single author to a change even if many people contributed.

Because of that, we strongly discourage using our models to evaluate individuals in isolation. They’re much more meaningful at the team level. Teams have different ratios of enablers to coders, different product maturity levels, and different constraints.

If a team is moving slowly, that doesn’t automatically mean the team is underperforming. It could be external dependencies, business constraints, or deliberate trade-offs between speed and quality. 

All our methodology can really tell you is what is happening—not why. Understanding the why requires digging deeper.

Who Are Ghost Engineers?

A ghost engineer is someone whose role is to write code full-time but whose output is extremely low, around 10–20% of the median or less. In our data, we found that roughly 9% of engineers fell into this category, with slightly higher numbers in remote settings.

It’s important to say this carefully. There’s likely sampling bias, because companies that work with us are often already worried about productivity. And I suspect you’d see similar patterns in other knowledge-work professions.

What’s striking isn’t that disengagement exists, but how invisible it often is. Many of the engineers we spoke to described a slow slide into disengagement—effort not being recognized, seeing others underperform without consequences, and gradually checking out. The common thread was a lack of transparency and feedback, not laziness.

Best Practices for Meaningful AI Adoption

Don’t just track AI usage; track outcomes. Counting prompts or tool adoption isn’t enough. How AI is used matters far more than how often it’s used.

Apply AI across the entire software development lifecycle, not just inside the IDE. Planning, specs, CI/CD, debugging—those areas often deliver more value than code generation alone.

Treat AI like a scientific experiment. Companies that successfully adopt AI form hypotheses, measure results, and stay disciplined about where AI helps and where it hurts. The challenge is that capabilities change every few months, so learning never really stops.

What we’re seeing in the data is a growing gap between teams that learn quickly and those that don’t. The “rich get richer” effect is real. Eventually, that gap will close once playbooks emerge, but I don’t expect that to happen anytime soon.

AI Is a Powerful Tool, but It’s Not Magic

My one message for engineering leaders trying to make sense of productivity metrics and AI would be not to over-index on any single metric or claim. AI is a powerful tool, but it’s not magic, and it’s certainly not evenly distributed yet. Metrics should be used to ask better questions, not to draw simplistic conclusions.

And maybe most importantly: slower isn’t always worse. Sometimes moving 30% slower is the right trade-off if it avoids costly failures later. Productivity only matters in the context of business outcomes.
