The honest assessment of AI coding assistance is that it helps on some tasks and introduces problems on others, and the industry has not been rigorous about distinguishing between the two. Vendor claims focus on velocity. Sceptical counter-claims focus on quality degradation. Both describe real phenomena, observed on different types of work.
Where AI assistance holds up in the data
The tasks where AI assistance consistently produces good results are reasonably well understood: boilerplate code with stable, well-specified requirements; test generation for functions with clear inputs and outputs; documentation and comments; converting code between formats or languages; straightforward function implementations in well-established patterns.
What these tasks share is that the problem is fully specified before the prompt is written, the model has strong training signal for the type of output required, and the developer can verify the output quickly against a clear spec. In these cases, the churn rates on AI-assisted commits tend to be close to human-authored rates.
Where the diff tells a different story
The tasks where AI assistance tends to underperform are those requiring context the model doesn’t have: complex refactors across multiple files, new abstractions that need to fit coherently into an existing architecture, code that interacts with systems the model cannot see, and anything where the right solution depends on implicit knowledge about the codebase and its history.
In these cases, AI assistance produces code that is locally plausible but globally wrong: it solves the stated problem without respecting the unstated constraints. The diff looks reasonable. The code passes review. The churn comes later, when the code turns out not to fit the way it needed to.
This is the pattern behind the 2× average churn rate for AI-assisted commits in GitClear’s research. The average is being pulled upward by the high-churn cases, which tend to cluster around exactly these kinds of complex, context-dependent tasks.
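To see how an average like that can arise, consider a hypothetical split. The numbers below are illustrative, not GitClear’s data: if most AI-assisted commits churn at the human baseline but a minority churns several times faster, the weighted average doubles even though the typical commit is unremarkable.

```python
# Illustrative arithmetic only -- hypothetical shares and rates, not GitClear's data.
# Suppose 80% of AI-assisted commits churn at the human baseline (1.0x)
# and the remaining 20% churn at 6x that rate.
shares_and_rates = [(0.8, 1.0), (0.2, 6.0)]

average = sum(share * rate for share, rate in shares_and_rates)
print(f"average churn multiple: {average:.1f}x")  # -> 2.0x
```

The practical consequence is that a 2× average is compatible with most AI-assisted commits being fine, which is why the task-level breakdown matters more than the headline number.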
The calibration question
The productive framing for developers is not whether to use AI assistance, but on which tasks to use it and at what level of oversight. A developer who uses AI assistance on simple, well-specified functions and reviews the output quickly is getting the velocity benefit with low quality risk. A developer who uses AI assistance on complex architectural work and reviews it with the same level of attention is taking on substantially more quality risk.
The data to answer this question for your specific work is in your git history. Developers whose AI-assisted commits churn at close to the rate of their human-authored commits are using the tools in a way that’s working. Developers whose AI-assisted commits churn significantly faster have useful information about where to apply more scrutiny, or where to rely on the tools less.
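A first approximation of this measurement needs no special tooling. The sketch below is a minimal, assumption-laden version: it assumes AI-assisted commits can be identified somehow (here, a purely hypothetical “[ai]” tag in the commit message), and it simplifies churn to the share of a commit’s added lines that no longer blame back to it at HEAD, rather than the time-windowed definition GitClear uses.

```python
#!/usr/bin/env python3
"""Minimal sketch: compare churn on AI-assisted vs human-authored commits.

Assumptions, loudly: the "[ai]" commit-message tag is a hypothetical
convention for marking AI-assisted commits, and "churn" is simplified to
the fraction of a commit's added lines that no longer blame back to it
at HEAD (GitClear's definition uses a time window instead).
"""
import subprocess
from collections import defaultdict


def git(*args: str) -> str:
    """Run a git command in the current repository and return its stdout."""
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout


def lines_added(sha: str) -> int:
    """Total lines added by a commit, via --numstat ('-' marks binary files)."""
    total = 0
    for row in git("show", "--numstat", "--format=", sha).splitlines():
        if not row.strip():
            continue
        added, _deleted, _path = row.split("\t", 2)
        if added != "-":
            total += int(added)
    return total


def surviving_lines(sha: str) -> int:
    """Lines at HEAD that still blame to `sha`, across the files it touched."""
    count = 0
    for path in git("show", "--name-only", "--format=", sha).splitlines():
        if not path:
            continue
        try:
            blame = git("blame", "--line-porcelain", "HEAD", "--", path)
        except subprocess.CalledProcessError:
            continue  # file deleted or renamed since; its lines count as churned
        # --line-porcelain repeats the full 40-char SHA header for every line.
        count += sum(1 for line in blame.splitlines() if line.startswith(sha))
    return count


# Partition the last 90 days of commits by the hypothetical "[ai]" tag
# and compare mean churn (1 - survival rate) between the two groups.
rates = defaultdict(list)
for row in git("log", "--since=90 days ago", "--format=%H\t%s").splitlines():
    sha, subject = row.split("\t", 1)
    added = lines_added(sha)
    if added:
        rates["ai" if "[ai]" in subject else "human"].append(
            1 - surviving_lines(sha) / added
        )

for group, values in sorted(rates.items()):
    print(f"{group}: mean churn {sum(values) / len(values):.0%} "
          f"over {len(values)} commits")
```

Run it from the repository root. The brute-force blame pass is slow on repos of any size, and the “[ai]” tag is a stand-in for whatever signal you actually have (tool telemetry, commit trailers, pairing notes). The point is that the comparison is answerable from data already sitting in your history.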
Why the binary framing doesn’t help
The “AI is good” versus “AI is bad” debate has been running for two years and hasn’t settled because it’s asking the wrong question. The data shows that AI assistance produces good results on specific types of tasks and worse results on others, that the quality outcomes vary significantly across teams and individuals, and that the teams getting good results are the ones with visibility into which category their use falls into.
Treating AI assistance as a single thing to be adopted or rejected doesn’t account for any of this variation. Treating it as a set of tools with different risk profiles on different types of tasks, and measuring the results, does.
Scryable shows you where your AI-assisted commits are holding up and where they’re churning, compared to your own human-authored baseline. Get early access.