Ask an engineering manager how they know their AI tools are working. The answer usually comes in one of three forms.

Developers say they’re faster. The team is shipping more features. The productivity numbers look good.

All three are real observations. None of them is a measurement of AI impact.

What gut feel is actually measuring

Developer self-report on AI productivity is measuring something real: the subjective experience of reduced friction in the short loop of writing code. When a tool autocompletes a function correctly, when it rewrites a block on the first pass, when it scaffolds a component that would have taken an hour — those experiences are genuine. They reduce cognitive load. They make certain tasks faster.

What self-report does not capture is what happens to that code after it is written. Whether it churns. Whether it introduces a subtle issue that surfaces in review two weeks later. Whether the function it generated is a near-duplicate of three others already in the repository.

The developer who wrote the code is not tracking those downstream events and connecting them back to the tool session that produced the code. Neither is the standup. Neither is the velocity dashboard.

The gap in the data

Two numbers from recent research sit in uncomfortable proximity to each other.

63% of developers now use AI coding tools weekly. Trust in AI-generated code dropped to 29% in 2025.

Those figures describe the same population. Developers are using tools they do not fully trust, at scale, because the tools are useful in the moment even when the output requires careful review. That is a rational position — the time savings are real, and good review processes should catch the problems.

What it means in practice: a portion of AI-generated code is going through review with appropriate scrutiny, and a portion is moving more quickly than it should. Both are happening simultaneously in most engineering teams. Neither is visible from a productivity metric.

What the code-level research shows

The longitudinal data from GitClear, covering a large sample of repositories over several years, is the most comprehensive published research on AI’s effect on code quality at scale. Two findings are particularly relevant to evaluation.

AI-assisted commits churn at approximately twice the rate of human-authored commits. Churn here means lines committed and then rewritten or deleted within a short window: code that did not survive contact with the codebase. In practical terms, a line of AI-assisted code is roughly twice as likely as a line of human-authored code to be rewritten or removed shortly after it lands.

Duplicate code blocks have increased four-fold in high-adoption codebases. AI generation tends toward pattern completion rather than abstraction, which means similar functions get written in multiple places rather than being factored into shared utilities.

Both are lagging signals. They accumulate over months. They do not appear in a sprint velocity chart. By the time they are visible as a maintenance burden, they have often been compounding for a significant period without being attributed to the AI workflow that produced them.
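Churn in this sense is recoverable from an ordinary git history. What follows is a minimal sketch, not GitClear's own methodology: it treats a line as churned when text added by one commit is deleted again by a later commit within a fixed window, matching on line content. The repository path, the lookback date, and the fourteen-day window are assumptions to adjust; splitting the result by adoption date or by author is a straightforward extension.

```python
"""Rough churn estimate: the share of added lines that are deleted again
within a short window. A content-matching proxy, not GitClear's method."""

import subprocess
from collections import Counter
from datetime import datetime, timedelta

REPO = "."            # assumption: run against a local checkout
SINCE = "2024-01-01"  # assumption: how far back to look
WINDOW_DAYS = 14      # the "short window" after which a rewrite stops counting as churn


def run_git(*args: str) -> str:
    return subprocess.run(
        ["git", "-C", REPO, *args], capture_output=True, text=True, check=True
    ).stdout


def commits() -> list[tuple[str, datetime]]:
    """(sha, committer date) pairs, oldest first, merges excluded."""
    out = run_git("log", "--reverse", "--no-merges", f"--since={SINCE}", "--format=%H %cI")
    return [
        (sha, datetime.fromisoformat(stamp))
        for sha, stamp in (line.split(" ", 1) for line in out.splitlines())
    ]


def added_and_removed(sha: str) -> tuple[Counter, Counter]:
    """Multisets of non-blank line contents a commit adds and removes."""
    diff = run_git("show", "--format=", "--unified=0", sha)
    added: Counter = Counter()
    removed: Counter = Counter()
    for line in diff.splitlines():
        text = line[1:].strip()
        if not text:
            continue
        if line.startswith("+") and not line.startswith("+++"):
            added[text] += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed[text] += 1
    return added, removed


def churn_ratio() -> float:
    history = commits()
    diffs = {sha: added_and_removed(sha) for sha, _ in history}
    total_added = churned = 0
    for i, (sha, when) in enumerate(history):
        added = diffs[sha][0].copy()
        total_added += sum(added.values())
        cutoff = when + timedelta(days=WINDOW_DAYS)
        # Quadratic in the worst case and assumes roughly chronological history;
        # acceptable for a sketch, not for a large monorepo.
        for later_sha, later_when in history[i + 1:]:
            if later_when > cutoff:
                break
            overlap = added & diffs[later_sha][1]   # lines added here, removed there
            churned += sum(overlap.values())
            added -= overlap                        # count each added line at most once
    return churned / total_added if total_added else 0.0


if __name__ == "__main__":
    print(f"share of added lines rewritten within {WINDOW_DAYS} days: {churn_ratio():.1%}")
```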

Why the current evaluation methods miss this

Velocity measurement captures output before it is tested against the codebase. A commit is counted when it is made, not when it survives its first refactor or turns out to duplicate existing functionality. Feature completion tracks delivery to a defined spec, not whether the implementation added debt that will constrain future work.

None of that is a criticism of those metrics. They measure what they are designed to measure. The problem is that they are being used to answer a question they were not designed for: whether AI tools are improving the quality, not just the volume, of engineering output.

Quality lives in different numbers. Churn ratio. Net lines added. Duplicate code density. The trajectory of those metrics before and after AI adoption, compared against a consistent baseline. Those are the measurements that can actually answer the question.
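Of those, duplicate code density is the least standard, but even a crude version is mechanical to compute. The sketch below, under stated assumptions, hashes normalized windows of consecutive lines across the tracked files and reports what fraction of those windows appear more than once. The window size and the file extensions are placeholders to tune; a proper clone detector would be more precise, but a crude number tracked consistently is enough to watch the trajectory.

```python
"""Rough duplicate-block density: the fraction of normalized W-line windows
whose content appears more than once across the repository. A crude proxy
for copy-paste duplication, not a real clone detector."""

import hashlib
import subprocess
from collections import Counter
from pathlib import Path

REPO = Path(".")                              # assumption: run from a checkout
WINDOW = 6                                    # block size in lines; tune for your codebase
EXTENSIONS = {".py", ".ts", ".go", ".java"}   # assumption: the languages you care about


def tracked_files() -> list[Path]:
    out = subprocess.run(
        ["git", "-C", str(REPO), "ls-files"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [REPO / p for p in out.splitlines() if Path(p).suffix in EXTENSIONS]


def normalized_lines(path: Path) -> list[str]:
    """Strip whitespace and drop blank lines so formatting does not hide duplicates."""
    return [
        line.strip()
        for line in path.read_text(errors="ignore").splitlines()
        if line.strip()
    ]


def duplicate_density() -> float:
    counts: Counter = Counter()
    for path in tracked_files():
        lines = normalized_lines(path)
        for i in range(len(lines) - WINDOW + 1):
            block = "\n".join(lines[i : i + WINDOW])
            counts[hashlib.sha1(block.encode()).hexdigest()] += 1
    total = sum(counts.values())
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / total if total else 0.0


if __name__ == "__main__":
    print(f"duplicate-block density: {duplicate_density():.1%}")
```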

What measurement requires

It does not require a new instrumentation process, opt-in from the team, or changes to how developers work.

It requires reading the git history with intention.

Your pre-AI baseline is already in your repository. Every commit before your team adopted AI tools has a timestamp, an author, and a set of file changes. The churn data is recoverable from those commits. The net lines trajectory is there. The velocity pattern before and after adoption can be read directly from the history.

What most teams lack is a tool that reads that history with these specific questions in mind — drawing the line at the adoption date and comparing the signals that matter across both sides of it.
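Even a short script can draw that line. The sketch below, assuming a local checkout and an adoption date supplied by you, reads git log --numstat and reports lines added, lines deleted, net lines, and a deletions-per-addition ratio for the periods before and after that date. It is deliberately crude; the point is that the comparison needs nothing beyond the history already on disk.

```python
"""Before/after comparison of simple git-history signals around an AI
adoption date. A minimal sketch; the date is an assumption you supply."""

import subprocess

REPO = "."
ADOPTION_DATE = "2024-03-01"   # assumption: when the team started using AI tools


def numstat_totals(since: str, until: str) -> tuple[int, int]:
    """Total lines added and deleted across all commits in a date range."""
    cmd = ["git", "-C", REPO, "log", "--numstat", "--format="]
    if since:
        cmd.append(f"--since={since}")
    if until:
        cmd.append(f"--until={until}")
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    added = deleted = 0
    for line in out.splitlines():
        parts = line.split("\t")
        # numstat rows look like "12\t3\tpath"; binary files show "-" and are skipped
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            added += int(parts[0])
            deleted += int(parts[1])
    return added, deleted


def report(label: str, since: str, until: str) -> None:
    added, deleted = numstat_totals(since, until)
    ratio = deleted / added if added else 0.0
    print(f"{label:>6}: +{added} -{deleted}  net {added - deleted:+}  "
          f"deletions per added line {ratio:.2f}")


if __name__ == "__main__":
    report("before", "", ADOPTION_DATE)
    report("after", ADOPTION_DATE, "")
```

A ratio that climbs after the adoption date proves nothing on its own, but it is exactly the kind of signal a velocity chart will never surface.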

The teams that are getting this right

The research on teams showing the strongest AI productivity results — 40–60% velocity increases with stable or improving quality signals — points to a common factor. They know specifically which task types and workflows are a good fit for AI generation, and they have adjusted accordingly. They are not running at maximum AI usage across all work; they are running at calibrated usage based on where the tools produce durable output.

That calibration requires knowing the difference between productive AI use and high-volume, low-survival AI use. The only way to know the difference is to measure both sides of it.

Gut feel is not wrong. It is just not measurement. The teams that are running on it are making resource decisions — about which tools to license, how to structure review processes, how to think about technical debt — with genuinely incomplete information.

The information needed to do better is in the git history. It has been there the whole time.