In the high-stakes world of software development, managers and executives often turn to quantifiable metrics like commits, lines of code, and velocity points to gauge developer productivity. Yet these "flawed five" metrics fundamentally miss the mark, incentivizing poor practices and ignoring the collaborative, outcome-driven nature of modern engineering.
This overreliance on output-focused numbers creates a false sense of progress, encourages gaming of workflows, and undermines team effectiveness. As teams chase higher counts, they produce bloated codebases, inflated estimates, and neglected quality, ultimately slowing the delivery of real value. In this article, we'll dissect why these metrics fail, explore their perverse incentives, and outline better alternatives grounded in industry consensus.
The Flawed Five: Metrics That Measure Effort, Not Impact
Industry experts have long criticized a core set of productivity proxies. Dubbed the "flawed five" by engineers like Jaana Dogan, they are commits, lines of code, pull requests, velocity points, and vague notions of code impact. Each promises simplicity but delivers distortion.
1. Commits: Quantity Over Quality
Counting commits per day seems harmless: more activity must mean more productivity, right? Wrong. A single commit can represent a massive refactor or a trivial fix, and because developers often squash commits before pushing, counts are arbitrary and incomparable across individuals.
Unique commit patterns further invalidate this metric. One developer might batch changes into fewer, thoughtful commits; another scatters tiny ones. As Dogan notes,
"The number of commits doesn’t tell you anything about the size, value, or quality of those commits."
Result? Teams game the system with micro-commits, wasting time on process over product.
2. Lines of Code: Rewarding Bloat
Lines of code (LOC) has haunted metrics for decades, despite rebukes from luminaries like Martin Fowler and Bill Atkinson. More lines don't equal more value: efficient code is concise, readable, and maintainable.
Refactoring, a hallmark of good engineering, often reduces LOC, so the metric punishes the very practices that prevent technical debt. Developers can inflate counts with verbose comments or duplicated logic, but this breeds unmaintainable spaghetti code.
"More lines of code does not equate to more value delivered," warns DX's analysis.
IBM's research echoes this: early defect detection via reviews trumps sheer volume.
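To make the LOC problem concrete, here is a minimal sketch in Python. The function `total_active_price`, the sample `items`, and the naive `count_loc` counter are all hypothetical, invented for illustration; the point is only that two implementations with identical behavior can differ widely in line count.

```python
# Two hypothetical implementations of the same behavior, kept as source
# strings so a naive LOC counter can be run over them.
VERBOSE = '''
def total_active_price(items):
    total = 0
    for item in items:
        if item["active"] == True:
            price = item["price"]
            total = total + price
        else:
            continue
    return total
'''

CONCISE = '''
def total_active_price(items):
    return sum(item["price"] for item in items if item["active"])
'''


def count_loc(source: str) -> int:
    """Count non-blank lines, roughly what a naive LOC dashboard would do."""
    return sum(1 for line in source.splitlines() if line.strip())


def load(source: str):
    """Execute the source string and return the function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["total_active_price"]


items = [
    {"price": 10, "active": True},
    {"price": 5, "active": False},
    {"price": 7, "active": True},
]

verbose_fn = load(VERBOSE)
concise_fn = load(CONCISE)

# Same result for the same input, very different "productivity" score.
print("verbose:", count_loc(VERBOSE), "lines, result", verbose_fn(items))
print("concise:", count_loc(CONCISE), "lines, result", concise_fn(items))
```

Under a LOC-based dashboard, the developer who refactors the first version into the second would appear to have removed most of their own contribution.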
3. Pull Requests: Volume Without Value
Number of pull requests (PRs) feels modern but shares LOC's flaws. A tiny hotfix counts the same as a feature overhaul. Like commits, PRs ignore size, complexity, or business impact.
This metric discourages large, meaningful changes in favor of bite-sized PRs, fragmenting work and slowing integration. Developers "hate it," per LeadDev, because it prioritizes quantity over thoughtful collaboration.
4. Velocity Points: Estimates Turned Toxic
Velocity points from Agile tools like Jira estimate effort pre-work, not delivered value. Using them post-sprint to rank productivity backfires spectacularly.
Estimates are inherently inaccurate, and tying rewards to points encourages inflation: teams assign higher values to "complete" more, rendering planning useless. Allen Holub laments how Jira's sprint points "destroy the careers of highly productive people who... [tackle] very hard problems." Velocity measures busyness, not outcomes.
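The inflation effect is simple arithmetic. The sketch below assumes a hypothetical sprint of three stories and a made-up inflation factor of 1.5; none of these numbers come from a real team.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    name: str
    honest_points: int  # the estimate the team would give with no incentive


def velocity(stories, inflation=1.0):
    """Sum story points after scaling each estimate by an inflation factor."""
    return sum(math.ceil(s.honest_points * inflation) for s in stories)


# Hypothetical sprint: the same three stories ship in both scenarios.
sprint = [Story("login form", 3), Story("export to CSV", 5), Story("billing fix", 8)]

honest = velocity(sprint)          # 3 + 5 + 8
inflated = velocity(sprint, 1.5)   # 5 + 8 + 12

print(f"stories shipped: {len(sprint)} in both cases")
print(f"honest velocity: {honest}, inflated velocity: {inflated}")
```

Reported velocity rises by more than half while the amount of working software delivered is unchanged, which is why the number stops being useful for planning once it is tied to rewards.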
5. Code Impact: Snake Oil in Disguise
Code impact tools analyze changes by lines added or deleted, edit proximity, and change type, claiming to quantify influence. But as any coder knows, this lacks legitimacy.
Refactors or deletions (often positive) score low, while churn racks up "impact." The name alone invites misuse, signaling managers to rank individuals on meaningless math. Developers resent it, and it fails to correlate with business results.
These metrics share a fatal flaw: they track output (effort exerted) rather than outcomes (value created). As Swarmia observes, such metrics
"don’t only fail to capture what matters, but actively make things worse by creating incentives for gaming the system."
Why These Metrics Backfire: Perverse Incentives and Systemic Blind Spots
Beyond inaccuracy, flawed metrics distort behavior. Rewarding individuals breaks collaboration: software is a team sport where reviews, integrations, and handoffs define success.
- Gaming the system: Developers pad LOC with whitespace, split PRs artificially, or inflate points.
- Short-termism: Focus on counts neglects testing, refactoring, or automation investments that yield long-term gains.
- Individual vs. team: Metrics like McKinsey's Developer Velocity Index (DVI) or contribution analysis target solo stars, ignoring how "the unit of measurement that actually matters is the team."
- Measurement changes behavior: As Pragmatic Engineer notes,
"The act of measurement changes how developers work, as they try to 'game' the system."
Customers and executives care about profitability, not commit tallies.
Even "advanced" frameworks falter. McKinsey's benchmarks measure effort (e.g., surveys, backlog contributions), not downstream impact like revenue. Google's technical debt study tested 117 metrics, and none worked reliably, suggesting that some concepts demand human judgment.
Better Alternatives: Focus on Systems, Flow, and Humans
Ditch output proxies for system-level metrics that reveal bottlenecks and outcomes. Prioritize the four DORA metrics that distinguish elite performers: high deployment frequency, short lead time for changes, low change failure rate, and quick mean time to recovery. These correlate with business speed without inviting the same gaming.
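As a rough sketch of how the four DORA measures could be derived from deployment records, consider the Python below. The `Deployment` record, its field names, and the sample data are assumptions made for illustration; real tooling would pull these timestamps from a CI/CD system and an incident tracker, and DORA's own definitions have more nuance than this.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments, period_days):
    """Compute team-level DORA-style numbers for one reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery_hours": (
            sum(restore_hours) / len(restore_hours) if restore_hours else None
        ),
    }


# Hypothetical week of deployments for one team.
week = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 15)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 3, 9)),
    Deployment(datetime(2024, 1, 4, 10), datetime(2024, 1, 4, 12),
               failed=True, restored_at=datetime(2024, 1, 4, 13)),
    Deployment(datetime(2024, 1, 5, 8), datetime(2024, 1, 5, 18)),
]

summary = dora_summary(week, period_days=7)
for name, value in summary.items():
    print(name, value)
```

Note that nothing in the record identifies an individual developer. The numbers describe how the delivery system behaves, which is the level at which these metrics are meant to be read.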
Key Recommendations
- Cycle Time and Flow Efficiency: Track total delivery time vs. active work. If a feature takes weeks but only days of effort, fix waits and handoffs.
- Team Delivery Metrics: Measure working software shipped, not points.
- SPACE Framework: Blend Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow, using both surveys and system data. It explicitly warns against LOC-style pitfalls.
- Human-Centric Insights: System metrics like CI times need cleaning; surveys capture nuance (e.g., technical debt).
"Data collected from humans can be... objective," per Fowler.
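The flow efficiency idea from the first recommendation reduces to a ratio of active work time to total elapsed time. The sketch below assumes a hypothetical feature broken into stages with made-up durations; the `Stage` record and stage names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    days: float
    active: bool  # True if someone was working on it, False if it sat waiting


def flow_efficiency(stages):
    """Active time divided by total elapsed time, as a fraction from 0 to 1."""
    total = sum(s.days for s in stages)
    if total <= 0:
        raise ValueError("total elapsed time must be positive")
    active = sum(s.days for s in stages if s.active)
    return active / total


def longest_wait(stages):
    """Return the waiting stage that consumed the most time, or None."""
    waits = [s for s in stages if not s.active]
    return max(waits, key=lambda s: s.days) if waits else None


# Hypothetical feature: 15 days end to end, only 3 of them active work.
feature = [
    Stage("coding", 2, active=True),
    Stage("waiting for review", 4, active=False),
    Stage("review and fixes", 1, active=True),
    Stage("waiting for release window", 8, active=False),
]

efficiency = flow_efficiency(feature)
bottleneck = longest_wait(feature)
print(f"flow efficiency: {efficiency:.0%}")
print(f"largest wait: {bottleneck.name} ({bottleneck.days} days)")
```

In this made-up case only a fifth of the elapsed time is active work, so the biggest gain would come from shortening the release wait rather than from anyone coding faster.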
| Flawed Metric | Core Problem | Better Alternative |
|---|---|---|
| Commits/PRs | Ignores value/quality | Deployment frequency |
| Lines of Code | Rewards bloat | Cycle time |
| Velocity Points | Inflated estimates | Flow efficiency |
| Code Impact | Arbitrary math | DORA recovery time |
Slow down to speed up: allocate time for automation and refactoring, but align org-wide (product, sales, CEO). Tools like Swarmia or Axify help surface flow issues without individual blame.
The Bigger Picture: Productivity as a Team and Organizational Property
Developer productivity thrives in aligned systems, not siloed scores. Frustrations like broken deploys or slow reviews are fixable; cultural misalignments demand leadership buy-in.
McKinsey-style indexes risk being imposed from above if engineering leaders dodge measurement; it is better to own DORA and SPACE proactively. As Fowler emphasizes, no perfect metric exists; blend data with human insight for validity.
In 2026, with AI tools accelerating code gen, output metrics grow even more obsolete. Focus on leverage: how devs amplify business outcomes via efficient teams.
Engineering leaders: audit your dashboards. Replace the flawed five with flow and outcome metrics. Your velocity might dip initially, but sustainable speed follows. Teams deliver when measured holistically, not hustled into hollow highs.