What Pensero Is
Pensero is an AI-native engineering performance platform. It automatically collects data from your engineering tools—Git providers, ticketing systems, docs, AI coding tools, coverage, and calendars—and turns it into real-time, explained metrics across your whole organization. No manual data gathering, no spreadsheet wrangling.
Where most tools show one dimension (commits, or tickets, or cycle time), Pensero integrates them into a single picture and explains what changed and why, rather than leaving you to investigate across five dashboards.
The questions it answers
Every metric maps to a leadership conversation. The ones you’ll reach for most:
Conversation | Metrics that drive it |
Are we delivering enough? | Delivery per HC, Active headcount |
Is code quality declining? | Defect rate, Rework rate, Code coverage |
Why are projects taking so long? | Cycle time, Time to merge, Waste rate |
Is our AI investment working? | AI-assisted %, Delivery lift, Tokens per delivery |
Are we too siloed? | Collaboration ratio, Knowledge gaps, PR pairing rate |
Are we strategic or reactive? | Roadmap alignment, New stuff %, KTLO % |
Do we have succession risks? | Knowledge gaps, Talent density |
How do we compare to others? | Benchmark percentiles across all metrics |
Reading metrics in context
Numbers without context mislead. The same figure can be healthy or alarming depending on the team—a team at 30% KTLO work may be fine as a platform team, concerning as a product team, and critical at an early-stage startup. Pensero supplies that context through benchmarks by company stage and team type, historical trends, and pattern recognition. The What Pensero Measures section that follows spells out the ranges and the healthy-vs-warning combinations to watch.
Principle: data informs decisions; it doesn’t make them. Use it to drive supportive conversations and systemic improvements, never to police individuals.
Navigating Pensero
Three controls do most of the work: the date range (what period), the scope filter (which people), and the navigation menu (which metric). Set the first two and every page obeys them. This section is the how-to for getting to the right view fast.
Date range
The selector sits at the top of every metrics page: arrows step backward and forward by the current period, and the center dropdown changes the period type—Week (Mon–Sun), Month, Quarter, or an organizational Cycle if configured. For ad-hoc windows, pick “Days to date” (a rolling last-N-days), “Months to date,” or a custom start/end (up to 365 days). It persists as you move between pages, survives refreshes, and lives in the URL—so you can bookmark or share a specific period. All dates use your org’s timezone, and most charts automatically show “vs. previous period” (the same-length window immediately before).
“Week” vs. “Days to date: 7”: Week is a fixed Mon–Sun you can step between; Days-to-date is a rolling last-7-days that always tracks today. Use Week to compare week over week, Days-to-date for an always-current snapshot.
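The difference between the two period types, and the “vs. previous period” window, can be sketched in a few lines. This is a minimal illustration rather than Pensero’s implementation: the function names are hypothetical, and it assumes the rolling window includes today.

```python
from datetime import date, timedelta

def fixed_week(anchor: date) -> tuple[date, date]:
    """Return the Mon-Sun week that contains `anchor`."""
    monday = anchor - timedelta(days=anchor.weekday())  # weekday(): Monday == 0
    return monday, monday + timedelta(days=6)

def days_to_date(today: date, n: int) -> tuple[date, date]:
    """Return a rolling window of the last `n` days, ending on `today` (assumed inclusive)."""
    return today - timedelta(days=n - 1), today

def previous_period(start: date, end: date) -> tuple[date, date]:
    """Return the same-length window immediately before [start, end]."""
    length = (end - start).days + 1
    return start - timedelta(days=length), start - timedelta(days=1)

today = date(2024, 5, 15)                 # a Wednesday
week = fixed_week(today)                  # 2024-05-13 .. 2024-05-19
rolling = days_to_date(today, 7)          # 2024-05-09 .. 2024-05-15
comparison = previous_period(*rolling)    # 2024-05-02 .. 2024-05-08
```

The fixed week stays put until you step to the next one, while the rolling window shifts by a day every day, which is why the two can show different numbers mid-week.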
Scope filter — managers and executives
The filter (funnel icon, left nav) scopes all content to specific people. Managers see their reporting line; executives see the whole org; individual contributors don’t have it and see only their own data. Open it and pick from four tabs—People, Teams, Cohorts, or a unified All search—then Apply, and every chart and table updates. The button reads “All N people” when clear and “N people matched” when active.
The filter persists across pages and sessions (stored ~30 days, per user, per device—so it doesn’t sync across machines and resets if you clear your browser or use incognito). “Clear selection” inside the modal removes it. One sharing caveat worth knowing: the date range travels in the URL, but the filter does not—to give a colleague the same scoped view, save the cohort as Organization-scoped and share its name.
Cohorts — reusable groups
A cohort is a saved group defined by rules—“Senior ICs,” “Backend engineers on product teams,” “new hires under six months.” Build it once and it re-evaluates automatically as people’s attributes change. In the Cohorts tab, “Add cohort” opens a builder: conditions within a rule group are AND-ed (role = IC and level in senior/staff), and separate rule groups are OR-ed (one group or another). You can also always-include or always-exclude named people regardless of rules. Attributes include people, teams, role, level, technology, location, tenure, employment type, and work mode. A live preview shows who matches as you build.
Save it as Personal (only you) or Organization (everyone can apply it, read-only unless they own it). Cohorts are the time-saver for anything recurring—two minutes to build, reused across every review thereafter.
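To make the AND/OR logic concrete, here is a small sketch of how a cohort could be evaluated. The `Cohort` class, the person fields, and the choice that always-exclude wins over always-include are all assumptions for illustration; the product’s internal model is not documented here.

```python
from dataclasses import dataclass, field
from typing import Callable

Person = dict                        # hypothetical shape: {"name", "role", "level", ...}
Condition = Callable[[Person], bool]

@dataclass
class Cohort:
    rule_groups: list[list[Condition]]                  # groups are OR-ed, conditions inside AND-ed
    always_include: set[str] = field(default_factory=set)
    always_exclude: set[str] = field(default_factory=set)

    def matches(self, person: Person) -> bool:
        name = person["name"]
        if name in self.always_exclude:                 # assumption: exclusion wins
            return False
        if name in self.always_include:
            return True
        return any(all(cond(person) for cond in group) for group in self.rule_groups)

senior_ics = Cohort(
    rule_groups=[[lambda p: p["role"] == "IC", lambda p: p["level"] in {"senior", "staff"}]],
    always_include={"Dana"},
)

people = [
    {"name": "Alice", "role": "IC", "level": "senior"},
    {"name": "Bob", "role": "IC", "level": "junior"},
    {"name": "Dana", "role": "Manager", "level": "senior"},
]
members = [p["name"] for p in people if senior_ics.matches(p)]
print(members)  # ['Alice', 'Dana']
```

Because membership is computed from attributes each time, promoting Bob to senior would add him to the cohort without anyone editing it.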
Finding your way around
The left sidebar groups pages into Intelligence (Signals, Delivery, Quality, Efficiency, Reviews, AI Impact, Scope of Work, Talent), Organization (People, Work, Calibrate for execs, Benchmark), and Settings. Most metric pages share a shape: a Summary tab with hero cards (a headline number, a trend arrow, and “vs. previous period”) and charts, a People tab with the per-person table, and detail tabs for specific work types (PRs, tickets, documents). Breadcrumbs at the top (e.g. Delivery › People › Alice) click back to any level.
Reading the visuals: a green trend arrow means moving in the good direction for that metric, red the bad, gray no change—so “down” is green for defect rate. Quadrant scatter plots put one metric on each axis with a dot per person; top-right is strong on both, bottom-left needs a look, the off-diagonals are trade-offs. Hover any chart for exact values; click a dot to jump to that person.
The People table
The People table is the per-person view of every metric, reachable from each section’s People tab and from Organization › People. It respects the active filter and date range, so it always reflects the group and period you’ve set. Sort any column to find the top or bottom of a metric, search by name, click a row to drill into that person’s detail page, or export to CSV (top-right) for spreadsheet analysis—the export includes every column, even those hidden in the UI, plus an export timestamp.
Which columns appear, the default sort, and available groupings are set org-wide in Settings › Organization settings, not per person. Managers see their reporting line; executives see everyone; an IC sees only themselves.
ICs get their own self-view: the same metrics about their own work, plus an anonymized comparison—percentile standing against their level, team, or company (“higher than 73% of level-3 engineers”) and quadrant charts where only their own dot is named. They cannot see other individuals’ data or the People table. Worth knowing when an engineer references their own numbers, though it isn’t a manager-facing surface.
Practical habits
A few patterns pay off: start broad (all people, full team), spot the outlier, then filter in to investigate. Pair a date range with a filter for focused questions—“backend team, last quarter” is one period and one cohort away. Bookmark the views you return to (the URL carries page, tab, and date range). To compare two groups side by side, use the Calibrate page rather than flipping filters. And if a number looks stale, a hard refresh (Ctrl/Cmd-Shift-R) forces fresh data.
What Pensero Measures
31 metrics across 8 categories. Each table gives the definition and which direction is better. Direction is a guide, not a goal—read every metric against the context above and the patterns at the end of this section.
Delivery — output and capacity
Metric | What it measures | Direction |
Total delivery | Delivery points completed in a period. Absolute output volume; use for capacity and workload. | ↑ higher |
Delivery per headcount | Average weekly delivery per active engineer. Normalizes output across team sizes. | ↑ higher |
Active headcount | Engineers with completed work in the period. Team size, not performance. | — |
Quality — code quality and technical excellence
Metric | What it measures | Direction |
Defect rate | Share of delivery spent fixing bugs the team introduced. Direct quality signal. | ↓ lower |
Rework rate | Share of delivery later rewritten. Churn, shifting requirements, or debt. | ↓ lower |
Revert rate | Share of delivery rolled back. Production stability and risk. | ↓ lower |
Duplicate code rate | Share of delivery with duplicated patterns. Debt and maintainability. | ↓ lower |
Code coverage | Share of code with automated tests. Testing discipline. Needs coverage integration. | ↑ higher |
Efficiency — speed and bottlenecks
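All time-based efficiency metrics use P90 (the 90th percentile): the value that 90% of items come in at or under. That surfaces the slow tail where bottlenecks show up, without letting a handful of extreme outliers dominate the number.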
Metric | What it measures | Direction |
Cycle time | P90 ticket-assignment to PR-merge. End-to-end delivery speed. Needs ticketing integration. | ↓ lower |
Time to merge | P90 PR-creation to merge. Review-process speed after code is written. | ↓ lower |
Time to approve | P90 PR-creation to first approval. Review responsiveness. | ↓ lower |
Time to comment | P90 PR-creation to first comment. Earliest responsiveness signal. | ↓ lower |
Waste rate | Share of PRs closed without ever merging—abandoned, superseded, or stale work. | ↓ lower |
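As a rough illustration of what a P90 figure means, the sketch below uses the nearest-rank definition of a percentile. The exact interpolation method Pensero uses is not stated, so treat the method and the sample data as assumptions.

```python
def p90(durations: list[float]) -> float:
    """Nearest-rank 90th percentile: smallest value with at least 90% of samples at or below it."""
    if not durations:
        raise ValueError("need at least one duration")
    ordered = sorted(durations)
    rank = (9 * len(ordered) + 9) // 10   # ceil(0.9 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]

# Hypothetical hours from PR creation to merge for ten PRs
hours_to_merge = [2, 3, 4, 5, 6, 8, 10, 12, 30, 200]
print(p90(hours_to_merge))  # 30
```

In this sample the single 200-hour PR does not set the figure, but the 30-hour one does, so the metric still reflects the slow end of the distribution.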
AI Impact — adoption and effectiveness
Requires AI tool integration (Cursor, GitHub Copilot, etc.).
Metric | What it measures | Direction |
AI-assisted % | Share of code lines written with AI assistance. Adoption and leverage. | ↑ higher |
AI cost | Total spend on AI coding tools, and per-engineer. Read against output and quality gains. | context |
Tokens per delivery | AI tokens consumed per delivery point. Usage efficiency. | ↓ lower |
Delivery lift | Output multiplier vs. a prior period (e.g. 1.4×). Productivity change alongside AI use. | ↑ higher |
User adoption | Share of active engineers using AI tools at all. Org-wide reach. | ↑ higher |
Collaboration (Reviews) — review quality and knowledge sharing
Metric | What it measures | Direction |
Collaboration ratio | Share of delivery on enablement (reviews, pairing, mentoring). Balance of output vs helping. | balance |
PR review ratio | Review effort vs creation effort. Review coverage across the team. | balance |
PR review usefulness | Share of review comments authors mark useful. Feedback quality. | ↑ higher |
PR addressed rate | Share of review comments that led to changes. Iterative development. | ↑ higher |
PR pairing rate | Share of PRs with multiple authors. Pairing and knowledge sharing. | ↑ higher |
Scope of Work — investment mix
Requires ticketing integration with work-type labels. Targets shift heavily by company stage.
Metric | What it measures | Direction |
Roadmap alignment | Share of delivery mapped to roadmap/epics. Strategic vs ad-hoc execution. | ↑ higher |
New stuff % | Share of delivery on net-new features. Innovation investment. | context |
Improvement % | Share of delivery on enhancements to existing features. Product polish. | balance |
KTLO % | Keeping-the-lights-on: maintenance, bug fixes, tech debt. | ↓ lower |
Performance % | Share of delivery on performance and scalability. | context |
Talent — composition and knowledge risk
Metric | What it measures | Direction |
Talent density | Share of team who are high performers (“get things done”). Top 20% often drive 50%+ of output. | ↑ higher |
Knowledge gaps | Share of codebase with only 1–2 experts. Bus-factor and succession risk. | ↓ lower |
Financial — accounting and compliance
Metric | What it measures | Direction |
Capitalizable % | Share of delivery qualifying for capitalization under accounting rules. New development typically qualifies; maintenance doesn’t. | — |
Reading metrics together
No single metric tells the whole story. The combinations below are the ones worth recognizing on sight.
Healthy patterns
High delivery per HC + low defect rate — productive and quality-conscious.
High collaboration ratio + high delivery — strong enablement without sacrificing output.
High roadmap alignment + balanced scope mix — strategic execution, healthy portfolio.
High AI adoption + high delivery lift — AI tools are actually driving productivity.
Warning patterns
Low cycle time + high waste rate — merges are fast, but a lot of work is started and then abandoned—a planning or requirements signal.
High KTLO % + low new stuff % — debt burden is crowding out innovation.
Low collaboration ratio + high knowledge gaps — siloing is building succession risk.
High defect rate + low code coverage — testing discipline has slipped.
High delivery + high revert rate — speed is winning over stability.
Categories at a glance
Category | Metrics | Primary use | Data required |
Delivery | 3 | Productivity, capacity planning | Git data |
Quality | 5 | Code quality, technical excellence | Git + optional coverage |
Efficiency | 5 | Process speed, bottlenecks | Git + optional ticketing |
AI Impact | 5 | AI adoption and effectiveness | AI tool integration |
Collaboration | 5 | Knowledge sharing, reviews | Git data |
Scope of Work | 5 | Investment mix, strategic alignment | Ticketing integration |
Talent | 2 | Composition, knowledge risk | Git data |
Financial | 1 | Accounting and compliance | Ticketing integration |
How Delivery Is Scored
Delivery is the metric most conversations hinge on, so it helps to know how it’s built. The short version: every contribution is scored as magnitude × complexity, boilerplate is filtered out first, penalties adjust for work that was undone or duplicated, and scores roll up cleanly from individual to org. Understanding this lets you interpret a score correctly and explain it to an engineer.
The formula
Delivery score = Magnitude × Complexity
Magnitude = how big the change is (size). Complexity = how hard it was (difficulty).
Complexity measures the task, not the person. A junior engineer on an architectural change earns high complexity; a staff engineer fixing a typo earns low complexity. The system credits what was built, not who built it.
Magnitude — size of the change
A non-linear T-shirt scale. A Medium isn’t twice a Small—bigger changes have disproportionate impact. Maximum is XL regardless of line count, which caps any attempt to inflate by volume.
Size | Value | Typical meaning |
Zero | 0.0 | No relevant changes (closed PR, machine-generated code) |
XXS | 0.15 | Single-line bug fix |
XS | 0.5 | A few lines; documentation typo |
S | 1.0 | Minor fix or comment updates (under ~100 lines) |
M | 3.0 | Small feature or refactor (~50–200 lines) |
L | 5.0 | Significant feature (~200–1,000 lines) |
XL | 8.0 | Major feature or architecture change (over ~1,000 lines) |
Complexity — difficulty of the work
Scored 1.0–3.0 by an AI assessment across five competencies, not by gut feel. It reflects the difficulty demonstrated in the work itself.
Level | Range | Meaning |
Level 1 | 1.0–1.5 | Junior work: follows existing patterns, basic implementation |
Level 2 | 1.5–2.5 | Mid–senior work: independent feature, cross-team coordination |
Level 3 | 2.5–3.0 | Staff+ work: architectural decisions, org-wide impact |
The five competencies: Code (implementation quality), System Design (architecture and scale), Delivery (execution and coordination), Communications (collaboration and docs), and Ownership (decision-making and initiative). Not every competency applies to every artifact—a document has no Code score. The AI assesses a level per applicable competency, weights them by the contributor’s career level, and sums to a single complexity figure.
Worked examples
Same formula, very different shapes of work:
Architectural decision (small code, high difficulty): 45 lines changing DB connection pooling, org-wide blast radius. Magnitude S (1.0) × Complexity 3.0 = 3.0 points.
Boilerplate-heavy feature (large code, low difficulty): 800 lines adding endpoints across 20 files, following an established pattern. Magnitude L (5.0) × Complexity 1.0 = 5.0 points.
Critical refactor (large and hard): 1,200 lines refactoring authentication, security implications, multi-team rollout. Magnitude XL (8.0) × Complexity 3.0 = 24.0 points.
How complexity sums (one PR, IC4 weights): Code L2×0.25 + System Design L3×0.30 + Delivery L2×0.20 + Communications L2×0.15 + Ownership L2×0.10 = 2.3. At magnitude L (5.0), delivery = 11.5 points.
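The arithmetic above can be reproduced with a short sketch. The magnitude values and IC4 weights are taken from this section; the function and dictionary names are hypothetical, and the sketch assumes all five competencies apply (how weights are adjusted when one does not apply is not described here).

```python
# Magnitude values from the T-shirt scale
MAGNITUDE = {"Zero": 0.0, "XXS": 0.15, "XS": 0.5, "S": 1.0, "M": 3.0, "L": 5.0, "XL": 8.0}

# IC4 competency weights from the worked example (they sum to 1.0)
IC4_WEIGHTS = {
    "code": 0.25,
    "system_design": 0.30,
    "delivery": 0.20,
    "communications": 0.15,
    "ownership": 0.10,
}

def complexity(levels: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of the assessed level (1-3) for each competency."""
    return sum(weights[name] * level for name, level in levels.items())

def delivery_score(size: str, complexity_value: float) -> float:
    """Delivery score = magnitude x complexity."""
    return MAGNITUDE[size] * complexity_value

levels = {"code": 2, "system_design": 3, "delivery": 2, "communications": 2, "ownership": 2}
c = round(complexity(levels, IC4_WEIGHTS), 2)   # rounded to hide floating-point noise
print(c, round(delivery_score("L", c), 2))      # 2.3 11.5
```

The same two functions reproduce the three worked examples: `delivery_score("S", 3.0)` gives 3.0, `delivery_score("L", 1.0)` gives 5.0, and `delivery_score("XL", 3.0)` gives 24.0.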
What counts as work
Pensero scores more than code, so enablement work is visible rather than invisible. Each artifact gets its own magnitude × complexity score.
Artifact | What it captures |
Pull request | Code contributions — features, fixes, refactors. Scored only when merged. |
Trunk-based commits | Direct commits to main, grouped by author + day + repo into one unit. |
PR review | Code-review effort: comments, approvals, review discussion. |
Ticket / ticket review | Jira / Linear work items and reviews on them. |
Document / doc review | Design docs, RFCs, architecture writing and feedback. |
Communication | Technical discussion and Q&A (requires integration). |
Why it matters: senior engineers get credit for reviews and mentorship, technical writers show up in delivery, and collaboration is rewarded rather than penalized. Open PRs are tracked for visibility but don’t score until merged.
How each artifact is scored
Each artifact type has its own rules for when it's scored and how credit is split when several people contribute. The differences matter when you're explaining a number to someone.
Pull requests score on merge, never while open; a PR closed without merging is sized Zero and earns nothing. The delivery date is the merge date, not the date the PR was opened. When a PR has multiple authors, Pensero splits it by git blame: it works out who authored each changed line, converts that to each person’s share of the total lines changed, and splits both magnitude and delivery proportionally, giving each contributor their own artifact. So a 10-point PR that’s 70/30 by lines becomes 7.0 and 3.0 points. This gives fair credit for pairing without one person banking the whole PR.
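The proportional split for a multi-author PR is simple to sketch. The function name and the line counts are hypothetical, and for brevity this splits only the final score, whereas the text says magnitude and delivery are both split.

```python
def split_by_blame(pr_score: float, lines_by_author: dict[str, int]) -> dict[str, float]:
    """Split a PR's score by each author's share of the changed lines."""
    total = sum(lines_by_author.values())
    if total == 0:
        return {author: 0.0 for author in lines_by_author}
    return {author: pr_score * n / total for author, n in lines_by_author.items()}

# A 10-point PR where blame attributes 140 changed lines to one author and 60 to the other
shares = split_by_blame(10.0, {"alice": 140, "bob": 60})
print(shares)  # {'alice': 7.0, 'bob': 3.0}
```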
Documents (design docs, RFCs, wiki pages) score each time they change, but with a 30-day window to prevent edit-padding. Update a doc within 30 days of your last change and the existing artifact is updated in place — its date moves forward and the score recalculates, no new artifact. After 30 days the doc is treated as complete: the artifact freezes and later edits don't create new ones. Multi-author docs split by blame analysis, same as PRs.
Tickets score ticket creation — specifically the quality of how the work is defined (clear requirements, acceptance criteria, context), not its assignment or completion. This is deliberately separate from the work that follows: the person who writes a thorough user story earns ticket-artifact credit, and the person who implements it earns PR credit. Two distinct contributions, scored independently, so good planning is recognized and nothing is double-counted.
Communications (Slack, Teams, Google Chat) score technical discussion tied to real work — messages linked to a merged PR, a completed ticket, or a document, or replies in such a thread. Casual chat, scheduling, and status updates are excluded. Messages are batched hourly and grouped one artifact per thread, per person: all of your messages in a thread combine into a single artifact dated to the thread's start, updated if you add more. Each participant is scored separately for their own contributions. What's explicitly not measured: message count, response time, or reactions — volume isn't the point, substance tied to work is.
Boilerplate filtering
Raw line counts can be inflated by code nobody really authored, so Pensero strips boilerplate before sizing the change: lock files, generated schemas (Protobuf, GraphQL, OpenAPI), migrations and compiled assets, large test fixtures, and whitespace-only reformatting. Magnitude is then assessed on the cleaned diff.
Example: a 770-line PR = 450 lines of package-lock.json + 200 lines of test fixtures + 120 lines of real feature code. After filtering, only the 120 feature lines count, so it scores as Medium, not Large.
This cuts both ways: it stops volume-padding, and it also means necessary boilerplate committed alongside real work doesn’t unfairly inflate or distort the score.
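A path-based filter is one plausible way to approximate this step. The pattern list below is an assumption made up for illustration (the real rules are not published here and likely inspect content as well as paths); the file names mirror the 770-line example.

```python
from fnmatch import fnmatch

# Hypothetical patterns; not Pensero's actual rule set
BOILERPLATE_PATTERNS = ["package-lock.json", "*.lock", "*/fixtures/*", "*.min.js"]

def is_boilerplate(path: str) -> bool:
    """True when the file path matches any boilerplate pattern."""
    return any(fnmatch(path, pattern) for pattern in BOILERPLATE_PATTERNS)

def cleaned_line_count(changed_lines_by_file: dict[str, int]) -> int:
    """Count changed lines after dropping boilerplate files."""
    return sum(n for path, n in changed_lines_by_file.items() if not is_boilerplate(path))

pr = {
    "package-lock.json": 450,
    "tests/fixtures/users.json": 200,
    "src/feature.py": 120,
}
print(sum(pr.values()), cleaned_line_count(pr))  # 770 120
```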
Penalties
Penalties adjust scores so they reflect delivered value, not raw activity. They’re additive but capped at 100% (a score can’t go below zero), and they’re recalculated as new PRs merge—which is why scores can shift after the fact.
Penalty | Amount | When it applies |
Revert | 100% | PR reverts a previous PR. Both the original and the revert lose credit. |
Duplicate code | 100% | PR duplicates another PR’s code (e.g. same branch to a second target). |
Duplicate document | 100% | Document duplicates another’s content. |
Release merge | Partial | PR merges other PRs; only net-new work counts, credit stays with the originals. |
Rework and bugfix deductions exist in the system but are currently disabled for customers.
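“Additive but capped at 100%” can be read as follows. This is a sketch under that reading; the function name is hypothetical, and the 0.5 used for a partial release-merge penalty is an invented value, since the table gives no number.

```python
def apply_penalties(score: float, penalties: list[float]) -> float:
    """Sum penalty fractions (1.0 == 100%), cap the total at 100%, and reduce the score."""
    total = min(1.0, sum(penalties))
    return score * (1.0 - total)

print(apply_penalties(5.0, []))           # 5.0  no penalty
print(apply_penalties(5.0, [1.0]))        # 0.0  reverted PR
print(apply_penalties(5.0, [1.0, 1.0]))   # 0.0  two penalties still cannot go below zero
print(apply_penalties(10.0, [0.5]))       # 5.0  hypothetical partial release-merge penalty
```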
Reading penalties as a manager
Revert signals a quality or testing gap—a coaching moment, not a scoring quirk: “what testing would have caught this before production?”
Release merge and duplicate-code penalties are normal in feature-branch and multi-environment (dev → staging → prod) workflows. They just prevent double-counting; no concern.
Duplicate-document penalties catch copy-paste gaming and also accidental uncoordinated duplicate design docs.
False positives can be flagged for review and overridden by admins in rare cases.
How scores roll up
Aggregation is simple summation with no tricks. Artifact scores (already post-penalty) sum to the individual total, and individuals sum to the org total. Work is attributed to its original owner; reviews to the reviewer; collaborative work like multi-authored PRs is split automatically. Only completed artifacts in the date range count—merge date for PRs, completion date for tickets, publish date for documents.
Timing to set expectations with your team: an artifact scores within minutes of merge, then a background job checks for penalties over the next day or two and may adjust the score retroactively. Day-to-day numbers fluctuate; weekly is the right cadence to read delivery.
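Since the roll-up is plain summation, it fits in a few lines. The record fields (`owner`, `score`, `completed_on`) are hypothetical names; scores are assumed to be post-penalty already, as described above.

```python
from collections import defaultdict
from datetime import date

def roll_up(artifacts: list[dict], start: date, end: date) -> tuple[dict[str, float], float]:
    """Sum artifact scores per person for artifacts completed in [start, end], then sum to the org."""
    per_person: dict[str, float] = defaultdict(float)
    for artifact in artifacts:
        if start <= artifact["completed_on"] <= end:
            per_person[artifact["owner"]] += artifact["score"]
    return dict(per_person), sum(per_person.values())

artifacts = [
    {"owner": "alice", "score": 7.0, "completed_on": date(2024, 5, 6)},   # share of a merged PR
    {"owner": "bob", "score": 3.0, "completed_on": date(2024, 5, 6)},     # other share of that PR
    {"owner": "alice", "score": 1.5, "completed_on": date(2024, 5, 8)},   # PR review
    {"owner": "bob", "score": 5.0, "completed_on": date(2024, 4, 29)},    # outside the range
]
per_person, org_total = roll_up(artifacts, date(2024, 5, 6), date(2024, 5, 12))
print(per_person, org_total)  # {'alice': 8.5, 'bob': 3.0} 11.5
```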
Reading delivery well
The healthy and warning signals below matter more than any single number.
Healthy signs
Steady week-to-week rather than erratic.
A balanced artifact mix—not 100% PRs, but reviews and docs too.
Complexity that tracks seniority (a senior’s work shows higher complexity than a junior’s).
Warning signs
Wide variance within a team (some carrying far more than others) — load imbalance or skill gap.
A sustained declining trend — process friction or mounting tech debt.
Very low review contribution across the team — siloing.
Context first: compare engineers at similar levels, account for team type (platform teams deliver fewer, higher-complexity points than product teams), and remember some real work—verbal mentorship, on-call, dev-tooling—isn’t captured at all.
Common questions
Can engineers game their score?
It’s difficult and self-defeating. Boilerplate filtering removes padding, magnitude caps at XL no matter the line count, complexity is assessed rather than counted, and penalties strip duplicated or reverted work. Because the system spans many metrics, gaming delivery usually degrades quality or efficiency. Trivial-change spam stays XXS × Level 1—effectively worthless. The real defense is cultural: when scores drive support rather than punishment, the incentive to game disappears.
What about glue work that doesn’t show up?
Reviews, RFCs, design docs, and technical discussion are scored, and high-complexity unblocking work scores well. But verbal mentorship, process improvements like CI/CD, and on-call firefighting aren’t captured yet. So when delivery looks low, ask: “are you doing glue work or mentorship that isn’t captured?” Treat the metric as the start of the conversation.
Why did the score change retroactively?
Penalties run asynchronously after merge. A score can appear immediately, then drop a day later when overlap with another PR, a revert, or a duplicate is detected. This is expected—check weekly, not daily.
How do I explain a low score to an engineer?
Gather context first: check the artifact mix (mostly reviews?), recent penalties (reverts or duplicates?), complexity (strategic or architectural work takes longer), timing (ramping on a new project?), and external factors (PTO, on-call, incidents). Then frame it with curiosity, not accusation—“I see your delivery is lower than usual, what’s taking up your time and how can I help?” Metrics reveal symptoms; the conversation finds the diagnosis.
How Quality Is Measured
The five quality metrics are defined in What Pensero Measures. This section covers what sits underneath them—how Pensero detects a bug, traces it to the PR that caused it, and spots rework, reverts, and duplication automatically. Knowing the detection logic lets you trust the numbers and explain them without sounding like you’re assigning blame.
Detection runs automatically on every merge
When a PR merges, Pensero analyzes it within minutes: it decides whether the PR fixes a bug, traces a bugfix back to the change that introduced it, compares the diff against recent PRs to catch rework, reverts, and duplicates, and pulls in test-coverage data if a coverage tool is connected. Results land in the quality metrics within hours, and historical data updates if a bugfix links back to an older PR.
Consistent and bias-free: the same analysis runs for everyone, engineers can see their own metrics and why they moved, and the focus is the trend over time—bugs happen, what matters is the direction.
How bugs are traced to their source
The hard question is: when a bugfix lands, how does Pensero know which PR caused the bug? It identifies the files and lines the fix touches, walks git history to find which PR last changed those lines, and links the two—defect introduced ↔ defect fixed.
Example: on March 1, Jane merges PR #100 adding authentication logic. On March 15, Tom merges PR #150, “Fix login crash,” touching lines 45–60 of auth.py. Those lines trace back to Jane’s PR #100. Jane’s defect rate reflects the source; Tom gets credit for the fix.
So engineers are credited for fixing bugs, see the impact when their own code introduces one, and the org sees the total cost of defects—all without a manager having to adjudicate.
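The linking step can be pictured as a lookup from touched lines to the PR that last changed them. The sketch below fakes that lookup with an in-memory dictionary; in practice the index would presumably come from git blame on the commit just before the fix, which is an assumption here. PR #87 is an invented earlier PR added only to make the index realistic.

```python
from collections import Counter

# Hypothetical blame index: (file, line number) -> PR that last changed that line before the fix
blame_index = {("auth.py", n): 87 for n in range(1, 40)}
blame_index.update({("auth.py", n): 100 for n in range(40, 70)})

def source_prs(fix_lines: list[tuple[str, int]], blame: dict[tuple[str, int], int]) -> Counter:
    """Count how many of the lines touched by a fix trace back to each earlier PR."""
    return Counter(blame[key] for key in fix_lines if key in blame)

# PR #150 touches lines 45-60 of auth.py
fix_150 = [("auth.py", n) for n in range(45, 61)]
sources = source_prs(fix_150, blame_index)
print(sources)  # Counter({100: 16})
```

All 16 touched lines trace to PR #100, so that PR is linked as the source of the defect.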
How each metric is detected
Defect rate — bugs the org introduced
AI reads each PR’s title, description, and diff to decide whether it fixes broken behavior—a crash, wrong result, logic error, or previously-failing functionality. It’s deliberately conservative: feature work, refactoring, performance tuning, security hardening, config changes, and dependency bumps don’t count. It then isolates which lines are the actual fix versus unrelated cleanup, so only the bug-fixing share of the PR counts toward the defect signal.
Rework rate — rewriting recent code
Pensero compares each PR line-by-line against PRs merged in roughly the last month in the same repo. When a high share of lines a recent PR added are now being removed or replaced, that’s rework—lower overlap is just normal evolution. The useful distinction for conversations: rework is replacing code because the first attempt was wrong (often from unclear requirements); refactoring improves structure while preserving behavior and shows up as improvement work, not rework. The metric is built to separate the two.
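The overlap test can be sketched as a simple ratio. Both the set-based line comparison and the 0.5 cutoff are assumptions for illustration; the text only says “a high share” without giving a threshold.

```python
REWORK_THRESHOLD = 0.5  # hypothetical cutoff, not a documented value

def rework_share(recent_added: set[str], now_removed: set[str]) -> float:
    """Fraction of the lines a recent PR added that the new PR removes or replaces."""
    if not recent_added:
        return 0.0
    return len(recent_added & now_removed) / len(recent_added)

added_last_week = {"def price(x):", "    tax = x * 0.2", "    fee = 3", "    return x + tax + fee"}
removed_today = {"    tax = x * 0.2", "    fee = 3", "    return x + tax + fee"}

share = rework_share(added_last_week, removed_today)
print(share, share >= REWORK_THRESHOLD)  # 0.75 True
```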
Revert rate — rolled-back PRs
Detected from revert/rollback patterns in titles and git metadata, then verified by checking that the revert PR’s removed lines closely match the original’s added lines. This is a count-based signal (a PR was either rolled back or it wasn’t), so size doesn’t factor in. A revert is the most severe quality signal—code that couldn’t be fixed incrementally and had to be undone.
Duplicate code rate — copy-pasted code
Content from each changed line is normalized (whitespace and formatting ignored), hashed, and matched against earlier PRs. A high share of identical-hash lines flags duplication. Because it’s exact content matching, renaming variables or reformatting won’t hide it—and won’t cause false hits from formatting alone.
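A minimal sketch of normalize-hash-match, assuming normalization means stripping all whitespace and that SHA-256 is the hash; neither detail is specified in the text, and the function names are hypothetical.

```python
import hashlib

def normalize(line: str) -> str:
    """Drop all whitespace so formatting differences do not change the hash (assumed rule)."""
    return "".join(line.split())

def line_hash(line: str) -> str:
    return hashlib.sha256(normalize(line).encode("utf-8")).hexdigest()

def duplicate_share(new_lines: list[str], earlier_hashes: set[str]) -> float:
    """Share of non-blank changed lines whose normalized hash already appeared in earlier PRs."""
    hashes = [line_hash(line) for line in new_lines if normalize(line)]
    if not hashes:
        return 0.0
    return sum(h in earlier_hashes for h in hashes) / len(hashes)

earlier = {line_hash("total = a + b"), line_hash("return total")}
new_pr = ["total=a+b", "    return total", "print(total)"]
print(round(duplicate_share(new_pr, earlier), 2))  # 0.67
```

The first two lines match despite different spacing and indentation, which is the property that keeps reformatting from hiding a copy or triggering a false hit.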
Code coverage — test coverage
Coverage comes from an external tool (Codecov today; others planned). On merge, Pensero fetches the diff coverage—the share of new lines tested—and rolls it up weighted by PR size, so a large under-tested PR moves the number more than a tiny fully-tested one. Without a coverage integration this metric shows N/A. One caveat worth repeating to teams: coverage shows what’s tested, not whether the tests are any good.
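The effect of size weighting is easiest to see with two PRs. This sketch assumes the weight is the number of new lines in each PR; the function name and the figures are illustrative.

```python
from typing import Optional

def weighted_coverage(prs: list[tuple[int, float]]) -> Optional[float]:
    """Size-weighted diff coverage. Each PR is (new_lines, diff_coverage between 0 and 1)."""
    total_lines = sum(lines for lines, _ in prs)
    if total_lines == 0:
        return None  # nothing to measure, shown as N/A
    return sum(lines * coverage for lines, coverage in prs) / total_lines

# A large under-tested PR and a tiny fully tested one
prs = [(900, 0.40), (100, 1.00)]
print(round(weighted_coverage(prs), 2))  # 0.46
```

An unweighted average of the two PRs would report 0.70, which overstates how much of the new code is actually tested.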
Reading the metrics together
As with delivery, combinations tell the story. The pairings below map a metric pattern to its likely cause—use them to read a team at a glance.
Pattern | Likely meaning |
Low defect + high coverage | Testing discipline is working. |
High defect + low coverage | Insufficient testing; bugs slipping through. |
High rework + high duplicate | Poor code organization—messy, hard to maintain, frequently rewritten. |
High rework + high revert | Unstable requirements or rushed code. |
Quality sliding across consecutive periods | Compounding pressure: ramping hires, deadlines, or accruing tech debt. |
Context still governs. High rework is normal for an early-stage product finding fit and expected briefly after a big refactor, but concerning in a mature product. A blanket coverage mandate tends to backfire—engineers write hollow tests to hit the number. Better to require coverage on critical paths (auth, payments, data integrity) and watch the trend.
The delivery-vs-quality quadrant
Pensero plots a quadrant chart—delivery on the x-axis, a quality score (the inverse of defect rate) on the y-axis—to show the trade-off between shipping and stability at a glance. The four quadrants:
Quadrant | What it suggests |
High delivery, high quality | Sustainable practices—shipping steadily without introducing bugs. |
High delivery, low quality | Shipping fast but introducing defects—often a sign of pressure to ship without enough testing. |
Low delivery, high quality | Careful but slow—may reflect genuinely complex work or an overly cautious culture. |
Low delivery, low quality | Where support is most likely needed—worth understanding why before concluding anything. |
Read the quadrant at the population level, not as a verdict on individuals. A cluster in “high delivery, low quality” points at a systemic incentive to ship without testing; a cluster in “low delivery, high quality” may signal work that’s harder than it looks or a culture that’s over-indexing on caution. Use a person’s position as the opening of a conversation—“what’s the work in this area like right now?”—rather than a label. The same caveats from delivery scoring apply: complex and enablement work shifts where someone lands.
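For intuition, quadrant placement amounts to two threshold checks. The sketch assumes the quality score is `1 - defect_rate` and uses made-up cutoffs; where the product actually draws the dividing lines is not stated.

```python
def quadrant(delivery: float, defect_rate: float, delivery_cut: float, quality_cut: float) -> str:
    """Place one data point in the delivery-vs-quality quadrant."""
    quality = 1.0 - defect_rate                      # assumed form of "inverse of defect rate"
    d = "High" if delivery >= delivery_cut else "Low"
    q = "high" if quality >= quality_cut else "low"
    return f"{d} delivery, {q} quality"

# Hypothetical cutoffs: 30 delivery points and a quality score of 0.9
print(quadrant(42.0, 0.05, 30.0, 0.9))  # High delivery, high quality
print(quadrant(42.0, 0.25, 30.0, 0.9))  # High delivery, low quality
print(quadrant(12.0, 0.05, 30.0, 0.9))  # Low delivery, high quality
```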
Using quality data in conversations
Lead with curiosity, not a verdict. “Your defect rate is higher than usual—what’s going on, and where do you need support?” surfaces the real cause; “your defect rate is unacceptable” just creates defensiveness. Common root causes worth probing: unfamiliar or complex area (pair with a senior), no time for tests (revisit priorities and deadlines), an edge case the engineer didn’t recognize as a bug (testing guidance), or shifting requirements (escalate to product for clarity).
In the product, org-wide rates and trend charts live on the Quality page; per-person quality sits in the Quality › People tab, sortable and drillable for 1-on-1 prep. Use it to find who might need support, not to rank people.
Can engineers game quality metrics?
Mostly no. Defect detection reads PR content, so relabeling a bugfix as a “refactor” doesn’t fool it. Rework and duplication are line-level content matches—renaming or reformatting won’t hide them. Reverts come straight from authoritative git history.
The one soft spot is coverage: you can write tests that exercise code without really asserting anything. The defense is review—check test quality, not just the percentage.
How Reviews & Collaboration Are Measured
The five review metrics are defined in What Pensero Measures. This section covers the mechanics underneath—how a review comment gets judged useful or not, how Pensero knows whether feedback was acted on, and how it detects pairing—plus the collaboration patterns worth recognizing. Reviews are where knowledge sharing and peer quality-control happen, so these signals tell you whether your org is learning together or working in silos.
What counts as review work
Review (or enablement) work is anything that helps someone else ship: reviewing PRs, documents, or tickets; co-authoring through pairing or mobbing; and technical help in messages and Q&A. Approving your own PR and creating your own work don’t count. The metrics weigh this enablement effort against creation effort—the point isn’t to maximize either, but to see the balance.
How the harder metrics are detected
Review usefulness — is the feedback substantive?
AI classifies every review comment. It counts as useful when it flags a defect (bug, bad logic, security issue), a weakness (performance problem, missing edge case), or a concrete suggestion for a better approach. It’s noise when it’s a bare “LGTM,” a style nitpick, or a non-technical aside. The metric is the useful share—so it measures whether reviews catch real issues or just rubber-stamp.
Addressed rate — was the feedback acted on?
Pensero checks whether the author pushed changes after a comment and records one of four outcomes: fully addressed, partially addressed, ignored, or undetermined (e.g. a comment landing right before merge). Only useful comments count toward the rate (noise is excluded), and the rate is weighted by severity, so resolving a critical bug counts for more than acting on a minor suggestion. The revealing combination is usefulness and addressed rate together: useful feedback that’s consistently ignored is a cultural signal, not a tooling one.
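The weighting is easier to picture with a small calculation. This is a sketch only: the severity weights, the half credit for a partially addressed comment, and the choice to leave undetermined comments out of the denominator are all illustrative assumptions, not Pensero's actual formula.

```python
from dataclasses import dataclass

# Hypothetical weights and credits, chosen only for illustration.
SEVERITY_WEIGHT = {"critical": 3.0, "major": 2.0, "minor": 1.0}
STATUS_CREDIT = {"addressed": 1.0, "partial": 0.5, "ignored": 0.0}


@dataclass
class Comment:
    useful: bool    # noise comments are excluded from the rate
    severity: str   # key into SEVERITY_WEIGHT
    status: str     # "addressed", "partial", "ignored", or "undetermined"


def addressed_rate(comments):
    """Severity-weighted share of useful feedback that was acted on."""
    earned = total = 0.0
    for c in comments:
        if not c.useful or c.status == "undetermined":
            continue  # assumption: neither counts toward the rate
        weight = SEVERITY_WEIGHT[c.severity]
        total += weight
        earned += weight * STATUS_CREDIT[c.status]
    return earned / total if total else None


comments = [
    Comment(True, "critical", "addressed"),
    Comment(True, "minor", "ignored"),
    Comment(False, "minor", "ignored"),       # noise: skipped
    Comment(True, "major", "undetermined"),   # skipped
]
print(addressed_rate(comments))  # 0.75
```

With equal weights the same comments would score 0.5; the weighting is what lets one resolved critical issue outweigh one ignored minor one.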
Pairing rate — collaborative authorship
A PR counts as paired when it has multiple commit authors, Co-authored-by tags, or other evidence of joint work. Pairing is deliberately expensive—two people on one task—so the goal isn’t a high number but the right uses: onboarding, unfamiliar or complex problems, and spreading knowledge in critical systems.
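As a rough illustration of the first two signals, the sketch below treats a PR as paired when its commits name more than one distinct person, counting both commit authors and Co-authored-by trailers. The function name and input shape are hypothetical, and the "other evidence of joint work" mentioned above is not modeled.

```python
import re

# Git trailer convention: "Co-authored-by: Name <email>" on its own line.
CO_AUTHOR = re.compile(
    r"^Co-authored-by:\s*.+<(.+?)>\s*$", re.IGNORECASE | re.MULTILINE
)


def is_paired(commits):
    """commits: list of (author_email, commit_message) tuples for one PR."""
    people = set()
    for author_email, message in commits:
        people.add(author_email.lower())
        # Add every co-author named in the commit message trailers.
        people.update(email.lower() for email in CO_AUTHOR.findall(message))
    return len(people) > 1


solo = [("ana@example.com", "Add search"), ("Ana@example.com", "Fix typo")]
pair = [("ana@example.com", "Add search\n\nCo-authored-by: Bo <bo@example.com>")]
print(is_paired(solo), is_paired(pair))  # False True
```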
Collaboration patterns to recognize
The patterns matter more than any single rate. Each below is a combination you can spot at a glance and what it tends to mean.
Pattern | What it looks like and means |
Healthy collaboration | Balanced creation and review, useful feedback that gets acted on, some pairing. Knowledge moves and quality holds. |
Siloing risk | Very low review and pairing, too few reviews to even measure quality. Knowledge concentrates—bus-factor and coverage risk if someone leaves or takes leave. |
Review theater | Lots of review activity but low usefulness and low addressed rate. The motions of review without the value—“LGTM” culture, feedback ignored, a false sense of safety. |
Over-collaboration | Review far exceeds creation, constant pairing. Feedback quality may be high, but delivery drags under process overhead; can be right during big refactors or heavy onboarding. |
Review theater is the one most worth watching for: high activity makes it look healthy on a dashboard, but low usefulness plus low addressed rate means the review process is a checkbox, not a quality gate.
The delivery-vs-reviews quadrant
A parallel to the delivery-vs-quality quadrant: delivery on the x-axis, enablement (review) activity on the y-axis. It surfaces collaboration style at a glance.
Quadrant | What it suggests |
High delivery, high reviews | Multipliers—shipping their own work and lifting others. Seniors often land here. |
High delivery, low reviews | Solo contributors—shipping independently; watch for knowledge-sharing risk. |
Low delivery, high reviews | Enablers—focused on helping others; appropriate for some roles, worth a check for juniors. |
Low delivery, low reviews | May be blocked or need support—understand why before concluding. |
Read it at the population level. A senior sitting in “solo contributor” may need a nudge on mentorship expectations; a junior parked in “enabler” may be reviewing when they should be building; a wide healthy spread is a balanced culture. As always, position opens a conversation rather than settling one—role and project phase move where someone lands.
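If it helps to see the quadrant as a rule, here is a minimal sketch. It assumes the cut between "high" and "low" is the population median on each axis, which is an illustrative choice; the guide does not say where Pensero draws the lines.

```python
from statistics import median

LABELS = {
    (True, True): "multiplier",
    (True, False): "solo contributor",
    (False, True): "enabler",
    (False, False): "may be blocked or need support",
}


def quadrants(people):
    """people: dict of name -> (delivery, review_activity)."""
    delivery_cut = median(d for d, _ in people.values())
    review_cut = median(r for _, r in people.values())
    return {
        name: LABELS[(d > delivery_cut, r > review_cut)]
        for name, (d, r) in people.items()
    }


team = {"a": (10, 8), "b": (12, 1), "c": (3, 9), "d": (2, 2)}
print(quadrants(team))
```

The output is a starting point for the population-level read described above, not a label to attach to a person.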
Reading review ratios in context
There is no universal “right” review ratio—it tracks role, seniority, and project phase. Seniors run higher (they’re multipliers, mentoring and unblocking); juniors run lower while they learn and build; onboarding periods push everyone’s ratios up. Rather than mandating a number, watch the quality signals (usefulness, addressed rate) and the trend. A low pairing rate isn’t a problem on its own—but low pairing plus low review ratio plus high knowledge gaps is a real siloing signal.
In the product, org-wide review metrics and trends are on the Reviews page; per-person collaboration sits in the Reviews → People tab for spotting who’s siloed or over-extended before a 1-on-1.
Common questions
Should I mandate code reviews?
Most orgs already require an approval before merge; the real question is whether review is effective. Effective review shows up as substantive feedback that gets acted on without overwhelming people’s time. Ineffective review is rubber-stamping, ignored feedback, or so much review activity that it becomes pure overhead. Measuring usefulness and addressed rate—and recognizing people who give valuable reviews—beats mandating a quota.
Is a low pairing rate bad?
Not by itself. Pairing earns its cost on onboarding, unfamiliar or complex problems, and critical-system knowledge spread; solo work is fine for well-understood tasks and clearly-owned parallel work. It only becomes a red flag alongside low review ratio and high knowledge gaps.
Review usefulness looks low—what now?
Investigate before reacting. Low usefulness usually means people are rushing reviews, an “LGTM” norm has set in, reviewers lack the context to add value, or PRs are too large to review well. Sampling the low-usefulness comments reveals which. The fixes follow the cause: coach on what a good review catches (logic, not just style), encourage smaller focused PRs, and route PRs to reviewers with the right domain knowledge.
How Scope of Work Is Categorized
The six scope-of-work metrics are defined in What Pensero Measures. This section covers how every piece of work gets sorted into a category in the first place, and how to read the resulting investment mix. Think of scope of work as your engineering investment portfolio—where time actually goes, which is often not where you assume it goes.
How work gets classified
When work completes—a PR merges, a ticket closes, a document publishes—AI reads its content (title, description, commit messages, ticket labels) and sorts it into one of five categories. The signals it weighs are keywords (“add,” “fix,” “optimize,” “refactor”), structure (new files vs. modified files), ticket labels (bug, feature, enhancement), and the semantic purpose of the change.
Category | What lands here |
New stuff | Building something that didn’t exist—new features, modules, endpoints, services. |
Improvement | Enhancing something that already exists—adding to or refining a current feature. |
KTLO | Keeping the lights on—bug fixes (any severity), security patches, tech debt, dependency updates, infra maintenance. |
Performance | Scalability and optimization—caching, query tuning, sharding, architecture for scale. |
Other | Work that doesn’t fit the above. |
New stuff and improvement are the easy pair to confuse: “add user search” is new stuff (it didn’t exist); “add filters to user search” is improvement (enhancing what’s there). The five shipping categories sum to 100% of delivery.
Capitalizable % is derived, not classified. It’s simply new stuff plus improvement—the development that creates new assets (CapEx) versus maintenance and operations (OpEx). No separate tagging; it falls out of the categories above, which is what makes it audit-friendly.
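A short sketch of that derivation, assuming effort per category is available as a single number (the unit and the category keys here are hypothetical):

```python
def investment_mix(effort_by_category):
    """Return each category's share of delivery (in %) plus capitalizable %."""
    total = sum(effort_by_category.values())
    mix = {cat: 100 * effort / total for cat, effort in effort_by_category.items()}
    # Derived, not classified: new stuff plus improvement.
    mix["capitalizable"] = mix.get("new_stuff", 0) + mix.get("improvement", 0)
    return mix


mix = investment_mix(
    {"new_stuff": 30, "improvement": 25, "ktlo": 35, "performance": 5, "other": 5}
)
print(mix["capitalizable"])  # 55.0
```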
Roadmap alignment is configured, not inferred
Unlike the five categories, roadmap alignment isn’t an AI judgment about the nature of the work—it’s about whether work connects to your strategic tracking. Each org configures what “aligned” means: specific epic IDs, projects or initiatives, custom labels like “roadmap,” certain ticket states, or bespoke query logic matching your planning process. Ad-hoc bug fixes, unplanned experiments, incidents, and untagged work fall outside it. So roadmap alignment measures planning discipline—executing the plan versus firefighting—rather than what kind of code was written.
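To make "configured, not inferred" concrete, the sketch below models an org whose definition of aligned is "belongs to one of these epics, or carries the roadmap label." The epic IDs, field names, and the choice to count items rather than weight them by effort are all assumptions for illustration.

```python
# Hypothetical org configuration: what counts as "aligned".
ALIGNED_EPICS = {"EPIC-12", "EPIC-40"}
ALIGNED_LABELS = {"roadmap"}


def is_aligned(item):
    """item: dict with optional 'epic' (str) and 'labels' (list of str)."""
    if item.get("epic") in ALIGNED_EPICS:
        return True
    return bool(ALIGNED_LABELS & set(item.get("labels", [])))


def roadmap_alignment(items):
    """Share of completed work items (in %) that match the configured rules."""
    if not items:
        return None
    return 100 * sum(is_aligned(i) for i in items) / len(items)


done = [
    {"epic": "EPIC-12"},
    {"labels": ["roadmap", "bug"]},
    {"labels": ["bug"]},   # ad-hoc bug fix: outside the roadmap
    {},                    # untagged work: outside the roadmap
]
print(roadmap_alignment(done))  # 50.0
```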
Reading the investment mix
The mix is a portfolio, and the patterns matter more than any single percentage. Three worth recognizing:
Pattern | What it looks like and means |
Balanced portfolio | Most work tied to roadmap, real innovation capacity, steady polish, manageable maintenance. Executing strategy while holding quality. |
Tech-debt crisis | Maintenance consuming a large share, roadmap alignment and new-stuff both depressed. Firefighting mode—little room to innovate, and capitalizable % drops with it. |
Premature optimization | Heavy performance investment with low innovation for the stage. If user volume is low, scale work is stealing from features you actually need. |
Stage governs what “good” looks like—see the stage cues in the glossary. Broadly, new-stuff share starts high at MVP and declines as a product matures, while KTLO and improvement rise. Performance should be low pre-scale (don’t optimize prematurely) but climbs when you’re actually scaling—and low performance investment during hypergrowth is the warning sign, not low investment early on. The same number means opposite things at different stages.
Using scope data well
The highest-value move is comparing your actual mix against your intended one, then acting on the gap: “we’re at 38% KTLO, so let’s ring-fence 20% of next sprint for debt reduction.” Trends beat snapshots—a KTLO number climbing month over month, or new-stuff steadily shrinking, says more than any single reading. Org-wide mix and trends live on the Work page; per-person category breakdowns are in the Work → People tab.
Roadmap alignment as an early signal: a sharp drop usually has a concrete cause—a major incident, a wave of customer escalations, urgent tech debt, or planning that slipped. It’s worth a “what pulled us off-plan?” conversation rather than a verdict on the team.
Common questions
How accurate is the AI categorization?
Generally high—it’s trained on many examples, draws on multiple signals (title, description, labels, code), and is conservative when uncertain. You can click any work item to see its category and reasoning, and admins can override edge cases (the original classification is preserved for audit). Overrides should stay rare exceptions; routine re-tagging undermines the consistency that makes the data trustworthy.
What about mixed or ambiguous work?
Edge cases like “refactor for performance” (KTLO or performance?) or “fix a bug and add a feature” (KTLO or new stuff?) are resolved by primary intent—the AI picks the category reflecting the bulk of the effort. It’s a reasonable default, and the per-item view lets you check anything that looks off.
Is a high KTLO number bad?
Depends on whether it’s a spike or a pattern. A one-month jump from incident response, security patching, or a deliberate debt sprint will normalize next period. The same level sustained across several months points to a systemic quality or debt problem—cross-check the defect rate, and consider a dedicated debt-reduction sprint plus upstream testing to stop bugs at the source.
How Efficiency Is Measured
The five efficiency metrics are defined in What Pensero Measures. This section covers two things. First, the statistical views Pensero offers for the time metrics (P90, P80, median, and average) and why the default is P90. Second, and more usefully, how to read the four time metrics as a sequence to pinpoint exactly where delivery is getting stuck. Efficiency is about finding where time leaks, not pushing people to move faster.
Why P90, and the other statistical views
All four time metrics default to the 90th percentile: 90% of work finishes within the stated time. But Pensero lets you switch the view — P90, P80, Median (P50), or Average — from the dropdown on efficiency charts, and each answers a slightly different question.
Averages get wrecked by a single stalled PR. Four PRs that merge in 2–5 hours plus one stuck for a week average out to ~36 hours, which describes none of them. The median lands at 4 hours, far closer to what engineers actually experience. In a sample this small the stuck PR would still dominate P90, but at realistic volumes, where fewer than one PR in ten stalls, P90 settles near the 5-hour mark as well. P90 strips the legitimate outliers (genuinely complex edge cases) and surfaces the systemic delay you can act on. It's also the DORA-standard approach and far less volatile when comparing teams or periods.
When to reach for each:
P90 (default) — finding systemic bottlenecks, comparing teams or periods, setting SLAs ("90% of PRs merge within a day").
P80 — a slightly tighter read when P90 feels too lenient.
Median (P50) — the typical engineer's experience, when outliers are genuinely exceptional rather than systemic.
Average — overall trend, most useful read alongside P90 to understand the distribution.
The gap between views is itself the signal. A large average-to-P90 gap means a long tail: most work is fast, a few items drag. To tell "occasional outliers" from "everything is slow," compare median to P90: if the median sits far below P90, most PRs are fast and a handful of outliers stretch the tail; if the median is close to P90, most PRs really are slow. That distinction changes the fix: chase the few stuck PRs, or rework the process itself.
A large gap between P90 and average isn’t a problem—it just means the distribution has a long tail. That’s the signal P90 is designed to expose, so you can go fix the tail.
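To see how the views diverge, here is a sketch on a made-up sample of twenty merge times: nineteen quick PRs and one stuck for a week. It uses the nearest-rank definition of a percentile; how Pensero computes percentiles internally isn't stated in this guide, so treat the helper as an assumption.

```python
from statistics import mean, median


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of items at or below it."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100   # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


# Made-up merge times in hours: nineteen quick PRs and one stuck for a week.
hours = [2, 3, 4, 5] * 4 + [3, 4, 5] + [168]

print(mean(hours))             # 11.8
print(median(hours))           # 4.0
print(percentile(hours, 80))   # 5
print(percentile(hours, 90))   # 5
```

The single stalled PR nearly triples the average while the median, P80, and P90 barely notice it.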
The four time metrics, and what sits between them
Each metric measures from PR creation (or ticket start) to a later moment. Read end to end, the gaps between them tell you where the time goes.
Metric | Measures from → to | The gap before it represents |
Time to comment | PR created → first comment | How long until anyone looks at it (attention). |
Time to approve | PR created → first approval | Comment → approve: how long the actual review takes. |
Time to merge | PR created → merge | Approve → merge: CI, merge queue, deployment steps. |
Cycle time | Ticket assigned → PR merge | The full end-to-end, including time-to-start-coding before the PR existed. |
Cycle time needs ticketing (Jira, Linear) with start timestamps and PRs linked to tickets. The three time-to-* metrics and waste rate come from git data alone, so they’re available even without a ticketing integration.
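Because the three time-to-* metrics all start at PR creation, subtracting neighbors gives the time spent in each stage. A minimal sketch for a single PR, with hypothetical stage names and made-up timestamps:

```python
from datetime import datetime


def stage_gaps(created, first_comment, first_approval, merged):
    """Break time-to-merge into the three stages between its milestones (hours)."""
    def hours(start, end):
        return (end - start).total_seconds() / 3600

    return {
        "wait_for_attention": hours(created, first_comment),   # equals time to comment
        "review": hours(first_comment, first_approval),        # comment -> approve
        "post_approval": hours(first_approval, merged),        # approve -> merge
    }


gaps = stage_gaps(
    created=datetime(2024, 3, 4, 9, 0),
    first_comment=datetime(2024, 3, 4, 10, 0),
    first_approval=datetime(2024, 3, 5, 15, 0),
    merged=datetime(2024, 3, 5, 16, 0),
)
print(gaps)  # {'wait_for_attention': 1.0, 'review': 29.0, 'post_approval': 1.0}
```

Here the PR took 31 hours to merge, and 29 of them sat between first comment and approval.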
Diagnosing a bottleneck from the progression
Because the metrics nest, the shape of the progression tells you where to look. Match your numbers to one of these:
Where the jump is | Diagnosis | What to try |
Fast comment, slow approve | Review itself is the bottleneck—PRs get noticed but approval drags. | Add approved reviewers; revisit strict multi-approval policies; check for a single-approver chokepoint. |
Slow comment and approve | PRs sit idle—nobody looks until late. | Review rotation, PR notifications, assign reviewers at creation, budget review time into sprint capacity. |
Fast approve, slow merge | A big approve→merge gap points outside review—CI, merge queue, or deployment. | Profile and parallelize CI, check merge-queue policy, automate manual deploy steps. |
Fast throughout, low waste | Healthy process—maintain it. | No action needed. |
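The table can be read as a simple decision rule. The sketch below is one way to express it; the 24-hour cut-off for "slow" is a hypothetical threshold, and waste rate is left out for brevity.

```python
def diagnose(time_to_comment, time_to_approve, time_to_merge, slow_hours=24):
    """Inputs in hours, each measured from PR creation. slow_hours is a made-up cut-off."""
    review_gap = time_to_approve - time_to_comment   # comment -> approve
    merge_gap = time_to_merge - time_to_approve      # approve -> merge
    if time_to_comment > slow_hours:
        return "PRs sit idle: nobody looks until late"
    if review_gap > slow_hours:
        return "review itself is the bottleneck"
    if merge_gap > slow_hours:
        return "delay is after approval: CI, merge queue, or deployment"
    return "healthy progression"


print(diagnose(1, 30, 31))   # review itself is the bottleneck
print(diagnose(1, 3, 40))    # delay is after approval: CI, merge queue, or deployment
```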
Waste rate — work that never shipped
Waste rate is the share of PRs closed without merging—code written but never delivered: abandoned experiments, wrong-direction work after requirements shifted, duplicates, and stale PRs nobody reviewed. Open PRs (still in progress) and merged PRs don’t count, and merged-then-rewritten work is rework, which lives in the quality metrics, not here.
It measures planning quality, not code quality. Some waste is healthy—experiments fail, prototypes get discarded, startups pivot on feedback. The question to ask is why PRs are closing: intentional learning, or dysfunction (premature starts on unclear requirements, duplicated work from poor communication, forgotten stale PRs)? An innovation team running many experiments will and should waste more than a mature product team with clear requirements.
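As a sketch of the arithmetic: the numerator is PRs closed without merging, and this example assumes the denominator is all finished PRs (merged plus closed-unmerged), with open PRs left out entirely. The exact denominator Pensero uses isn't spelled out here, so treat that as an assumption.

```python
def waste_rate(prs):
    """prs: list of dicts with 'state' in {'open', 'merged', 'closed'}."""
    merged = sum(p["state"] == "merged" for p in prs)
    closed_unmerged = sum(p["state"] == "closed" for p in prs)
    finished = merged + closed_unmerged   # open PRs are still in progress
    return 100 * closed_unmerged / finished if finished else None


prs = (
    [{"state": "merged"}] * 8
    + [{"state": "closed"}] * 2
    + [{"state": "open"}] * 3
)
print(waste_rate(prs))  # 20.0
```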
Reading efficiency in context
Compare like with like—team type sets the baseline. Product teams move fastest and waste more (simple changes, heavy experimentation). Platform teams are slower by nature (complex changes that affect many teams need careful review) and waste less (more planned work). Infrastructure / SRE sits in between, with careful review because production impact is high. Comparing a platform team’s cycle time to a product team’s and concluding the platform team is “slow” is the classic misread.
Org-wide P90s with trend indicators are on the Efficiency page; per-person times are in the Efficiency → People tab. As everywhere, the trend matters more than the absolute—pick one bottleneck, change one thing, and watch whether the number moves.
Common questions
Can we improve speed without hurting quality?
Yes—the two aren’t opposed. The efficiency gains that help quality are exactly the ones to pursue: faster reviews catch issues sooner, clear requirements cut rework, parallelized CI speeds merges without risk, and better communication prevents duplicate work. What you don’t do is buy speed by skipping review, cutting coverage, or merging without approval—that just moves the cost into the quality metrics.
What if we don’t use tickets?
You lose cycle time (it needs ticket start timestamps), but the three time-to-* metrics and waste rate still work from git data alone. GitHub Issues/Projects can stand in if configured; ad-hoc spreadsheets won’t integrate.
How AI Impact Is Measured
The AI Impact metrics are defined in What Pensero Measures. This section covers where the data comes from and how to read adoption against effectiveness. The core idea: track both whether people are using AI tools and whether the usage is actually helping—high adoption with poor results means training, not celebration.
Which tools are tracked, and how
Pensero integrates with the major AI coding assistants—Cursor, GitHub Copilot, Claude Code, and Gemini Code Assist—via OAuth or API keys. It pulls usage daily (lines generated, tokens, cost), attributes that usage to specific commits and PRs, and rolls it up. This requires admin access to the AI-tool accounts, credentials configured in Pensero, and each engineer’s AI-tool account linked to their Pensero profile. Without those links, usage can’t be attributed.
Adoption vs. effectiveness
The metrics split into two questions. Adoption (are people using AI?) is captured by user adoption (the share of active engineers using AI at all) and AI-assisted % (the share of merged lines that were AI-generated). Effectiveness and cost (is it worth it?) are captured by tokens per delivery (efficiency, lower is better), AI cost (total spend), and delivery lift (output change versus a prior period). Read together, they tell you whether spend is translating into reach and into output.
Reading | What it suggests |
High adoption, rising delivery lift | Tools are landing—reach and output both moving. |
High cost, low AI-assisted % | Paying for capacity that isn’t being used—check access, training, unused licenses. |
High AI-assisted %, climbing tokens per delivery | Usage is getting less efficient—over-long prompts, repeated retries, or AI used where it isn’t needed. |
Low user adoption after many months | Barriers, not early days—access, awareness, or cultural resistance. |
How the key figures are derived
Each AI tool reports lines generated per user per day; Pensero attributes those to commits, then PRs, then the org. AI-assisted % is AI-generated lines over total merged lines. Tokens per delivery is total tokens consumed over delivery points—a token being roughly four characters of prompt-in or code-out, so longer prompts and more retries cost more. User adoption counts active engineers (those who merged at least one PR) with any AI usage. Delivery lift compares this period’s delivery to a prior period’s.
Treat delivery lift carefully: correlation isn’t causation. It shows output moved during AI adoption, but headcount changes, simpler work, or seasonality can all contribute. Use it as a signal to investigate, not as proof on its own—and compare like with like (same team, same kind of work) before drawing conclusions.
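The derivations above reduce to four ratios. The sketch below works them through on made-up numbers; the parameter names are hypothetical, and expressing delivery lift as a percentage change against the prior period is an assumption about how the comparison is reported.

```python
def ai_metrics(ai_lines, merged_lines, tokens, delivery_points,
               active_engineers, ai_users, prior_delivery_points):
    """active_engineers and ai_users are sets of engineer IDs."""
    return {
        "ai_assisted_pct": 100 * ai_lines / merged_lines,
        "tokens_per_delivery": tokens / delivery_points,
        # Only engineers who merged at least one PR count as active.
        "user_adoption_pct": 100 * len(ai_users & active_engineers) / len(active_engineers),
        "delivery_lift_pct": 100 * (delivery_points - prior_delivery_points) / prior_delivery_points,
    }


metrics = ai_metrics(
    ai_lines=4_000, merged_lines=10_000,
    tokens=2_000_000, delivery_points=500,
    active_engineers={"a", "b", "c", "d"}, ai_users={"a", "b", "e"},
    prior_delivery_points=400,
)
print(metrics)
```

Engineer "e" used AI but merged nothing in the period, so they don't count toward adoption in this sketch.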
Reading AI metrics in context
AI-assisted % varies legitimately by work and person—UI and greenfield code lean higher, complex algorithms and legacy refactors lower; juniors often lean on it more than seniors. Adoption also takes time to build, so a low number in the first months is early days, while the same number a year in signals a barrier worth investigating. Org-wide AI metrics and trends live on the AI Impact page; per-person figures are in the AI Impact → People tab.
Common questions
Is high AI usage always good?
Not on its own. Usage is a means, not the goal—what matters is whether it produces quality output efficiently. Pair AI-assisted % with the efficiency and quality signals: high usage alongside climbing tokens per delivery or rising defects means someone is leaning on AI without getting clean, efficient results. The move there is coaching, not a usage target.
Can we cut cost without losing value?
Usually yes. Tokens per delivery surfaces inefficient usage—whoever’s burning far more than the team norm is often getting poor suggestions and retrying, which prompt-engineering coaching fixes. AI isn’t the right tool for every task (simple edits and boilerplate rarely need it), and user adoption reveals paid-but-unused licenses to reclaim. High aggregate usage is also leverage for volume pricing.
What if AI makes engineers over-reliant?
Watch for high usage paired with degrading quality signals—more reverts or rework. Healthy AI use looks like an engineer who still understands and can explain their code, reviews suggestions critically, and uses AI to accelerate the obvious rather than to skip the thinking. The cultural framing that works: AI accelerates the routine so you can spend judgment on the hard parts—it doesn’t replace the judgment.
Comparison Tools: Benchmark & Calibrate
Two executive-only pages answer different comparison questions. Benchmark asks “how are we doing versus the industry, over time?” Calibrate asks “how do these groups compare to each other, right now?” The quick rule: Benchmark for trends and board context, Calibrate for talent and resourcing decisions.
Both surface metrics already defined in What Pensero Measures. One naming note: Benchmark and Calibrate label new-feature investment “Innovation rate,” which is the same metric the glossary calls New stuff %.
Benchmark — you vs. the industry
Benchmark (Organization › Benchmark) plots your org against the anonymized median of all Pensero organizations, on a 0–100 percentile scale: 50th is the industry median, 75th means you’re ahead of three-quarters of organizations, 25th puts you in the bottom quarter. It shows 26 weeks of history per metric as a trend line, so you see direction as well as position. Higher is better for most metrics—but defect rate, cycle time, and knowledge gaps invert (lower ranks higher). The summary view lists all ten metrics with current percentiles; clicking any one jumps to its detail page.
It covers ten metrics: delivery per headcount, defect rate, AI-assisted code, collaboration, innovation rate (new stuff %), roadmap alignment, cycle time, capitalizable %, talent density, and knowledge gaps. The comparison is against all Pensero organizations regardless of size, industry, or stage—there’s no segment filter yet, which is worth remembering before reading too much into a single percentile.
Reading it: a percentile is a question, not a verdict. Below the median can be entirely appropriate—foundational investment that depresses short-term delivery, a ramping team, or a deliberate quality-over-speed posture. Use it to ask what’s driving the position.
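For intuition about the scale and the inverted metrics, here is a minimal sketch. It defines a percentile as the share of peer organizations your value beats, and flips the sign for lower-is-better metrics. That definition and the metric keys are illustrative assumptions, not a description of Pensero's ranking method.

```python
from bisect import bisect_left

LOWER_IS_BETTER = {"defect_rate", "cycle_time", "knowledge_gaps"}


def percentile_rank(metric, value, peer_values):
    """Share (0-100) of peer orgs this value beats."""
    if metric in LOWER_IS_BETTER:
        # Negate so that a smaller raw value ranks higher.
        value, peer_values = -value, [-v for v in peer_values]
    ordered = sorted(peer_values)
    return 100 * bisect_left(ordered, value) / len(ordered)


peers = [10, 20, 30, 40]
print(percentile_rank("delivery_per_hc", 35, peers))  # 75.0
print(percentile_rank("cycle_time", 15, peers))       # 75.0
```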
Calibrate — groups side by side
Calibrate (Organization › Calibrate) is a matrix: metrics down the rows, comparison groups across the columns, each cell color-coded. Two columns are always present—Company (your whole org) and Industry (the Pensero median)—and you add up to ten custom columns for any mix of individuals, teams, saved cohorts, or filters (e.g. “backend developers hired in 2025”). It’s built for talent-calibration sessions and resourcing decisions.
The color coding is relative to both baselines at once, which is what makes it readable at a glance:
Cell color | Meaning |
Dark green | Better than both company and industry—excelling; worth understanding what they do differently. |
Light green | Better than company but below industry—strong internally, room to reach industry-leading. |
Light red | Below company but above industry—your org is high-performing overall; this group lags it. |
Dark red | Below both—investigate: blockers, priorities, or training need. |
Gray | Not enough data to score (low activity, empty group, or metric not applicable). |
Calibrate shows eleven metrics: the Benchmark ten, with AI user adoption in place of AI-assisted code, plus active headcount (shown for context, not color-coded). Add columns via “Add cohorts,” hover any cell to see the underlying value and highlight its row and column, and remove columns from their header. Custom columns are capped at ten; to compare more groups, swap columns across sessions or compare cohorts instead of individuals. Note that the matrix can’t be exported yet, so screenshot or copy values to share.
Color is relative, so context still governs. A dark-red cell can be exactly right—a newly-formed team ramping, or a group deliberately on harder problems. Read it as “where should I look,” not “who’s failing.”
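The color table amounts to two comparisons per cell. A sketch of that logic follows; how ties are handled and how "not enough data" is represented are assumptions made for the example.

```python
def cell_color(group, company, industry, lower_is_better=False):
    """Color a Calibrate cell from a group's value and the two baselines."""
    if group is None:
        return "gray"   # not enough data to score
    if lower_is_better:
        group, company, industry = -group, -company, -industry
    beats_company = group > company
    beats_industry = group > industry
    if beats_company and beats_industry:
        return "dark green"
    if beats_company:
        return "light green"
    if beats_industry:
        return "light red"
    return "dark red"


print(cell_color(65, company=70, industry=60))   # light red
print(cell_color(2, 3, 4, lower_is_better=True)) # dark green
```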
Which to use
If you need to… | Use |
Track org trends over time | Benchmark |
Compare to the industry | Benchmark (Calibrate has it as a reference column) |
Compare teams or individuals side by side | Calibrate |
Run a talent-calibration session | Calibrate |
Build a board presentation | Benchmark |
Make a resource-allocation decision | Calibrate |
Do historical analysis | Benchmark |
Applied Scenarios
This is the section to reach for when you’re facing an actual situation. Every scenario below makes the same point in a different way: no single metric tells you what’s going on. A number that looks alarming on its own usually becomes legible—or flips meaning entirely—once you read it alongside two or three others. Each case shows the misleading single-metric read first, then the fuller picture, then what to do.
The “low performer” who is actually the team’s glue
Situation: An engineer’s delivery is well below the team’s. On a delivery-only view they look like an underperformer, and a quarterly ranking would put them at the bottom.
The single-metric trap: Reading delivery alone, you’d coach them to “ship more”—or worse, flag them in a review.
The fuller picture: Pull their collaboration and quality signals next to delivery. Their collaboration ratio is high, PR review usefulness is among the team’s best, and knowledge gaps in the areas they touch are low. On the delivery-vs-reviews quadrant they sit in enabler—low delivery, high enablement. They’re the person unblocking everyone else, reviewing the hardest PRs, and spreading knowledge that keeps the bus-factor down. Their low personal delivery is the cost of lifting everyone else’s.
Action: Recognize the enablement explicitly—it’s real work the delivery number doesn’t capture. The risk here isn’t under-performance; it’s that this person is invisible to a metrics-naive review and may burn out or leave. If anything, check the inverse: is the team leaning on them so heavily that their own growth is stalling?
The high deliverer who is quietly a risk
Situation: An engineer tops the delivery chart, quarter after quarter. The obvious read is “star—promote and replicate.”
The single-metric trap: Delivery alone says star performer. But output is only one dimension.
The fuller picture: Their collaboration ratio is near zero, PR pairing rate is negligible, and the knowledge gaps metric shows they’re the sole contributor across a large share of the codebase. They’re a solo contributor on the quadrant: shipping fast, sharing nothing. If their defect or revert rate is also creeping up, the speed may be coming at quality’s expense. The high delivery is real—but it’s concentrating critical knowledge in one person and building succession risk.
Action: Don’t punish the output—redirect some of it. Pair them with others, route reviews through them so knowledge spreads, and make enablement an explicit expectation if they’re senior. The single-metric celebration would have deepened exactly the risk you most want to avoid.
The team working hard but not shipping
Situation: A team is putting in long hours but output is flat, and leadership is asking whether they’re productive enough.
The single-metric trap: Low delivery per head looks like a performance or staffing problem—the instinct is to push harder or question headcount.
The fuller picture: Efficiency tells a different story. Cycle time and time to merge are high, and the bottleneck progression points to review: fast time-to-comment but slow time-to-approve. The PR review ratio shows two people carrying nearly all reviews.
The team isn’t unproductive—its work is stuck in a review queue because review load is concentrated on a couple of engineers. Delivery is low because finished code is waiting, not because people aren’t working.
Action: Fix the process, not the people: spread reviewer load, set a review-response expectation, and watch cycle time and time-to-approve fall. Pushing the team to “work harder” would have made the real bottleneck worse.
A quality regression with a hidden cause
Situation: Customers are reporting more bugs and support is escalating. You need to know the scope and the cause quickly.
The single-metric trap: Defect rate is up—so the easy conclusion is “engineers got sloppy,” which invites blame and fixes nothing.
The fuller picture: Triangulate. Defect rate has jumped, but isolate it by team and it’s concentrated in one group. There, code coverage has dropped and PR review usefulness has fallen—reviews stopped catching issues. Cross-reference scope of work: that team’s KTLO % spiked and roadmap alignment dropped, meaning they were buried in reactive work. And the People view shows two senior engineers were out, leaving less-experienced members merging unreviewed code. The regression isn’t sloppiness—it’s a team overwhelmed by maintenance with its safety nets temporarily down.
Action: Treat the cause, not the symptom: pause feature pressure on that team, bring in review support while the seniors are out, restore a coverage expectation. The defect-rate number was the smoke; the cause only appeared when you read coverage, reviews, scope, and staffing together.
Defending (or questioning) the AI tool spend
Situation: Finance wants to know whether the AI coding tools are worth the spend.
The single-metric trap: Pointing only at high AI-assisted % proves usage, not value—and pointing only at delivery lift invites the fair objection that other things changed too.
The fuller picture: Read adoption, effectiveness, and quality together. User adoption and AI-assisted % show the tools are actually used, not shelfware. Tokens per delivery shows whether that usage is efficient. Crucially, check whether defect and revert rates held steady as AI usage rose—adoption with stable quality is the real signal that the tools help rather than just generate more code to clean up. Delivery lift is supporting evidence, treated as correlation, compared like-for-like.
Action: Bring the combination, not one headline number, and state the honest caveat on delivery lift. If usage is high but tokens-per-delivery is poor or quality slipped, that’s a coaching-and-optimization finding, not a cancel-the-tools finding. (Note: this guide deliberately avoids a single ROI multiplier—the metric combination is more honest and more durable than a manufactured payback figure.)
The through-line
In every case, the single metric pointed one way and the truth lay in the combination. Build the habit: when a number surprises you, don’t act on it—ask which two or three other metrics would confirm or overturn the obvious reading. Delivery next to collaboration. Defect rate next to coverage, reviews, scope, and staffing. AI usage next to quality. The metric tells you where to look; the combination tells you what’s true; and the conversation with the person tells you why.