Contents

1. What Pensero Is

2. Navigating Pensero

3. What Pensero Measures

4. How Delivery Is Scored

5. How Quality Is Measured

6. How Reviews & Collaboration Are Measured

7. How Scope of Work Is Categorized

8. How Efficiency Is Measured

9. How AI Impact Is Measured

10. Comparison Tools: Benchmark & Calibrate

11. Applied Scenarios

What Pensero Is

Pensero is an AI-native engineering performance platform. It automatically collects data from your engineering tools—Git providers, ticketing systems, docs, AI coding tools, coverage, and calendars—and turns it into real-time, explained metrics across your whole organization. No manual data gathering, no spreadsheet wrangling.

Where most tools show one dimension (commits, or tickets, or cycle time), Pensero integrates them into a single picture and explains what changed and why, rather than leaving you to investigate across five dashboards.

The questions it answers

Every metric maps to a leadership conversation. The ones you’ll reach for most:

Conversation	Metrics that drive it
Are we delivering enough?	Delivery per HC, Active headcount
Is code quality declining?	Defect rate, Rework rate, Code coverage
Why are projects taking so long?	Cycle time, Time to merge, Waste rate
Is our AI investment working?	AI-assisted %, Delivery lift, Tokens per delivery
Are we too siloed?	Collaboration ratio, Knowledge gaps, PR pairing rate
Are we strategic or reactive?	Roadmap alignment, New stuff %, KTLO %
Do we have succession risks?	Knowledge gaps, Talent density
How do we compare to others?	Benchmark percentiles across all metrics

Reading metrics in context

Numbers without context mislead. The same figure can be healthy or alarming depending on the team: 30% KTLO work may be fine for a platform team, concerning for a product team, and critical at an early-stage startup. Pensero supplies that context through benchmarks by company stage and team type, historical trends, and pattern recognition. The What Pensero Measures section later in this guide spells out which direction is better for each metric and the healthy-vs-warning combinations to watch.

Principle: data informs decisions; it doesn’t make them. Use it to drive supportive conversations and systemic improvements, never to police individuals.

Navigating Pensero

Three controls do most of the work: the date range (what period), the scope filter (which people), and the navigation menu (which metric). Set the first two and every page obeys them. This section is the how-to for getting to the right view fast.

Date range

The selector sits at the top of every metrics page: arrows step backward and forward by the current period, and the center dropdown changes the period type—Week (Mon–Sun), Month, Quarter, or an organizational Cycle if configured. For ad-hoc windows, pick “Days to date” (a rolling last-N-days), “Months to date,” or a custom start/end (up to 365 days). It persists as you move between pages, survives refreshes, and lives in the URL—so you can bookmark or share a specific period. All dates use your org’s timezone, and most charts automatically show “vs. previous period” (the same-length window immediately before).

“Week” vs. “Days to date: 7”: Week is a fixed Mon–Sun you can step between; Days-to-date is a rolling last-7-days that always tracks today. Use Week to compare week over week, Days-to-date for an always-current snapshot.
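To make the difference concrete, here is a minimal sketch of the two window types and the "vs. previous period" window. The function names are illustrative, not Pensero's, and it assumes a rolling window includes today, which the product may define differently.

```python
from datetime import date, timedelta


def fixed_week(day: date) -> tuple[date, date]:
    """Return the Mon-Sun calendar week that contains `day`."""
    monday = day - timedelta(days=day.weekday())  # weekday(): Monday == 0
    return monday, monday + timedelta(days=6)


def days_to_date(today: date, n: int) -> tuple[date, date]:
    """Return a rolling window of the last `n` days ending on `today` (assumed inclusive)."""
    return today - timedelta(days=n - 1), today


def previous_period(start: date, end: date) -> tuple[date, date]:
    """Return the same-length window immediately before [start, end]."""
    length = (end - start).days + 1
    return start - timedelta(days=length), start - timedelta(days=1)


today = date(2024, 3, 14)  # a Thursday
print(fixed_week(today))       # Monday 11 March to Sunday 17 March
print(days_to_date(today, 7))  # Friday 8 March to Thursday 14 March
print(previous_period(*days_to_date(today, 7)))  # 1 March to 7 March
```

The fixed week stays put until you step the arrows; the rolling window moves forward every day.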

Scope filter — managers and executives

The filter (funnel icon, left nav) scopes all content to specific people. Managers see their reporting line; executives see the whole org; individual contributors don’t have it and see only their own data. Open it and pick from four tabs—People, Teams, Cohorts, or a unified All search—then Apply, and every chart and table updates. The button reads “All N people” when clear and “N people matched” when active.

The filter persists across pages and sessions (stored ~30 days, per user, per device—so it doesn’t sync across machines and resets if you clear your browser or use incognito). “Clear selection” inside the modal removes it. One sharing caveat worth knowing: the date range travels in the URL, but the filter does not—to give a colleague the same scoped view, save the cohort as Organization-scoped and share its name.

Cohorts — reusable groups

A cohort is a saved group defined by rules—“Senior ICs,” “Backend engineers on product teams,” “new hires under six months.” Build it once and it re-evaluates automatically as people’s attributes change. In the Cohorts tab, “Add cohort” opens a builder: conditions within a rule group are AND-ed (role = IC and level in senior/staff), and separate rule groups are OR-ed (one group or another). You can also always-include or always-exclude named people regardless of rules. Attributes include people, teams, role, level, technology, location, tenure, employment type, and work mode. A live preview shows who matches as you build.
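The AND-within-a-group, OR-across-groups logic can be sketched as below. The class and field names are hypothetical, and the sketch assumes always-exclude wins over always-include when a person is on both lists, which the product does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    name: str
    attrs: dict[str, str]


@dataclass
class Cohort:
    # Each rule group maps an attribute to the set of values it accepts.
    rule_groups: list[dict[str, set[str]]]
    always_include: set[str] = field(default_factory=set)
    always_exclude: set[str] = field(default_factory=set)

    def matches(self, person: Person) -> bool:
        if person.name in self.always_exclude:  # assumption: exclude takes priority
            return False
        if person.name in self.always_include:
            return True
        # OR across rule groups, AND across the conditions inside one group.
        return any(
            all(person.attrs.get(attr) in allowed for attr, allowed in group.items())
            for group in self.rule_groups
        )


senior_ics = Cohort(
    rule_groups=[{"role": {"IC"}, "level": {"senior", "staff"}}],
    always_include={"Dana"},
)
people = [
    Person("Alice", {"role": "IC", "level": "senior"}),
    Person("Bob", {"role": "IC", "level": "junior"}),
    Person("Dana", {"role": "manager", "level": "senior"}),
]
print([p.name for p in people if senior_ics.matches(p)])  # ['Alice', 'Dana']
```

Because membership is computed from attributes each time, promoting Bob to senior would bring him into the cohort with no edit to the cohort itself.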

Save it as Personal (only you) or Organization (everyone can apply it, read-only unless they own it). Cohorts are the time-saver for anything recurring—two minutes to build, reused across every review thereafter.

Finding your way around

The left sidebar groups pages into Intelligence (Signals, Delivery, Quality, Efficiency, Reviews, AI Impact, Scope of Work, Talent), Organization (People, Work, Calibrate for execs, Benchmark), and Settings. Most metric pages share a shape: a Summary tab with hero cards (a headline number, a trend arrow, and “vs. previous period”) and charts, a People tab with the per-person table, and detail tabs for specific work types (PRs, tickets, documents). Breadcrumbs at the top (e.g. Delivery › People › Alice) click back to any level.

Reading the visuals: a green trend arrow means moving in the good direction for that metric, red the bad, gray no change—so “down” is green for defect rate. Quadrant scatter plots put one metric on each axis with a dot per person; top-right is strong on both, bottom-left needs a look, the off-diagonals are trade-offs. Hover any chart for exact values; click a dot to jump to that person.

The People table

The People table is the per-person view of every metric, reachable from each section’s People tab and from Organization › People. It respects the active filter and date range, so it always reflects the group and period you’ve set. Sort any column to find the top or bottom of a metric, search by name, click a row to drill into that person’s detail page, or export to CSV (top-right) for spreadsheet analysis—the export includes every column, even those hidden in the UI, plus an export timestamp.

Which columns appear, the default sort, and available groupings are set org-wide in Settings › Organization settings, not per person. Managers see their reporting line; executives see everyone; an IC sees only themselves.

ICs get their own self-view: the same metrics about their own work, plus an anonymized comparison—percentile standing against their level, team, or company (“higher than 73% of level-3 engineers”) and quadrant charts where only their own dot is named. They cannot see other individuals’ data or the People table. Worth knowing when an engineer references their own numbers, though it isn’t a manager-facing surface.

Practical habits

A few patterns pay off: start broad (all people, full team), spot the outlier, then filter in to investigate. Pair a date range with a filter for focused questions—“backend team, last quarter” is one period and one cohort away. Bookmark the views you return to (the URL carries page, tab, and date range). To compare two groups side by side, use the Calibrate page rather than flipping filters. And if a number looks stale, a hard refresh (Ctrl/Cmd-Shift-R) forces fresh data.

What Pensero Measures

31 metrics across 8 categories. Each table gives the definition and which direction is better. Direction is a guide, not a goal—read every metric against the context above and the patterns at the end of this section.

Delivery — output and capacity

Metric	What it measures	Direction
Total delivery	Story points completed in a period. Absolute output volume; use for capacity and workload.	↑ higher
Delivery per headcount	Average weekly delivery per active engineer. Normalizes output across team sizes.	↑ higher
Active headcount	Engineers with completed work in the period. Team size, not performance.	—

Quality — code quality and technical excellence

Metric	What it measures	Direction
Defect rate	Share of delivery spent fixing bugs the team introduced. Direct quality signal.	↓ lower
Rework rate	Share of delivery later rewritten. Churn, shifting requirements, or debt.	↓ lower
Revert rate	Share of delivery rolled back. Production stability and risk.	↓ lower
Duplicate code rate	Share of delivery with duplicated patterns. Debt and maintainability.	↓ lower
Code coverage	Share of code with automated tests. Testing discipline. Needs coverage integration.	↑ higher

Efficiency — speed and bottlenecks

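All time-based efficiency metrics use P90 (the 90th percentile): the value that 90% of items fall at or below. That surfaces the slow tail where bottlenecks live without letting a handful of extreme outliers set the number.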

Metric	What it measures	Direction
Cycle time	P90 ticket-assignment to PR-merge. End-to-end delivery speed. Needs ticketing integration.	↓ lower
Time to merge	P90 PR-creation to merge. Review-process speed after code is written.	↓ lower
Time to approve	P90 PR-creation to first approval. Review responsiveness.	↓ lower
Time to comment	P90 PR-creation to first comment. Earliest responsiveness signal.	↓ lower
Waste rate	Share of PRs closed without ever merging—abandoned, superseded, or stale work.	↓ lower
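For intuition on what a P90 figure means, here is a small sketch using the nearest-rank definition of a percentile. Which percentile method Pensero actually uses (nearest-rank or interpolated) is not stated, so treat the choice here as an assumption.

```python
def p90(durations_hours: list[float]) -> float:
    """Nearest-rank 90th percentile: smallest value with at least 90% of items at or below it."""
    if not durations_hours:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_hours)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n), 1-based
    return ordered[rank - 1]


# Hypothetical time-to-merge values in hours, including one extreme straggler.
merge_times = [2, 3, 3, 4, 5, 6, 8, 12, 30, 400]
print(p90(merge_times))  # 30
```

Nine of the ten PRs merged within 30 hours, so P90 reads 30; the 400-hour straggler does not move it.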

AI Impact — adoption and effectiveness

Requires AI tool integration (Cursor, GitHub Copilot, etc.).

Metric	What it measures	Direction
AI-assisted %	Share of code lines written with AI assistance. Adoption and leverage.	↑ higher
AI cost	Total spend on AI coding tools, and per-engineer. Read against output and quality gains.	context
Tokens per delivery	AI tokens consumed per delivery point. Usage efficiency.	↓ lower
Delivery lift	Output multiplier vs. a prior period (e.g. 1.4×). Productivity change alongside AI use.	↑ higher
User adoption	Share of active engineers using AI tools at all. Org-wide reach.	↑ higher

Collaboration (Reviews) — review quality and knowledge sharing

Metric	What it measures	Direction
Collaboration ratio	Share of delivery on enablement (reviews, pairing, mentoring). Balance of output vs helping.	balance
PR review ratio	Review effort vs creation effort. Review coverage across the team.	balance
PR review usefulness	Share of review comments authors mark useful. Feedback quality.	↑ higher
PR addressed rate	Share of review comments that led to changes. Iterative development.	↑ higher
PR pairing rate	Share of PRs with multiple authors. Pairing and knowledge sharing.	↑ higher

Scope of Work — investment mix

Requires ticketing integration with work-type labels. Targets shift heavily by company stage.

Metric	What it measures	Direction
Roadmap alignment	Share of delivery mapped to roadmap/epics. Strategic vs ad-hoc execution.	↑ higher
New stuff %	Share of delivery on net-new features. Innovation investment.	context
Improvement %	Share of delivery on enhancements to existing features. Product polish.	balance
KTLO %	Keeping-the-lights-on: maintenance, bug fixes, tech debt.	↓ lower
Performance %	Share of delivery on performance and scalability.	context

Talent — composition and knowledge risk

Metric	What it measures	Direction
Talent density	Share of team who are high performers (“get things done”). Top 20% often drive 50%+ of output.	↑ higher
Knowledge gaps	Share of codebase with only 1–2 experts. Bus-factor and succession risk.	↓ lower

Financial — accounting and compliance

Metric	What it measures	Direction
Capitalizable %	Share of delivery qualifying for capitalization under accounting rules. New development typically qualifies; maintenance doesn’t.	—

Reading metrics together

No single metric tells the whole story. The combinations below are the ones worth recognizing on sight.

Healthy patterns

High delivery per HC + low defect rate — productive and quality-conscious.
High collaboration ratio + high delivery — strong enablement without sacrificing output.
High roadmap alignment + balanced scope mix — strategic execution, healthy portfolio.
High AI adoption + high delivery lift — AI tools are actually driving productivity.

Warning patterns

Low cycle time + high waste rate — merges are fast, but a lot of work is started and then abandoned—a planning or requirements signal.
High KTLO % + low new stuff % — debt burden is crowding out innovation.
Low collaboration ratio + high knowledge gaps — siloing is building succession risk.
High defect rate + low code coverage — testing discipline has slipped.
High delivery + high revert rate — speed is winning over stability.

Categories at a glance

Category	Metrics	Primary use	Data required
Delivery	3	Productivity, capacity planning	Git data
Quality	5	Code quality, technical excellence	Git + optional coverage
Efficiency	5	Process speed, bottlenecks	Git + optional ticketing
AI Impact	5	AI adoption and effectiveness	AI tool integration
Collaboration	5	Knowledge sharing, reviews	Git data
Scope of Work	5	Investment mix, strategic alignment	Ticketing integration
Talent	2	Composition, knowledge risk	Git data
Financial	1	Accounting and compliance	Ticketing integration

How Delivery Is Scored

Delivery is the metric most conversations hinge on, so it helps to know how it’s built. The short version: every contribution is scored as magnitude × complexity, boilerplate is filtered out first, penalties adjust for work that was undone or duplicated, and scores roll up cleanly from individual to org. Understanding this lets you interpret a score correctly and explain it to an engineer.

The formula

Delivery score = Magnitude × Complexity

Magnitude = how big the change is (size). Complexity = how hard it was (difficulty).

Complexity measures the task, not the person. A junior engineer on an architectural change earns high complexity; a staff engineer fixing a typo earns low complexity. The system credits what was built, not who built it.

Magnitude — size of the change

A non-linear T-shirt scale. A Medium isn’t twice a Small—bigger changes have disproportionate impact. Maximum is XL regardless of line count, which caps any attempt to inflate by volume.

Size	Value	Typical meaning
Zero	0.0	No relevant changes (closed PR, machine-generated code)
XXS	0.15	Single-line bug fix
XS	0.5	A few lines; documentation typo
S	1.0	Minor fix or comment updates (under ~50 lines)
M	3.0	Small feature or refactor (~50–200 lines)
L	5.0	Significant feature (~200–1,000 lines)
XL	8.0	Major feature or architecture change (over ~1,000 lines)

Complexity — difficulty of the work

Scored 1.0–3.0 by an AI assessment across five competencies, not by gut feel. It reflects the difficulty demonstrated in the work itself.

Level	Range	Meaning
Level 1	1.0–1.5	Junior work: follows existing patterns, basic implementation
Level 2	1.5–2.5	Mid–senior work: independent feature, cross-team coordination
Level 3	2.5–3.0	Staff+ work: architectural decisions, org-wide impact

The five competencies: Code (implementation quality), System Design (architecture and scale), Delivery (execution and coordination), Communications (collaboration and docs), and Ownership (decision-making and initiative). Not every competency applies to every artifact—a document has no Code score. The AI assesses a level per applicable competency, weights them by the contributor’s career level, and sums to a single complexity figure.

Worked examples

Same formula, very different shapes of work:

Architectural decision — small code, high difficulty

45 lines changing DB connection pooling, org-wide blast radius.

Magnitude S (1.0) × Complexity 3.0 = 3.0 points

Boilerplate-heavy feature — large code, low difficulty

800 lines adding endpoints across 20 files, following an established pattern.

Magnitude L (5.0) × Complexity 1.0 = 5.0 points

Critical refactor — large and hard

1,200 lines refactoring authentication, security implications, multi-team rollout.

Magnitude XL (8.0) × Complexity 3.0 = 24.0 points

How complexity sums (one PR, IC4 weights): Code L2×0.25 + System Design L3×0.30 + Delivery L2×0.20 + Communications L2×0.15 + Ownership L2×0.10 = 2.3. At magnitude L (5.0), delivery = 11.5 points.
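The arithmetic above can be reproduced in a few lines. This is a sketch only: the magnitude values and IC4 weights come from this section, while the function names and dictionary keys are illustrative, and it does not model how weights are rebalanced when a competency does not apply.

```python
# Magnitude values from the T-shirt scale in this section.
MAGNITUDE = {"Zero": 0.0, "XXS": 0.15, "XS": 0.5, "S": 1.0, "M": 3.0, "L": 5.0, "XL": 8.0}


def complexity(levels: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted sum of the per-competency levels."""
    return sum(levels[c] * weights[c] for c in levels)


def delivery_score(size: str, complexity_value: float) -> float:
    """Delivery score = magnitude x complexity."""
    return MAGNITUDE[size] * complexity_value


ic4_weights = {"code": 0.25, "system_design": 0.30, "delivery": 0.20,
               "communications": 0.15, "ownership": 0.10}
levels = {"code": 2, "system_design": 3, "delivery": 2, "communications": 2, "ownership": 2}

c = complexity(levels, ic4_weights)
print(round(c, 2), round(delivery_score("L", c), 2))  # 2.3 11.5
print(delivery_score("XL", 3.0))                      # 24.0
```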

What counts as work

Pensero scores more than code, so enablement work is visible rather than invisible. Each artifact gets its own magnitude × complexity score.

Artifact	What it captures
Pull request	Code contributions — features, fixes, refactors. Scored only when merged.
Trunk-based commits	Direct commits to main, grouped by author + day + repo into one unit.
PR review	Code-review effort: comments, approvals, review discussion.
Ticket / ticket review	Jira / Linear work items and reviews on them.
Document / doc review	Design docs, RFCs, architecture writing and feedback.
Communication	Technical discussion and Q&A (requires integration).

Why it matters: senior engineers get credit for reviews and mentorship, technical writers show up in delivery, and collaboration is rewarded rather than penalized. Open PRs are tracked for visibility but don’t score until merged.

How each artifact is scored

Each artifact type has its own rules for when it's scored and how credit is split when several people contribute. The differences matter when you're explaining a number to someone.

Pull requests score on merge, never while open: the delivery date is the merge date, not when the PR was opened. A PR closed without merging gets Zero magnitude. When a PR has multiple authors, Pensero splits it by git blame: it works out who authored each changed line, converts that to each person's share of the total lines changed, and splits both magnitude and delivery proportionally, giving each contributor their own artifact. So a 10-point PR that's 70/30 by lines becomes 7.0 and 3.0 points. This gives fair credit for pairing without one person banking the whole PR.
Documents (design docs, RFCs, wiki pages) score each time they change, but with a 30-day window to prevent edit-padding. Update a doc within 30 days of your last change and the existing artifact is updated in place — its date moves forward and the score recalculates, no new artifact. After 30 days the doc is treated as complete: the artifact freezes and later edits don't create new ones. Multi-author docs split by blame analysis, same as PRs.
Tickets score ticket creation — specifically the quality of how the work is defined (clear requirements, acceptance criteria, context), not its assignment or completion. This is deliberately separate from the work that follows: the person who writes a thorough user story earns ticket-artifact credit, and the person who implements it earns PR credit. Two distinct contributions, scored independently, so good planning is recognized and nothing is double-counted.
Communications (Slack, Teams, Google Chat) score technical discussion tied to real work — messages linked to a merged PR, a completed ticket, or a document, or replies in such a thread. Casual chat, scheduling, and status updates are excluded. Messages are batched hourly and grouped one artifact per thread, per person: all of your messages in a thread combine into a single artifact dated to the thread's start, updated if you add more. Each participant is scored separately for their own contributions. What's explicitly not measured: message count, response time, or reactions — volume isn't the point, substance tied to work is.
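The blame-based split for multi-author pull requests and documents is a plain proportional share. A minimal sketch, assuming the per-author line counts have already been produced by a blame analysis (the function name and the line counts are illustrative):

```python
def split_by_blame(total_points: float, lines_by_author: dict[str, int]) -> dict[str, float]:
    """Split an artifact's points in proportion to each author's changed lines."""
    total_lines = sum(lines_by_author.values())
    if total_lines == 0:
        return {author: 0.0 for author in lines_by_author}
    return {
        author: total_points * lines / total_lines
        for author, lines in lines_by_author.items()
    }


# A 10-point PR where blame attributes 140 of 200 changed lines to one author.
print(split_by_blame(10.0, {"alice": 140, "bob": 60}))  # {'alice': 7.0, 'bob': 3.0}
```

The shares always add back up to the original score, so splitting never creates or loses points.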

Boilerplate filtering

Raw line counts can be inflated by code nobody really authored, so Pensero strips boilerplate before sizing the change: lock files, generated schemas (Protobuf, GraphQL, OpenAPI), migrations and compiled assets, large test fixtures, and whitespace-only reformatting. Magnitude is then assessed on the cleaned diff.

Example

A 770-line PR = 450 lines of package-lock.json + 200 lines of test fixtures + 120 lines of real feature code.

After filtering, only the 120 feature lines count—so it scores as Medium, not Large.

This cuts both ways: it stops volume-padding, and it also means necessary boilerplate committed alongside real work doesn’t unfairly inflate or distort the score.
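The example can be sketched as a path filter applied before counting lines. The glob patterns and file paths below are hypothetical stand-ins; Pensero's actual rules for what counts as boilerplate are not listed beyond the categories above.

```python
from fnmatch import fnmatch

# Hypothetical patterns for the boilerplate categories named in this section.
BOILERPLATE_PATTERNS = ["*package-lock.json", "*.lock", "*/fixtures/*", "*.pb.go", "*.min.js"]


def authored_lines(changed_lines_by_path: dict[str, int]) -> int:
    """Count changed lines after dropping files that match a boilerplate pattern."""
    return sum(
        lines
        for path, lines in changed_lines_by_path.items()
        if not any(fnmatch(path, pattern) for pattern in BOILERPLATE_PATTERNS)
    )


pr = {
    "package-lock.json": 450,
    "tests/fixtures/users.json": 200,
    "src/billing/invoice.py": 120,
}
print(sum(pr.values()), authored_lines(pr))  # 770 120
```

Only the 120 remaining lines feed the magnitude assessment.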

Penalties

Penalties adjust scores so they reflect delivered value, not raw activity. They’re additive but capped at 100% (a score can’t go below zero), and they’re recalculated as new PRs merge—which is why scores can shift after the fact.

Penalty	Amount	When it applies
Revert	100%	PR reverts a previous PR. Both the original and the revert lose credit.
Duplicate code	100%	PR duplicates another PR’s code (e.g. same branch to a second target).
Duplicate document	100%	Document duplicates another’s content.
Release merge	Partial	PR merges other PRs; only net-new work counts, credit stays with the originals.

Rework and bugfix deductions exist in the system but are currently disabled for customers.
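"Additive but capped at 100%" works out as follows. This is a sketch under the assumption that each penalty is expressed as a fraction of the artifact's score; the 50% partial penalty in the example is an invented figure, since the size of a release-merge penalty depends on the PR.

```python
def apply_penalties(score: float, penalties: list[float]) -> float:
    """Add the penalty fractions, cap the total at 100%, and reduce the score accordingly."""
    total = min(sum(penalties), 1.0)
    return score * (1.0 - total)


print(apply_penalties(11.5, [1.0]))      # 0.0  (reverted PR loses all credit)
print(apply_penalties(8.0, [1.0, 1.0]))  # 0.0  (two 100% penalties still cap at 100%)
print(apply_penalties(6.0, [0.5]))       # 3.0  (hypothetical 50% partial penalty)
```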

Reading penalties as a manager

Revert signals a quality or testing gap—a coaching moment, not a scoring quirk: “what testing would have caught this before production?”
Release merge and duplicate-code penalties are normal in feature-branch and multi-environment (dev → staging → prod) workflows. They just prevent double-counting; no concern.
Duplicate-document penalties catch copy-paste gaming and also accidental uncoordinated duplicate design docs.

False positives can be flagged for review and overridden by admins in rare cases.

How scores roll up

Aggregation is simple summation with no tricks. Artifact scores (already post-penalty) sum to the individual total, and individuals sum to the org total. Work is attributed to its original owner; reviews to the reviewer; collaborative work like multi-authored PRs is split automatically. Only completed artifacts in the date range count—merge date for PRs, completion date for tickets, publish date for documents.
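As a sketch of that roll-up, assume each artifact is reduced to an owner, the date it counts on, and its post-penalty points. That tuple shape is an illustration, not Pensero's data model.

```python
from collections import defaultdict
from datetime import date


def roll_up(artifacts: list[tuple[str, date, float]], start: date, end: date):
    """Sum post-penalty artifact points per person, then sum people into an org total."""
    per_person: dict[str, float] = defaultdict(float)
    for owner, counted_on, points in artifacts:
        if start <= counted_on <= end:  # only artifacts completed inside the range
            per_person[owner] += points
    return dict(per_person), sum(per_person.values())


artifacts = [
    ("alice", date(2024, 3, 4), 7.0),
    ("bob", date(2024, 3, 4), 3.0),
    ("alice", date(2024, 3, 6), 1.5),
    ("bob", date(2024, 2, 28), 5.0),  # outside the selected week, so not counted
]
print(roll_up(artifacts, date(2024, 3, 4), date(2024, 3, 10)))
# ({'alice': 8.5, 'bob': 3.0}, 11.5)
```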

Timing to set expectations with your team: an artifact scores within minutes of merge, then a background job checks for penalties over the next day or two and may adjust the score retroactively. Day-to-day numbers fluctuate; weekly is the right cadence to read delivery.

Reading delivery well

The healthy and warning signals below matter more than any single number.

Healthy signs

Steady week-to-week rather than erratic.
A balanced artifact mix—not 100% PRs, but reviews and docs too.
Complexity that tracks seniority (a senior’s work shows higher complexity than a junior’s).

Warning signs

Wide variance within a team (some carrying far more than others) — load imbalance or skill gap.
A sustained declining trend — process friction or mounting tech debt.
Very low review contribution across the team — siloing.

Context first: compare engineers at similar levels, account for team type (platform teams deliver fewer, higher-complexity points than product teams), and remember some real work—verbal mentorship, on-call, dev-tooling—isn’t captured at all.

Common questions

Can engineers game their score?

It’s difficult and self-defeating. Boilerplate filtering removes padding, magnitude caps at XL no matter the line count, complexity is assessed rather than counted, and penalties strip duplicated or reverted work. Because the system spans many metrics, gaming delivery usually degrades quality or efficiency. Trivial-change spam stays XXS × Level 1—effectively worthless. The real defense is cultural: when scores drive support rather than punishment, the incentive to game disappears.

What about glue work that doesn’t show up?

Reviews, RFCs, design docs, and technical discussion are scored, and high-complexity unblocking work scores well. But verbal mentorship, process improvements like CI/CD, and on-call firefighting aren’t captured yet. So when delivery looks low, ask: “are you doing glue work or mentorship that isn’t captured?” Treat the metric as the start of the conversation.

Why did the score change retroactively?

Penalties run asynchronously after merge. A score can appear immediately, then drop a day later when overlap with another PR, a revert, or a duplicate is detected. This is expected—check weekly, not daily.

How do I explain a low score to an engineer?

Gather context first: check the artifact mix (mostly reviews?), recent penalties (reverts or duplicates?), complexity (strategic or architectural work takes longer), timing (ramping on a new project?), and external factors (PTO, on-call, incidents). Then frame it with curiosity, not accusation—“I see your delivery is lower than usual, what’s taking up your time and how can I help?” Metrics reveal symptoms; the conversation finds the diagnosis.

How Quality Is Measured

The five quality metrics are defined in What Pensero Measures. This section covers what sits underneath them—how Pensero detects a bug, traces it to the PR that caused it, and spots rework, reverts, and duplication automatically. Knowing the detection logic lets you trust the numbers and explain them without sounding like you’re assigning blame.

Detection runs automatically on every merge

When a PR merges, Pensero analyzes it within minutes: it decides whether the PR fixes a bug, traces a bugfix back to the change that introduced it, compares the diff against recent PRs to catch rework, reverts, and duplicates, and pulls in test-coverage data if a coverage tool is connected. Results land in the quality metrics within hours, and historical data updates if a bugfix links back to an older PR.

Consistent and bias-free: the same analysis runs for everyone, engineers can see their own metrics and why they moved, and the focus is the trend over time—bugs happen, what matters is the direction.

How bugs are traced to their source

The hard question is: when a bugfix lands, how does Pensero know which PR caused the bug? It identifies the files and lines the fix touches, walks git history to find which PR last changed those lines, and links the two—defect introduced ↔ defect fixed.

Example

March 1: Jane merges PR #100 adding authentication logic.

March 15: Tom merges PR #150, “Fix login crash,” touching lines 45–60 of auth.py.

Those lines trace back to Jane’s PR #100. Jane’s defect rate reflects the source; Tom gets credit for the fix.

So engineers are credited for fixing bugs, see the impact when their own code introduces one, and the org sees the total cost of defects—all without a manager having to adjudicate.
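The tracing step can be illustrated with a toy blame map. This sketch assumes blame data is already available as "line number to the PR that last changed it", and it picks the PR that owns the most touched lines, which is a simplifying assumption; PR #87 is an invented earlier PR added for contrast.

```python
from collections import Counter
from typing import Optional


def introducing_pr(blame: dict[int, int], fixed_lines: range) -> Optional[int]:
    """Return the PR that last changed most of the lines a bugfix touches."""
    hits = Counter(blame[line] for line in fixed_lines if line in blame)
    return hits.most_common(1)[0][0] if hits else None


# Hypothetical blame for auth.py just before the fix: line number -> PR number.
blame_auth_py = {line: 87 for line in range(1, 40)}
blame_auth_py.update({line: 100 for line in range(40, 81)})

# PR #150 touches lines 45-60, which all trace back to PR #100.
print(introducing_pr(blame_auth_py, range(45, 61)))  # 100
```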

How each metric is detected

Defect rate — bugs the org introduced

AI reads each PR’s title, description, and diff to decide whether it fixes broken behavior—a crash, wrong result, logic error, or previously-failing functionality. It’s deliberately conservative: feature work, refactoring, performance tuning, security hardening, config changes, and dependency bumps don’t count. It then isolates which lines are the actual fix versus unrelated cleanup, so only the bug-fixing share of the PR counts toward the defect signal.

Rework rate — rewriting recent code

Pensero compares each PR line-by-line against PRs merged in roughly the last month in the same repo. When a high share of lines a recent PR added are now being removed or replaced, that’s rework—lower overlap is just normal evolution. The useful distinction for conversations: rework is replacing code because the first attempt was wrong (often from unclear requirements); refactoring improves structure while preserving behavior and shows up as improvement work, not rework. The metric is built to separate the two.
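The overlap idea reduces to one ratio. In this sketch lines are compared as plain strings in sets, and the 0.5 cut-off is a hypothetical stand-in for "a high share", which the product does not quantify here.

```python
REWORK_THRESHOLD = 0.5  # hypothetical cut-off for "a high share"


def rework_share(recent_added: set[str], now_removed: set[str]) -> float:
    """Share of the lines a recent PR added that the new PR removes or replaces."""
    if not recent_added:
        return 0.0
    return len(recent_added & now_removed) / len(recent_added)


recent_pr_added = {"def price(order):", "    total = 0", "    for i in order:", "        total += i.cost"}
new_pr_removed = {"    total = 0", "    for i in order:", "        total += i.cost"}

share = rework_share(recent_pr_added, new_pr_removed)
print(share, share >= REWORK_THRESHOLD)  # 0.75 True
```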

Revert rate — rolled-back PRs

Detected from revert/rollback patterns in titles and git metadata, then verified by checking that the revert PR’s removed lines closely match the original’s added lines. This is a count-based signal (a PR was either rolled back or it wasn’t), so size doesn’t factor in. A revert is the most severe quality signal—code that couldn’t be fixed incrementally and had to be undone.
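The two-step check (a title pattern, then a line-level match) might look like the sketch below. The regular expression and the 90% match threshold are assumptions; the only fixed point is that `git revert` titles its commit `Revert "<original subject>"` by default.

```python
import re

REVERT_TITLE = re.compile(r"\b(revert|rollback|roll back)\b", re.IGNORECASE)


def looks_like_revert(title: str, removed: list[str], original_added: list[str],
                      min_match: float = 0.9) -> bool:
    """Flag a PR as a revert when its title says so and it removes what the original added."""
    original = set(original_added)
    if not REVERT_TITLE.search(title) or not original:
        return False
    matched = len(set(removed) & original)
    return matched / len(original) >= min_match


added = ["retries = 3", "backoff = 2.0", "client.retry(retries, backoff)"]
print(looks_like_revert('Revert "Add retry logic"', added, added))  # True
print(looks_like_revert("Add retry logic", added, added))           # False
```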

Duplicate code rate — copy-pasted code

Content from each changed line is normalized (whitespace and formatting ignored), hashed, and matched against earlier PRs. A high share of identical-hash lines flags duplication. Because matching runs on normalized content, reformatting won't hide a copy, and formatting alone won't cause false hits.
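A minimal sketch of normalize-hash-match, assuming normalization simply removes all whitespace (the real normalization rules are not documented here) and using SHA-256 as an arbitrary hash choice:

```python
import hashlib


def line_hash(line: str) -> str:
    """Hash a line after stripping all whitespace, so formatting differences vanish."""
    normalized = "".join(line.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_share(new_lines: list[str], earlier_lines: list[str]) -> float:
    """Share of non-blank new lines whose normalized hash already appears in earlier PRs."""
    seen = {line_hash(line) for line in earlier_lines if line.strip()}
    candidates = [line for line in new_lines if line.strip()]
    if not candidates:
        return 0.0
    return sum(line_hash(line) in seen for line in candidates) / len(candidates)


earlier = ["total = price * qty", "return total"]
new = ["total  =  price*qty", "    return total", "log(total)"]
print(round(duplicate_share(new, earlier), 2))  # 0.67
```

The first two new lines differ from the earlier ones only in spacing and indentation, so they hash identically and count as duplicates.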

Code coverage — test coverage

Coverage comes from an external tool (Codecov today; others planned). On merge, Pensero fetches the diff coverage—the share of new lines tested—and rolls it up weighted by PR size, so a large under-tested PR moves the number more than a tiny fully-tested one. Without a coverage integration this metric shows N/A. One caveat worth repeating to teams: coverage shows what’s tested, not whether the tests are any good.
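To see why size-weighting matters, here is a sketch that weights each PR's diff coverage by its count of new lines. Using new lines as the size measure is an assumption, and the numbers are invented.

```python
from typing import Optional


def weighted_coverage(prs: list[tuple[int, float]]) -> Optional[float]:
    """Roll up diff coverage across PRs, weighting each PR by its number of new lines."""
    total_lines = sum(lines for lines, _ in prs)
    if total_lines == 0:
        return None  # nothing to measure, shown as N/A
    return sum(lines * coverage for lines, coverage in prs) / total_lines


# (new lines, diff coverage): one large under-tested PR and one tiny fully-tested PR.
prs = [(400, 0.40), (20, 1.00)]
print(round(weighted_coverage(prs), 3))  # 0.429
```

A simple average of the two PRs would read 70%; the weighted figure of about 43% reflects that most of the new code is untested.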

Reading the metrics together

As with delivery, combinations tell the story. The pairings below map a metric pattern to its likely cause—use them to read a team at a glance.

Pattern	Likely meaning
Low defect + high coverage	Testing discipline is working.
High defect + low coverage	Insufficient testing; bugs slipping through.
High rework + high duplicate	Poor code organization—messy, hard to maintain, frequently rewritten.
High rework + high revert	Unstable requirements or rushed code.
Quality sliding across consecutive periods	Compounding pressure: ramping hires, deadlines, or accruing tech debt.

Context still governs. High rework is normal for an early-stage product finding fit and expected briefly after a big refactor, but concerning in a mature product. A blanket coverage mandate tends to backfire—engineers write hollow tests to hit the number. Better to require coverage on critical paths (auth, payments, data integrity) and watch the trend.

The delivery-vs-quality quadrant

Pensero plots a quadrant chart—delivery on the x-axis, a quality score (the inverse of defect rate) on the y-axis—to show the trade-off between shipping and stability at a glance. The four quadrants:

Quadrant	What it suggests
High delivery, high quality	Sustainable practices—shipping steadily without introducing bugs.
High delivery, low quality	Shipping fast but introducing defects—often a sign of pressure to ship without enough testing.
Low delivery, high quality	Careful but slow—may reflect genuinely complex work or an overly cautious culture.
Low delivery, low quality	Where support is most likely needed—worth understanding why before concluding anything.
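Assigning a dot to a quadrant is a pair of threshold checks. Where Pensero draws the dividing lines is not stated, so this sketch uses the group medians as a hypothetical choice; the people and scores are invented.

```python
from statistics import median


def quadrant(delivery: float, quality: float, delivery_cut: float, quality_cut: float) -> str:
    """Name the quadrant for one person given the two dividing lines."""
    d = "High" if delivery >= delivery_cut else "Low"
    q = "high" if quality >= quality_cut else "low"
    return f"{d} delivery, {q} quality"


# Hypothetical (delivery points, quality score) per person.
team = {"a": (12.0, 0.95), "b": (14.0, 0.70), "c": (4.0, 0.92), "d": (3.0, 0.60), "e": (8.0, 0.85)}
delivery_cut = median(d for d, _ in team.values())  # 8.0
quality_cut = median(q for _, q in team.values())   # 0.85

for name, (d, q) in team.items():
    print(name, quadrant(d, q, delivery_cut, quality_cut))
```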

Read the quadrant at the population level, not as a verdict on individuals. A cluster in “high delivery, low quality” points at a systemic incentive to ship without testing; a cluster in “low delivery, high quality” may signal work that’s harder than it looks or a culture that’s over-indexing on caution. Use a person’s position as the opening of a conversation—“what’s the work in this area like right now?”—rather than a label. The same caveats from delivery scoring apply: complex and enablement work shifts where someone lands.

Using quality data in conversations

Lead with curiosity, not a verdict. “Your defect rate is higher than usual—what’s going on, and where do you need support?” surfaces the real cause; “your defect rate is unacceptable” just creates defensiveness. Common root causes worth probing: unfamiliar or complex area (pair with a senior), no time for tests (revisit priorities and deadlines), an edge case the engineer didn’t recognize as a bug (testing guidance), or shifting requirements (escalate to product for clarity).

In the product, org-wide rates and trend charts live on the Quality page; per-person quality sits in the Quality → People tab, sortable and drillable for 1-on-1 prep. Use it to find who might need support, not to rank people.

Can engineers game quality metrics?

Mostly no. Defect detection reads PR content, so relabeling a bugfix as a “refactor” doesn’t fool it. Rework and duplication are line-level content matches—renaming or reformatting won’t hide them. Reverts come straight from authoritative git history.

The one soft spot is coverage: you can write tests that exercise code without really asserting anything. The defense is review—check test quality, not just the percentage.

How Reviews & Collaboration Are Measured

The five review metrics are defined in What Pensero Measures. This section covers the mechanics underneath—how a review comment gets judged useful or not, how Pensero knows whether feedback was acted on, and how it detects pairing—plus the collaboration patterns worth recognizing. Reviews are where knowledge sharing and peer quality-control happen, so these signals tell you whether your org is learning together or working in silos.

What counts as review work

Review (or enablement) work is anything that helps someone else ship: reviewing PRs, documents, or tickets; co-authoring through pairing or mobbing; and technical help in messages and Q&A. Approving your own PR and creating your own work don’t count. The metrics weigh this enablement effort against creation effort—the point isn’t to maximize either, but to see the balance.

How the harder metrics are detected

Review usefulness — is the feedback substantive?

AI classifies every review comment. A comment counts as useful when it flags a defect (bug, bad logic, security issue) or a weakness (performance problem, missing edge case), or offers a concrete suggestion for a better approach. It’s noise when it’s a bare “LGTM,” a style nitpick, or a non-technical aside. The metric is the useful share, so it measures whether reviews catch real issues or just rubber-stamp.

Addressed rate — was the feedback acted on?

Pensero checks whether the author pushed changes after a comment and records one of four outcomes: fully addressed, partially addressed, ignored, or undetermined (e.g. a comment landing right before merge). Only useful comments count toward the rate (noise is excluded), and it’s weighted by severity, so resolving a critical bug matters more than acting on a minor suggestion. The revealing combination is usefulness and addressed rate together: useful feedback that’s consistently ignored is a cultural signal, not a tooling one.
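To make the two rates concrete, here is a minimal sketch in Python. It is not Pensero's implementation: the severity weights, the half credit for partially addressed comments, and the choice to leave undetermined comments out of the denominator are all assumptions made for illustration, as are the names `Comment`, `usefulness_share`, and `addressed_rate`.

```python
from dataclasses import dataclass

# Hypothetical weights and credits. The guide says the rate is weighted by
# severity but does not publish the actual values.
SEVERITY_WEIGHT = {"critical": 3.0, "major": 2.0, "minor": 1.0}
OUTCOME_CREDIT = {"addressed": 1.0, "partial": 0.5, "ignored": 0.0}


@dataclass
class Comment:
    useful: bool    # result of the usefulness classification
    severity: str   # key into SEVERITY_WEIGHT
    outcome: str    # "addressed", "partial", "ignored", or "undetermined"


def usefulness_share(comments):
    """Share of all review comments classified as useful."""
    if not comments:
        return None
    return sum(c.useful for c in comments) / len(comments)


def addressed_rate(comments):
    """Severity-weighted addressed rate over useful comments only.

    Undetermined comments are skipped (an assumption of this sketch).
    """
    scored = [c for c in comments if c.useful and c.outcome in OUTCOME_CREDIT]
    total = sum(SEVERITY_WEIGHT[c.severity] for c in scored)
    if total == 0:
        return None
    earned = sum(SEVERITY_WEIGHT[c.severity] * OUTCOME_CREDIT[c.outcome] for c in scored)
    return earned / total


comments = [
    Comment(True, "critical", "addressed"),
    Comment(True, "minor", "ignored"),
    Comment(True, "major", "undetermined"),  # skipped: outcome unknown
    Comment(False, "minor", "ignored"),      # noise: never counts toward the rate
]
print(usefulness_share(comments), addressed_rate(comments))
```

Because of the weighting, fixing the one critical issue outweighs ignoring the minor one, which is the behavior the severity weighting is meant to produce.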

Pairing rate — collaborative authorship

A PR counts as paired when it has multiple commit authors, Co-authored-by tags, or other evidence of joint work. Pairing is deliberately expensive—two people on one task—so the goal isn’t a high number but the right uses: onboarding, unfamiliar or complex problems, and spreading knowledge in critical systems.
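The first two signals can be checked mechanically from commit data. The sketch below is illustrative only: it assumes each commit is available as an `(author_email, message)` tuple, and the function names are hypothetical. The `Co-authored-by: Name <email>` trailer itself is the standard Git convention.

```python
import re

# Conventional co-author trailer: "Co-authored-by: Name <email>"
CO_AUTHOR = re.compile(
    r"^co-authored-by:\s*(.+?)\s*<([^>]+)>\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def pr_contributors(commits):
    """Collect distinct contributor emails from commit authors and trailers.

    `commits` is a list of (author_email, commit_message) tuples.
    """
    emails = set()
    for author_email, message in commits:
        emails.add(author_email.lower())
        for _name, email in CO_AUTHOR.findall(message):
            emails.add(email.lower())
    return emails


def is_paired(commits):
    """Treat a PR as paired when more than one distinct person contributed."""
    return len(pr_contributors(commits)) > 1


solo_pr = [("ana@example.com", "Fix pagination bug")]
paired_pr = [("ana@example.com", "Add user search\n\nCo-authored-by: Bo <bo@example.com>")]
print(is_paired(solo_pr), is_paired(paired_pr))
```

This covers only authorship evidence; the "other evidence of joint work" mentioned above is outside what a sketch like this can detect.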

Collaboration patterns to recognize

The patterns matter more than any single rate. Each below is a combination you can spot at a glance and what it tends to mean.

Pattern	What it looks like and means
Healthy collaboration	Balanced creation and review, useful feedback that gets acted on, some pairing. Knowledge moves and quality holds.
Siloing risk	Very low review and pairing, too few reviews to even measure quality. Knowledge concentrates—bus-factor and coverage risk if someone leaves or takes leave.
Review theater	Lots of review activity but low usefulness and low addressed rate. The motions of review without the value—“LGTM” culture, feedback ignored, a false sense of safety.
Over-collaboration	Review far exceeds creation, constant pairing. Feedback quality may be high, but delivery drags under process overhead; can be right during big refactors or heavy onboarding.

Review theater is the one most worth watching for: high activity makes it look healthy on a dashboard, but low usefulness plus low addressed rate means the review process is a checkbox, not a quality gate.

The delivery-vs-reviews quadrant

A parallel to the delivery-vs-quality quadrant: delivery on the x-axis, enablement (review) activity on the y-axis. It surfaces collaboration style at a glance.

Quadrant	What it suggests
High delivery, high reviews	Multipliers—shipping their own work and lifting others. Seniors often land here.
High delivery, low reviews	Solo contributors—shipping independently; watch for knowledge-sharing risk.
Low delivery, high reviews	Enablers—focused on helping others; appropriate for some roles, worth a check for juniors.
Low delivery, low reviews	May be blocked or need support—understand why before concluding.

Read it at the population level. A senior sitting in “solo contributor” may need a nudge on mentorship expectations; a junior parked in “enabler” may be reviewing when they should be building; a wide healthy spread is a balanced culture. As always, position opens a conversation rather than settling one—role and project phase move where someone lands.
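As a rough illustration of how people land in a quadrant, the sketch below splits a team at its own medians. Where Pensero actually draws the high/low line is not stated in this guide, so the median cut, the sample numbers, and the names `quadrant` and `team` are assumptions.

```python
from statistics import median


def quadrant(delivery, reviews, delivery_cut, reviews_cut):
    """Place one person on the delivery-vs-reviews quadrant."""
    high_delivery = delivery >= delivery_cut
    high_reviews = reviews >= reviews_cut
    if high_delivery and high_reviews:
        return "multiplier"
    if high_delivery:
        return "solo contributor"
    if high_reviews:
        return "enabler"
    return "may need support"


# Hypothetical (delivery, review activity) pairs for a four-person team.
team = {"ana": (42, 30), "ben": (40, 4), "cy": (10, 28), "dee": (8, 3)}

# Assumption: the high/low line is the team median on each axis.
delivery_cut = median(d for d, _ in team.values())
reviews_cut = median(r for _, r in team.values())

labels = {
    name: quadrant(d, r, delivery_cut, reviews_cut)
    for name, (d, r) in team.items()
}
print(labels)
```

The labels are conversation starters, in line with the guidance above, not ratings.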

Reading review ratios in context

There is no universal “right” review ratio—it tracks role, seniority, and project phase. Seniors run higher (they’re multipliers, mentoring and unblocking); juniors run lower while they learn and build; onboarding periods push everyone’s ratios up. Rather than mandating a number, watch the quality signals (usefulness, addressed rate) and the trend. A low pairing rate isn’t a problem on its own—but low pairing plus low review ratio plus high knowledge gaps is a real siloing signal.

In the product, org-wide review metrics and trends are on the Reviews page; per-person collaboration sits in the Reviews → People tab for spotting who’s siloed or over-extended before a 1-on-1.

Common questions

Should I mandate code reviews?

Most orgs already require an approval before merge; the real question is whether review is effective. Effective review shows up as substantive feedback that gets acted on without overwhelming people’s time. Ineffective review is rubber-stamping, ignored feedback, or so much review activity that it becomes pure overhead. Measuring usefulness and addressed rate—and recognizing people who give valuable reviews—beats mandating a quota.

Is a low pairing rate bad?

Not by itself. Pairing earns its cost on onboarding, unfamiliar or complex problems, and critical-system knowledge spread; solo work is fine for well-understood tasks and clearly-owned parallel work. It only becomes a red flag alongside low review ratio and high knowledge gaps.

Review usefulness looks low—what now?

Investigate before reacting. Low usefulness usually means people are rushing reviews, an “LGTM” norm has set in, reviewers lack the context to add value, or PRs are too large to review well. Sampling the low-usefulness comments reveals which. The fixes follow the cause: coach on what a good review catches (logic, not just style), encourage smaller focused PRs, and route PRs to reviewers with the right domain knowledge.

How Scope of Work Is Categorized

The six scope-of-work metrics are defined in What Pensero Measures. This section covers how every piece of work gets sorted into a category in the first place, and how to read the resulting investment mix. Think of scope of work as your engineering investment portfolio—where time actually goes, which is often not where you assume it goes.

How work gets classified

When work completes—a PR merges, a ticket closes, a document publishes—AI reads its content (title, description, commit messages, ticket labels) and sorts it into one of five categories. The signals it weighs are keywords (“add,” “fix,” “optimize,” “refactor”), structure (new files vs. modified files), ticket labels (bug, feature, enhancement), and the semantic purpose of the change.

Category	What lands here
New stuff	Building something that didn’t exist—new features, modules, endpoints, services.
Improvement	Enhancing something that already exists—adding to or refining a current feature.
KTLO	Keeping the lights on—bug fixes (any severity), security patches, tech debt, dependency updates, infra maintenance.
Performance	Scalability and optimization—caching, query tuning, sharding, architecture for scale.
Other	Work that doesn’t fit the above.

New stuff and improvement are the easy pair to confuse: “add user search” is new stuff (it didn’t exist); “add filters to user search” is improvement (enhancing what’s there). The five shipping categories sum to 100% of delivery.

Capitalizable % is derived, not classified. It’s simply new stuff plus improvement—the development that creates new assets (CapEx) versus maintenance and operations (OpEx). No separate tagging; it falls out of the categories above, which is what makes it audit-friendly.
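The derivation is simple enough to sketch. The example assumes the mix is computed over delivery points per category; the category keys, sample figures, and function names are hypothetical.

```python
CATEGORIES = ("new_stuff", "improvement", "ktlo", "performance", "other")


def investment_mix(points_by_category):
    """Convert per-category delivery into percentages that sum to 100."""
    total = sum(points_by_category.get(c, 0) for c in CATEGORIES)
    if total == 0:
        return {}
    return {c: 100 * points_by_category.get(c, 0) / total for c in CATEGORIES}


def capitalizable_pct(mix):
    """Capitalizable % is new stuff plus improvement; no separate tagging."""
    return mix["new_stuff"] + mix["improvement"]


mix = investment_mix(
    {"new_stuff": 30, "improvement": 25, "ktlo": 35, "performance": 5, "other": 5}
)
print(mix, capitalizable_pct(mix))
```

Since the figure falls out of the category shares, any change to an item's category flows straight through to capitalizable %.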

Roadmap alignment is configured, not inferred

Unlike the five categories, roadmap alignment isn’t an AI judgment about the nature of the work—it’s about whether work connects to your strategic tracking. Each org configures what “aligned” means: specific epic IDs, projects or initiatives, custom labels like “roadmap,” certain ticket states, or bespoke query logic matching your planning process. Ad-hoc bug fixes, unplanned experiments, incidents, and untagged work fall outside it. So roadmap alignment measures planning discipline—executing the plan versus firefighting—rather than what kind of code was written.
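One way to picture a configured rule is as a predicate over work items. The sketch below is a hypothetical model, not Pensero's configuration format: `AlignmentRule`, the item dictionary shape, and counting alignment by item rather than by delivery are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AlignmentRule:
    """Hypothetical rule: an item is aligned if it matches an epic or a label."""
    epic_ids: set = field(default_factory=set)
    labels: set = field(default_factory=set)

    def matches(self, item):
        in_epic = item.get("epic") in self.epic_ids
        has_label = bool(self.labels & set(item.get("labels", [])))
        return in_epic or has_label


def roadmap_alignment_pct(items, rule):
    """Share of completed items that match the configured rule."""
    if not items:
        return None
    aligned = sum(rule.matches(item) for item in items)
    return 100 * aligned / len(items)


rule = AlignmentRule(epic_ids={"EPIC-1"}, labels={"roadmap"})
items = [
    {"epic": "EPIC-1"},        # aligned via epic
    {"labels": ["roadmap"]},   # aligned via label
    {"labels": ["bug"]},       # ad-hoc bug fix: outside the roadmap
    {},                        # untagged work: outside the roadmap
]
print(roadmap_alignment_pct(items, rule))
```

The point of the model is that nothing here inspects the code itself, only how the work is linked to planning.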

Reading the investment mix

The mix is a portfolio, and the patterns matter more than any single percentage. Three worth recognizing:

Pattern	What it looks like and means
Balanced portfolio	Most work tied to roadmap, real innovation capacity, steady polish, manageable maintenance. Executing strategy while holding quality.
Tech-debt crisis	Maintenance consuming a large share, roadmap alignment and new-stuff both depressed. Firefighting mode—little room to innovate, and capitalizable % drops with it.
Premature optimization	Heavy performance investment with low innovation for the stage. If user volume is low, scale work is stealing from features you actually need.

Stage governs what “good” looks like—see the stage cues in the glossary. Broadly, new-stuff share starts high at MVP and declines as a product matures, while KTLO and improvement rise. Performance should be low pre-scale (don’t optimize prematurely) but climbs when you’re actually scaling—and low performance investment during hypergrowth is the warning sign, not low investment early on. The same number means opposite things at different stages.

Using scope data well

The highest-value move is comparing your actual mix against your intended one, then acting on the gap: “we’re at 38% KTLO, so let’s ring-fence 20% of next sprint for debt reduction.” Trends beat snapshots—a KTLO number climbing month over month, or new-stuff steadily shrinking, says more than any single reading. Org-wide mix and trends live on the Work page; per-person category breakdowns are in the Work → People tab.

Roadmap alignment as an early signal: a sharp drop usually has a concrete cause—a major incident, a wave of customer escalations, urgent tech debt, or planning that slipped. It’s worth a “what pulled us off-plan?” conversation rather than a verdict on the team.

Common questions

How accurate is the AI categorization?

Generally high—it’s trained on many examples, draws on multiple signals (title, description, labels, code), and is conservative when uncertain. You can click any work item to see its category and reasoning, and admins can override edge cases (the original classification is preserved for audit). Overrides should stay rare exceptions; routine re-tagging undermines the consistency that makes the data trustworthy.

What about mixed or ambiguous work?

Edge cases like “refactor for performance” (KTLO or performance?) or “fix a bug and add a feature” (KTLO or new stuff?) are resolved by primary intent—the AI picks the category reflecting the bulk of the effort. It’s a reasonable default, and the per-item view lets you check anything that looks off.

Is a high KTLO number bad?

Depends on whether it’s a spike or a pattern. A one-month jump from incident response, security patching, or a deliberate debt sprint will normalize next period. The same level sustained across several months points to a systemic quality or debt problem—cross-check the defect rate, and consider a dedicated debt-reduction sprint plus upstream testing to stop bugs at the source.

How Efficiency Is Measured

The five efficiency metrics are defined in What Pensero Measures. This section covers two things. First, the statistical views Pensero offers for the time metrics (P90, P80, median, and average) and why the default is P90. Second, and more usefully, how to read the four time metrics as a sequence to pinpoint exactly where delivery is getting stuck. Efficiency is about finding where time leaks, not pushing people to move faster.

Why P90, and the other statistical views

All four time metrics default to the 90th percentile: 90% of work finishes within the stated time. But Pensero lets you switch the view — P90, P80, Median (P50), or Average — from the dropdown on efficiency charts, and each answers a slightly different question.

Averages get wrecked by a single stalled PR. Four PRs that merge in 2–5 hours plus one stuck for a week average out to ~36 hours, which describes none of them. The median lands at 4 hours, far closer to what engineers actually experience. (In a sample this small the stalled PR is 20% of the data, so P90 still includes it; at realistic volumes, where such stalls are well under 10% of PRs, P90 lands near the top of the normal range.) P90 strips the legitimate outliers (genuinely complex edge cases) and surfaces the systemic delay you can act on. It's also the DORA-standard approach and far less volatile when comparing teams or periods.

When to reach for each:

P90 (default) — finding systemic bottlenecks, comparing teams or periods, setting SLAs ("90% of PRs merge within a day").
P80 — a slightly tighter read when P90 feels too lenient.
Median (P50) — the typical engineer's experience, when outliers are genuinely exceptional rather than systemic.
Average — overall trend, most useful read alongside P90 to understand the distribution.
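The four views can be compared on a small sample. The sketch below uses the nearest-rank percentile definition, which is one common convention and not necessarily the one Pensero uses, on a hypothetical set of merge times: 19 routine PRs and one stalled for a week.

```python
import math
from statistics import mean, median


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the
    data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical merge times in hours: 19 routine PRs plus one stalled for a week.
hours = [2, 3, 4, 5] * 4 + [3, 4, 5] + [168]

views = {
    "P90": percentile(hours, 90),
    "P80": percentile(hours, 80),
    "median": median(hours),
    "average": round(mean(hours), 1),
}
print(views)
```

In this sample the single stalled PR more than doubles the average relative to P90, while the percentile views stay inside the 2 to 5 hour range that describes the routine work.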

The gap between views is itself the signal. A large average-to-P90 gap means a long tail: most work is fast, a few items drag. To tell "occasional outliers" from "everything is slow," compare median to P90: if the median sits well below P90, most PRs are fast and a handful of slow ones are stretching the tail; if the median is close to P90, most PRs really are slow. That distinction changes the fix: chase the few stuck PRs, or rework the process itself.

A large gap between P90 and average isn’t a problem—it just means the distribution has a long tail. That’s the signal P90 is designed to expose, so you can go fix the tail.

The four time metrics, and what sits between them

Each metric measures from PR creation (or ticket start) to a later moment. Read end to end, the gaps between them tell you where the time goes.

Metric	Measures from → to	The gap before it represents
Time to comment	PR created → first comment	How long until anyone looks at it (attention).
Time to approve	PR created → first approval	Comment → approve: how long the actual review takes.
Time to merge	PR created → merge	Approve → merge: CI, merge queue, deployment steps.
Cycle time	Ticket assigned → PR merge	The full end-to-end, including time-to-start-coding before the PR existed.

Cycle time needs ticketing (Jira, Linear) with start timestamps and PRs linked to tickets. The three time-to-* metrics and waste rate come from git data alone, so they’re available even without a ticketing integration.

Diagnosing a bottleneck from the progression

Because the metrics nest, the shape of the progression tells you where to look. Match your numbers to one of these:

Where the jump is	Diagnosis	What to try
Fast comment, slow approve	Review itself is the bottleneck: PRs get noticed but approval drags.	Add more approved reviewers; revisit strict multi-approval policies; check for a single-approver chokepoint.
Slow comment and approve	PRs sit idle—nobody looks until late.	Review rotation, PR notifications, assign reviewers at creation, budget review time into sprint capacity.
Fast approve, slow merge	A big approve→merge gap points outside review—CI, merge queue, or deployment.	Profile and parallelize CI, check merge-queue policy, automate manual deploy steps.
Fast throughout, low waste	Healthy process—maintain it.	No action needed.
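Because the three time-to-* metrics share a start point, the time spent in each stage is the difference between consecutive timestamps. The sketch below shows that subtraction for a single hypothetical PR; in practice you would apply the same reading to the P90 figures, and the stage names and functions are illustrative only.

```python
from datetime import datetime


def stage_gaps(created, first_comment, first_approval, merged):
    """Hours spent in each stage between PR creation and merge."""

    def hours(start, end):
        return (end - start).total_seconds() / 3600

    return {
        "waiting for attention": hours(created, first_comment),
        "in review": hours(first_comment, first_approval),
        "approve to merge": hours(first_approval, merged),
    }


def slowest_stage(gaps):
    """Name of the stage that consumed the most time."""
    return max(gaps, key=gaps.get)


# Hypothetical PR: noticed within an hour, approved two days later.
gaps = stage_gaps(
    created=datetime(2025, 1, 6, 9, 0),
    first_comment=datetime(2025, 1, 6, 10, 0),
    first_approval=datetime(2025, 1, 8, 10, 0),
    merged=datetime(2025, 1, 8, 11, 0),
)
print(gaps, slowest_stage(gaps))
```

This PR matches the "fast comment, slow approve" row: attention was quick, and nearly all the elapsed time sat in review.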

Waste rate — work that never shipped

Waste rate is the share of PRs closed without merging—code written but never delivered: abandoned experiments, wrong-direction work after requirements shifted, duplicates, and stale PRs nobody reviewed. Open PRs (still in progress) and merged PRs don’t count, and merged-then-rewritten work is rework, which lives in the quality metrics, not here.
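A minimal sketch of the calculation follows. The guide defines the numerator (PRs closed without merging) but not the exact denominator; this example assumes it is all finished PRs, merged plus closed-unmerged, with open PRs left out entirely. The state strings are hypothetical.

```python
def waste_rate(pr_states):
    """Percentage of finished PRs that were closed without merging.

    `pr_states` is a list of "merged", "closed" (unmerged), or "open".
    Open PRs are ignored; using finished PRs as the denominator is an
    assumption of this sketch.
    """
    merged = pr_states.count("merged")
    closed_unmerged = pr_states.count("closed")
    finished = merged + closed_unmerged
    if finished == 0:
        return None
    return 100 * closed_unmerged / finished


states = ["merged"] * 17 + ["closed"] * 3 + ["open"] * 5
print(waste_rate(states))
```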

It measures planning quality, not code quality. Some waste is healthy—experiments fail, prototypes get discarded, startups pivot on feedback. The question to ask is why PRs are closing: intentional learning, or dysfunction (premature starts on unclear requirements, duplicated work from poor communication, forgotten stale PRs)? An innovation team running many experiments will and should waste more than a mature product team with clear requirements.

Reading efficiency in context

Compare like with like—team type sets the baseline. Product teams move fastest and waste more (simple changes, heavy experimentation). Platform teams are slower by nature (complex changes that affect many teams need careful review) and waste less (more planned work). Infrastructure / SRE sits in between, with careful review because production impact is high. Comparing a platform team’s cycle time to a product team’s and concluding the platform team is “slow” is the classic misread.

Org-wide P90s with trend indicators are on the Efficiency page; per-person times are in the Efficiency → People tab. As everywhere, the trend matters more than the absolute—pick one bottleneck, change one thing, and watch whether the number moves.

Common questions

Can we improve speed without hurting quality?

Yes—the two aren’t opposed. The efficiency gains that help quality are exactly the ones to pursue: faster reviews catch issues sooner, clear requirements cut rework, parallelized CI speeds merges without risk, and better communication prevents duplicate work. What you don’t do is buy speed by skipping review, cutting coverage, or merging without approval—that just moves the cost into the quality metrics.

What if we don’t use tickets?

You lose cycle time (it needs ticket start timestamps), but the three time-to-* metrics and waste rate still work from git data alone. GitHub Issues/Projects can stand in if configured; ad-hoc spreadsheets won’t integrate.

How AI Impact Is Measured

The AI Impact metrics are defined in What Pensero Measures. This section covers where the data comes from and how to read adoption against effectiveness. The core idea: track both whether people are using AI tools and whether the usage is actually helping—high adoption with poor results means training, not celebration.

Which tools are tracked, and how

Pensero integrates with the major AI coding assistants—Cursor, GitHub Copilot, Claude Code, and Gemini Code Assist—via OAuth or API keys. It pulls usage daily (lines generated, tokens, cost), attributes that usage to specific commits and PRs, and rolls it up. This requires admin access to the AI-tool accounts, credentials configured in Pensero, and each engineer’s AI-tool account linked to their Pensero profile. Without those links, usage can’t be attributed.

Adoption vs. effectiveness

The metrics split into two questions. Adoption (are people using AI?) is captured by user adoption (the share of active engineers using AI at all) and AI-assisted % (the share of merged lines that were AI-generated). Effectiveness and cost (is it worth it?) are captured by tokens per delivery (efficiency, lower is better), AI cost (total spend), and delivery lift (output change versus a prior period). Read together, they tell you whether spend is translating into reach and into output.

Reading	What it suggests
High adoption, rising delivery lift	Tools are landing—reach and output both moving.
High cost, low AI-assisted %	Paying for capacity that isn’t being used—check access, training, unused licenses.
High AI-assisted %, climbing tokens per delivery	Usage is getting less efficient—over-long prompts, repeated retries, or AI used where it isn’t needed.
Low user adoption after many months	Barriers, not early days—access, awareness, or cultural resistance.

How the key figures are derived

Each AI tool reports lines generated per user per day; Pensero attributes those to commits, then PRs, then the org. AI-assisted % is AI-generated lines over total merged lines. Tokens per delivery is total tokens consumed over delivery points—a token being roughly four characters of prompt-in or code-out, so longer prompts and more retries cost more. User adoption counts active engineers (those who merged at least one PR) with any AI usage. Delivery lift compares this period’s delivery to a prior period’s.
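The four ratios above can be written out directly. This is a sketch under stated assumptions: the function names and sample figures are hypothetical, and delivery lift is modeled as a simple percent change between periods, which is one reasonable reading of "compares this period's delivery to a prior period's."

```python
def ai_assisted_pct(ai_lines, merged_lines):
    """AI-generated lines as a percentage of total merged lines."""
    return 100 * ai_lines / merged_lines if merged_lines else None


def tokens_per_delivery(tokens, delivery_points):
    """Total tokens consumed per delivery point (lower is better)."""
    return tokens / delivery_points if delivery_points else None


def user_adoption_pct(active_engineers, ai_users):
    """Share of active engineers (merged at least one PR) with any AI usage."""
    if not active_engineers:
        return None
    return 100 * len(active_engineers & ai_users) / len(active_engineers)


def delivery_lift_pct(current_delivery, prior_delivery):
    """Percent change in delivery versus a prior period (assumed definition)."""
    if not prior_delivery:
        return None
    return 100 * (current_delivery - prior_delivery) / prior_delivery


print(ai_assisted_pct(3_000, 12_000))
print(tokens_per_delivery(2_000_000, 400))
print(user_adoption_pct({"ana", "ben", "cy", "dee"}, {"ana", "ben", "zoe"}))
print(delivery_lift_pct(460, 400))
```

Note that `zoe` has AI usage but merged no PRs, so she is not an active engineer and does not raise the adoption figure.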

Treat delivery lift carefully: correlation isn’t causation. It shows output moved during AI adoption, but headcount changes, simpler work, or seasonality can all contribute. Use it as a signal to investigate, not as proof on its own—and compare like with like (same team, same kind of work) before drawing conclusions.

Reading AI metrics in context

AI-assisted % varies legitimately by work and person—UI and greenfield code lean higher, complex algorithms and legacy refactors lower; juniors often lean on it more than seniors. Adoption also takes time to build, so a low number in the first months is early days, while the same number a year in signals a barrier worth investigating. Org-wide AI metrics and trends live on the AI Impact page; per-person figures are in the AI Impact → People tab.

Common questions

Is high AI usage always good?

Not on its own. Usage is a means, not the goal—what matters is whether it produces quality output efficiently. Pair AI-assisted % with the efficiency and quality signals: high usage alongside climbing tokens per delivery or rising defects means someone is leaning on AI without getting clean, efficient results. The move there is coaching, not a usage target.

Can we cut cost without losing value?

Usually yes. Tokens per delivery surfaces inefficient usage—whoever’s burning far more than the team norm is often getting poor suggestions and retrying, which prompt-engineering coaching fixes. AI isn’t the right tool for every task (simple edits and boilerplate rarely need it), and user adoption reveals paid-but-unused licenses to reclaim. High aggregate usage is also leverage for volume pricing.

What if AI makes engineers over-reliant?

Watch for high usage paired with degrading quality signals—more reverts or rework. Healthy AI use looks like an engineer who still understands and can explain their code, reviews suggestions critically, and uses AI to accelerate the obvious rather than to skip the thinking. The cultural framing that works: AI accelerates the routine so you can spend judgment on the hard parts—it doesn’t replace the judgment.

Comparison Tools: Benchmark & Calibrate

Two executive-only pages answer different comparison questions. Benchmark asks “how are we doing versus the industry, over time?” Calibrate asks “how do these groups compare to each other, right now?” The quick rule: Benchmark for trends and board context, Calibrate for talent and resourcing decisions.

Both surface metrics already defined in What Pensero Measures. One naming note: Benchmark and Calibrate label new-feature investment “Innovation rate,” which is the same metric the glossary calls New stuff %.

Benchmark — you vs. the industry

Benchmark (Organization › Benchmark) plots your org against the anonymized median of all Pensero organizations, on a 0–100 percentile scale: 50th is the industry median, 75th means you’re ahead of three-quarters of organizations, 25th puts you in the bottom quarter. It shows 26 weeks of history per metric as a trend line, so you see direction as well as position. Higher is better for most metrics—but defect rate, cycle time, and knowledge gaps invert (lower ranks higher). The summary view lists all ten metrics with current percentiles; clicking any one jumps to its detail page.

It covers ten metrics: delivery per headcount, defect rate, AI-assisted code, collaboration, innovation rate (new stuff %), roadmap alignment, cycle time, capitalizable %, talent density, and knowledge gaps. The comparison is against all Pensero organizations regardless of size, industry, or stage—there’s no segment filter yet, which is worth remembering before reading too much into a single percentile.
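The percentile scale and its inversion can be illustrated with a short sketch. It assumes a percentile means "the share of peer organizations you outperform," ignores ties, and uses hypothetical metric keys and peer values; Pensero's exact ranking method is not described here.

```python
# Metrics where a lower raw value ranks higher, per the description above.
LOWER_IS_BETTER = {"defect_rate", "cycle_time", "knowledge_gaps"}


def benchmark_percentile(metric, value, peer_values):
    """Percentage of peer orgs that this value outperforms (0 to 100)."""
    if not peer_values:
        return None
    if metric in LOWER_IS_BETTER:
        beaten = sum(peer > value for peer in peer_values)
    else:
        beaten = sum(peer < value for peer in peer_values)
    return 100 * beaten / len(peer_values)


peers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical peer-org values
print(benchmark_percentile("delivery_per_hc", 7.5, peers))
print(benchmark_percentile("defect_rate", 2.5, peers))
```

A defect rate of 2.5 against these peers ranks high precisely because it is low, which is what the inversion encodes.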

Reading it: a percentile is a question, not a verdict. Below the median can be entirely appropriate—foundational investment that depresses short-term delivery, a ramping team, or a deliberate quality-over-speed posture. Use it to ask what’s driving the position.

Calibrate — groups side by side

Calibrate (Organization › Calibrate) is a matrix: metrics down the rows, comparison groups across the columns, each cell color-coded. Two columns are always present—Company (your whole org) and Industry (the Pensero median)—and you add up to ten custom columns for any mix of individuals, teams, saved cohorts, or filters (e.g. “backend developers hired in 2025”). It’s built for talent-calibration sessions and resourcing decisions.

The color coding is relative to both baselines at once, which is what makes it readable at a glance:

Cell color	Meaning
Dark green	Better than both company and industry—excelling; worth understanding what they do differently.
Light green	Better than company but below industry—strong internally, room to reach industry-leading.
Light red	Below company but above industry—your org is high-performing overall; this group lags it.
Dark red	Below both—investigate: blockers, priorities, or training need.
Gray	Not enough data to score (low activity, empty group, or metric not applicable).

Calibrate shows eleven metrics—the benchmark ten plus active headcount (shown for context, not color-coded), and AI user adoption in place of AI-assisted code. Add columns via “Add cohorts,” hover any cell to see the underlying value and highlight its row and column, and remove columns from their header. There’s a ten-column ceiling; to compare more groups, swap columns across sessions or compare cohorts instead of individuals. Note the matrix can’t be exported yet—screenshot or copy values to share.

Color is relative, so context still governs. A dark-red cell can be exactly right—a newly-formed team ramping, or a group deliberately on harder problems. Read it as “where should I look,” not “who’s failing.”
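The color rule reduces to two comparisons, one against each baseline. The sketch below follows the table above; how ties are handled and the `lower_is_better` flag for inverted metrics are assumptions, and `cell_color` is a hypothetical name.

```python
def cell_color(value, company, industry, lower_is_better=False):
    """Color a Calibrate cell relative to the company and industry baselines."""
    if value is None:
        return "gray"  # not enough data to score
    if lower_is_better:
        # Flip signs so that "greater" always means "better" below.
        value, company, industry = -value, -company, -industry
    beats_company = value > company
    beats_industry = value > industry
    if beats_company and beats_industry:
        return "dark green"
    if beats_company:
        return "light green"
    if beats_industry:
        return "light red"
    return "dark red"


print(cell_color(80, company=70, industry=60))
print(cell_color(65, company=60, industry=70))
print(cell_color(65, company=70, industry=60))
print(cell_color(50, company=70, industry=60))
print(cell_color(2.0, company=3.0, industry=4.0, lower_is_better=True))
```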

Which to use

If you need to…	Use
Track org trends over time	Benchmark
Compare to the industry	Benchmark (Calibrate has it as a reference column)
Compare teams or individuals side by side	Calibrate
Run a talent-calibration session	Calibrate
Build a board presentation	Benchmark
Make a resource-allocation decision	Calibrate
Do historical analysis	Benchmark

Applied Scenarios

This is the section to reach for when you’re facing an actual situation. Every scenario below makes the same point in a different way: no single metric tells you what’s going on. A number that looks alarming on its own usually becomes legible—or flips meaning entirely—once you read it alongside two or three others. Each case shows the misleading single-metric read first, then the fuller picture, then what to do.

The “low performer” who is actually the team’s glue

Situation: An engineer’s delivery is well below the team’s. On a delivery-only view they look like an underperformer, and a quarterly ranking would put them at the bottom.

The single-metric trap: Reading delivery alone, you’d coach them to “ship more”—or worse, flag them in a review.

The fuller picture: Pull their collaboration and quality signals next to delivery. Their collaboration ratio is high, PR review usefulness is among the team’s best, and knowledge gaps in the areas they touch are low. On the delivery-vs-reviews quadrant they sit in enabler—low delivery, high enablement. They’re the person unblocking everyone else, reviewing the hardest PRs, and spreading knowledge that keeps the bus-factor down. Their low personal delivery is the cost of lifting everyone else’s.

Action: Recognize the enablement explicitly—it’s real work the delivery number doesn’t capture. The risk here isn’t under-performance; it’s that this person is invisible to a metrics-naive review and may burn out or leave. If anything, check the inverse: is the team leaning on them so heavily that their own growth is stalling?

The high deliverer who is quietly a risk

Situation: An engineer tops the delivery chart, quarter after quarter. The obvious read is “star—promote and replicate.”

The single-metric trap: Delivery alone says star performer. But output is only one dimension.

The fuller picture: Their collaboration ratio is near zero, PR pairing rate is negligible, and the knowledge gaps metric shows they’re the sole contributor across a large share of the codebase. They’re a solo contributor on the quadrant: shipping fast, sharing nothing. If their defect or revert rate is also creeping up, the speed may be coming at quality’s expense. The high delivery is real—but it’s concentrating critical knowledge in one person and building succession risk.

Action: Don’t punish the output—redirect some of it. Pair them with others, route reviews through them so knowledge spreads, and make enablement an explicit expectation if they’re senior. The single-metric celebration would have deepened exactly the risk you most want to avoid.

The team working hard but not shipping

Situation: A team is putting in long hours but output is flat, and leadership is asking whether they’re productive enough.

The single-metric trap: Low delivery per head looks like a performance or staffing problem—the instinct is to push harder or question headcount.

The fuller picture: Efficiency tells a different story. Cycle time and time to merge are high, and the bottleneck progression points to review: fast time-to-comment but slow time-to-approve. The PR review ratio shows two people carrying nearly all reviews.

The team isn’t unproductive—its work is stuck in a review queue because review load is concentrated on a couple of engineers. Delivery is low because finished code is waiting, not because people aren’t working.

Action: Fix the process, not the people: spread reviewer load, set a review-response expectation, and watch cycle time and time-to-approve fall. Pushing the team to “work harder” would have made the real bottleneck worse.

A quality regression with a hidden cause

Situation: Customers are reporting more bugs and support is escalating. You need to know the scope and the cause quickly.

The single-metric trap: Defect rate is up—so the easy conclusion is “engineers got sloppy,” which invites blame and fixes nothing.

The fuller picture: Triangulate. Defect rate has jumped, but isolate it by team and it’s concentrated in one group. There, code coverage has dropped and PR review usefulness has fallen—reviews stopped catching issues. Cross-reference scope of work: that team’s KTLO % spiked and roadmap alignment dropped, meaning they were buried in reactive work. And the People view shows two senior engineers were out, leaving less-experienced members merging unreviewed code. The regression isn’t sloppiness—it’s a team overwhelmed by maintenance with its safety nets temporarily down.

Action: Treat the cause, not the symptom: pause feature pressure on that team, bring in review support while the seniors are out, restore a coverage expectation. The defect-rate number was the smoke; the cause only appeared when you read coverage, reviews, scope, and staffing together.

Defending (or questioning) the AI tool spend

Situation: Finance wants to know whether the AI coding tools are worth the spend.

The single-metric trap: Pointing only at high AI-assisted % proves usage, not value—and pointing only at delivery lift invites the fair objection that other things changed too.

The fuller picture: Read adoption, effectiveness, and quality together. User adoption and AI-assisted % show the tools are actually used, not shelfware. Tokens per delivery shows whether that usage is efficient. Crucially, check whether defect and revert rates held steady as AI usage rose—adoption with stable quality is the real signal that the tools help rather than just generate more code to clean up. Delivery lift is supporting evidence, treated as correlation, compared like-for-like.

Action: Bring the combination, not one headline number, and state the honest caveat on delivery lift. If usage is high but tokens-per-delivery is poor or quality slipped, that’s a coaching-and-optimization finding, not a cancel-the-tools finding. (Note: this guide deliberately avoids a single ROI multiplier—the metric combination is more honest and more durable than a manufactured payback figure.)

The through-line

In every case, the single metric pointed one way and the truth lay in the combination. Build the habit: when a number surprises you, don’t act on it—ask which two or three other metrics would confirm or overturn the obvious reading. Delivery next to collaboration. Defect rate next to coverage, reviews, scope, and staffing. AI usage next to quality. The metric tells you where to look; the combination tells you what’s true; and the conversation with the person tells you why.
