C3 · Shared Quality Framework

Journey Quality Metrics Framework

A shared evaluation framework for Commercial Journeys, covering the full Journey lifecycle: Recommendation Quality (pre-click) and Output Quality (post-click).

Version · Draft v0.5 Owner · Edge Commercial Journeys Scope · Recommendation Quality + Output Quality
24
Total Metrics
12 categories across Recommendation & Output Quality.
6
Pure Gate
Pass/Fail per Journey. Tolerance varies per metric.
5
Has Gate Threshold
Two severity levels, each with its own tolerance.
13
Pure Quality
Graded 1–5. Target = % of Journeys scoring ≥ N.

Eval Framework

Every metric is evaluated along two independent dimensions. Both must be defined for each metric.

Dimension 1 — Metric Type

How do we judge a single Journey on this metric?

TypeJudgmentMeaning
🔴 Pure Gate Pass / Fail Binary. The Journey either meets the bar or it doesn’t. No partial credit.
🟡 Has Gate Threshold Pass / Fail with two severity levels Same metric measures two kinds of failure: severe (gate) and mild (quality). Each level has its own tolerance.
🟢 Pure Quality 1–5 Score Graded on a spectrum. No removal — only quality improvement.

Dimension 2 — Tolerance

Across a batch of Journeys, how many failures do we allow?

ToleranceDefinitionWhen to use
Zero Tolerance 100% must pass. A single failure blocks release. Safety, compliance, privacy — any failure is a trust catastrophe.
Partial Tolerance ≤ X% may fail (or ≥ Y% must pass). Defined per metric. Most metrics — real-world signals are noisy.
Part 1 · Pre-Click

Recommendation Quality

Evaluates whether the system recommends the right Journeys, in the right order, with clear and honest presentation.

Level 1

Single-Journey Quality

4 categories, 13 metrics. Evaluate whether each individual Journey is compliant, safe, eligible, correctly understood, and clearly presented.

Evaluation order: Compliance gate → Should generate → Task understanding → Presentation & promise.
Safety & BoundaryEligibilityTask UnderstandingPresentation & Promise
1

Safety & Boundary

P0 Gate
1.1 Compliance Boundary Fit Does this Journey only use data permitted by the Commercial Journeys pipeline, current surface, and user permission scope? Pure Gate
Spec Details
Why It Matters

Using out-of-scope data (tenant boundary, retention, consent, permission) is a compliance violation that can expose Microsoft to legal liability.

Threshold
🔴 Gate Failure

Any Journey that references data outside the user’s permitted scope (wrong tenant, expired retention, no consent, higher permission tier).

Tolerance & Target

Zero Tolerance. 100% compliance rate. A single violation blocks release.

Failure Examples
  • GateJourney surfaces content from a shared mailbox the user does not have permission to access.
  • GateJourney uses email data beyond the consented retention window.
1.2 Sensitive Exposure Does the recommendation layer (title, summary, reason, source preview) expose sensitive information that should not appear on NTP/card? Pure Gate
Spec Details
Why It Matters

Exposing sensitive content on a visible card layer is a trust catastrophe and compliance incident.

Threshold
🔴 Gate Failure

Any instance where the card layer surfaces sensitive content (PII, health, financial, HR, legal, credentials).

Tolerance & Target

Zero Tolerance. 100% block rate on sensitive-tagged NEG test set.

Failure Examples
  • Gate“Continue researching cancer treatment options” — health browsing data exposed.
  • GateReason label: “Based on your salary review email” — compensation context exposed.
2

Eligibility / Should Generate

P0 Gate
2.1 Work Task Qualification Does this Journey correspond to a real commercial work task, rather than FYI, newsletter, system notification, or background noise? Pure Gate
Spec Details
Why It Matters

A Journey for non-actionable information has zero user value and trains the user to ignore the feature.

Threshold
🔴 Gate Failure

Journey is generated from a non-task signal: mass email, notification, FYI-only item, or background noise.

Tolerance & Target

Partial Tolerance. Non-task rate ≤ 2% of all generated Journeys.

Failure Examples
  • Gate“Review weekly IT security newsletter” — FYI email, not a work task.
  • Gate“Check system alert: password expiry reminder” — automated notification.
2.2 Active State Is the task still active — not completed, not cancelled, not closed, and not delegated? Pure Gate
Spec Details
Why It Matters

Showing a Journey for a completed/cancelled task signals the system is out of date and erodes trust.

Threshold
🔴 Gate Failure

Task has clear completion/cancellation/delegation signals yet Journey is still surfaced.

Tolerance & Target

Partial Tolerance. Stale-task rate ≤ 5%.

Failure Examples
  • Gate“Prepare deck for Monday standup” — meeting already happened 2 days ago.
  • Gate“Reply to vendor RFP” — user already sent the reply.
2.3 Meaningful Effort Threshold Does the task have sufficient cognitive or execution cost to warrant proactive recommendation — not single-click, one-line reply, or other trivial action? Pure Gate
Spec Details
Why It Matters

Recommending AI help for trivial tasks insults the user and erodes perceived value of the feature.

Threshold
🔴 Gate Failure

Task requires ≤ 1 step or ≤ 30 seconds to complete without AI. No synthesis, drafting, or research needed.

Tolerance & Target

Partial Tolerance. Trivial-task rate ≤ 5%.

Failure Examples
  • Gate“Open the Teams meeting link” — single click, no AI value.
  • Gate“Mark email as read” — trivial action.
2.4 AI Assistance Fit Can current AI capabilities actually help advance this task — avoiding recommendations for work AI cannot deliver on? Pure Gate
Spec Details
Why It Matters

Tasks requiring physical presence, emotional judgment, or actions AI cannot perform create false promises.

Threshold
🔴 Gate Failure

Task completion requires actions AI fundamentally cannot perform: physical action, real-time human interaction, or purely relational judgment.

Tolerance & Target

Partial Tolerance. AI-unfit rate ≤ 3%.

Failure Examples
  • Gate“Attend team offsite dinner at 7pm” — physical presence required.
  • Gate“Comfort upset team member about reorg” — purely emotional/relational.
3

Task Understanding

Mixed Gate + Quality
3.1 Task Accuracy Does this Journey accurately describe the task goal, target, deadline, stakeholder, and expected action? Gate Threshold
Spec Details
Why It Matters

A fabricated task wastes user time and destroys trust. A real task with minor errors is annoying but recoverable.

Threshold — Two Levels
🔴 Gate: Phantom Task

Journey describes a task that does not exist in the user’s actual work context.

🟢 Quality Scale (1–5): Description Accuracy

5 = all details (goal, deadline, stakeholder, action) perfectly accurate; 4 = one minor inaccuracy (e.g., off-by-one-day deadline); 3 = notable errors but task is recognizable; 2 = multiple major errors; 1 = barely resembles the real task.

Tolerance & Target

Gate level: Zero Tolerance. Phantom task rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.

Failure Examples
  • Gate“Prepare for 1:1 with Sarah” — no such meeting exists on calendar.
  • Quality“Submit budget report by Friday” — real task, but deadline is actually next Monday.
3.2 Groundedness Accuracy Are all core claims in this Journey correctly supported by source signals — with no incorrect citations, incorrect inferences, or unsupported claims? Gate Threshold
Spec Details
Why It Matters

Hallucinated details destroy user trust and can lead to embarrassing or incorrect actions.

Threshold — Two Levels
🔴 Gate: Full Hallucination

A core claim (person, event, deadline, document) is entirely fabricated with no source signal.

🟢 Quality Scale (1–5): Groundedness

5 = every claim precisely matches source; 4 = one minor paraphrasing drift; 3 = noticeable approximation gaps; 2 = multiple unsupported inferences; 1 = mostly ungrounded narrative.

Tolerance & Target

Gate level: Zero Tolerance. Full hallucination rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.

Failure Examples
  • GateCard mentions “meeting with David Chen” — no such person in user’s contacts.
  • QualityCard says “3 attachments” — source email actually has 2.
3.3 Task Granularity Is the task granularity appropriate — neither too broad nor too narrow? Pure Quality
Spec Details
Why It Matters

Too broad = user can’t act; too narrow = trivial sub-step that doesn’t warrant a Journey card.

Threshold
🟢 Quality Scale (1–5)

5 = perfect granularity; 4 = slightly too broad/narrow; 3 = noticeably off; 2 = significantly misscoped; 1 = unusable scope.

Tolerance & Target

Partial Tolerance. ≥ 80% of Journeys score ≥ 4.

Failure Examples
  • Quality“Manage Q3 product launch” — too broad, covers dozens of tasks.
  • Quality“Add comma to slide 3” — too narrow for a Journey.
3.4 Should-User-Act Confidence Considering ownership, assignment, role, delegation, and stakeholder expectation — should this task be driven by the current user, rather than being a CC recipient, FYI receiver, optional attendee, or already delegated work? Gate Threshold
Spec Details
Why It Matters

Recommending tasks the user isn’t responsible for wastes attention and signals poor understanding of role context.

Threshold — Two Levels
🔴 Gate: Wrong Owner

Task clearly belongs to someone else (user is CC, optional attendee, or task was explicitly delegated away).

🟢 Quality Scale (1–5): Ownership Confidence

5 = unambiguous ownership (direct assignee, sole recipient, explicit request); 4 = strong signals (primary on thread, named in action); 3 = reasonable but debatable; 2 = weak signals, likely wrong user; 1 = clearly someone else’s task.

Tolerance & Target

Gate level: Partial Tolerance. Wrong-owner rate ≤ 3%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.

Failure Examples
  • GateUser is CC on email thread but Journey says “Reply to client request” — sender was asking someone else.
  • QualityGroup email with no clear owner; system picks this user but could be anyone on the thread.
4

Presentation & Promise

Mixed Gate + Quality
4.1 Card Clarity Can the user immediately understand what this Journey is and what they get after clicking — from the title, summary, and CTA alone? Pure Quality
Spec Details
Why It Matters

If the user can’t understand the card in 3 seconds, they skip it.

Threshold
🟢 Quality Scale (1–5)

5 = instantly clear; 4 = clear with brief thought; 3 = requires re-reading; 2 = confusing; 1 = incomprehensible.

Tolerance & Target

Partial Tolerance. ≥ 85% of Journeys score ≥ 4.

Failure Examples
  • Quality“Follow up on the thing discussed” — what thing? With whom?
  • Quality“Action needed re: Q3” — too vague to act on.
4.2 Reason Label Accuracy Is the “Why now” accurate — e.g., is due soon, requested by stakeholder, or before upcoming meeting supported by real evidence? Gate Threshold
Spec Details
Why It Matters

Fabricated urgency signals destroy trust faster than missing signals. Users rely on reason labels to decide priority.

Threshold — Two Levels
🔴 Gate: Fabricated Reason

Reason label claims an urgency/trigger that has no basis in source data.

🟢 Quality Scale (1–5): Reason Precision

5 = reason label precisely matches evidence (correct trigger, correct timing); 4 = directionally correct with minor imprecision; 3 = loosely supported; 2 = misleading framing of real signal; 1 = reason contradicts source data.

Tolerance & Target

Gate level: Zero Tolerance. Fabricated reason rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.

Failure Examples
  • Gate“Due tomorrow” — no deadline exists in any source signal.
  • Quality“Before your 2pm meeting” — meeting is actually at 4pm.
4.3 Promise Feasibility Can the result or capability promised on the card actually be delivered by the subsequent Ready-to-use Output — with no over-promising? Gate Threshold
Spec Details
Why It Matters

Over-promising and under-delivering is the fastest way to kill repeat usage.

Threshold — Two Levels
🔴 Gate: Impossible Promise

Card promises something the system fundamentally cannot deliver (e.g., write access it doesn’t have).

🟢 Quality Scale (1–5): Promise Calibration

5 = output matches or exceeds card promise; 4 = slight under-delivery on one aspect; 3 = noticeable gap between promise and output; 2 = significant over-promise; 1 = card promise is completely unmet despite being technically possible.

Tolerance & Target

Gate level: Zero Tolerance. Impossible promise rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.

Failure Examples
  • Gate“I’ll schedule the meeting for you” — system has no calendar write access.
  • Quality“Full competitive analysis” — output is 3 bullet points from one source.
Level 2

Slate-Level Quality

5 categories, 5 metrics. Evaluate the set of Journeys presented together as a slate — ranking, coverage, diversity, and deduplication.

5

Coverage

Quality
5.1 Important Journey Miss Rate Among high-value tasks that should be recommended in the user’s current work context (clearly active, user should drive, AI can help) — how many are ultimately missing from the visible slate? Loss scenarios include not generated, incorrectly filtered, incorrectly merged, or ranked too low to appear. Measures end-to-end “are expected Journeys missing?” Pure Quality
Spec Details
Why It Matters

Missing important tasks is the most damaging failure for a proactive assistant — user loses trust that the system has their back.

Threshold
🟢 Quality Scale (1–5)

5 = all important tasks covered; 4 = one minor miss; 3 = notable gaps; 2 = major tasks missing; 1 = slate misses most important work.

Tolerance & Target

Partial Tolerance. ≥ 80% of slates score ≥ 4 on coverage.

Failure Examples
  • QualityUser has a VP-requested deliverable due today — not in slate because ranking pushed it below fold.
  • QualityCritical email reply was incorrectly merged into another Journey and lost its identity.
6

Global Prioritization

Quality
5.2 Global Ranking Quality Is the overall ranking of all generated Journeys reasonable — aligned with deadline, stakeholder importance, ownership strength, business impact, recency, and other priority signals? Pure Quality
Spec Details
Why It Matters

Users look at the top few items first. Poor ranking means the most important tasks are buried.

Threshold
🟢 Quality Scale (1–5)

5 = perfect priority order; 4 = minor swap needed; 3 = noticeably wrong order; 2 = important items buried; 1 = random/inverse order.

Tolerance & Target

Partial Tolerance. ≥ 80% of slates score ≥ 4.

Failure Examples
  • QualityCEO request ranked #5 while newsletter-derived task ranked #1.
  • QualityTask due in 1 hour ranked below task due next week.
7

Top-N Quality

Quality
5.3 Top-3 Importance Fit Are the NTP Top 3 truly the most important, most urgent, and most worthwhile Journeys for the user to handle right now? Pure Quality
Spec Details
Why It Matters

Top 3 is the “hero zone” — most users only engage with the first few items. Getting these wrong is the highest-impact ranking failure.

Threshold
🟢 Quality Scale (1–5)

5 = all 3 are the right picks; 4 = 2 of 3 correct; 3 = 1 of 3 correct; 2 = none correct but relevant; 1 = irrelevant items in top 3.

Tolerance & Target

Partial Tolerance. ≥ 75% of slates score ≥ 4.

Failure Examples
  • QualityTop 3 contains a low-priority FYI task while a deadline-today task is at position #5.
  • QualityAll top 3 are from same email thread; urgent cross-team request is buried.
8

Portfolio Balance

Quality
5.4 Useful Diversity Without sacrificing value or priority, does the slate cover sufficiently diverse high-value tasks — rather than being dominated by a single trigger, topic, or type? Pure Quality
Spec Details
Why It Matters

A slate dominated by one trigger (e.g., 5 Journeys from same email) feels broken and misses other important work.

Threshold
🟢 Quality Scale (1–5)

5 = well-balanced coverage; 4 = slightly concentrated; 3 = noticeably dominated by one source; 2 = heavily skewed; 1 = all from single trigger.

Tolerance & Target

Partial Tolerance. ≥ 80% of slates score ≥ 4.

Failure Examples
  • Quality5 of 7 Journeys all derived from the same meeting invite thread.
  • QualityAll Journeys are “email reply” type; no meeting prep or document tasks shown.
9

Set Hygiene

Mixed Gate + Quality
5.5 Duplicate / Split / Merge Quality Does the slate contain duplicate Journeys, a single task split into multiple Journeys, or different tasks incorrectly merged into one Journey? Gate Threshold
Spec Details
Why It Matters

Duplicates waste slots and feel broken. Bad splits confuse; bad merges lose task identity.

Threshold — Two Levels
🔴 Gate: Exact Duplicate

Two Journeys in the same slate describe the exact same task (same action, same object, same context).

🟢 Quality Scale (1–5): Boundary Correctness

5 = every Journey maps to exactly one distinct task, no fragmentation or merging; 4 = one borderline split/merge case; 3 = noticeable boundary issues (2+ cases); 2 = significant fragmentation or loss from merging; 1 = slate is riddled with split/merge problems.

Tolerance & Target

Gate level: Zero Tolerance. Exact duplicate rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of slates score ≥ 4.

Failure Examples
  • GateTwo cards both say “Reply to Sarah’s budget question” from same email.
  • Quality“Prepare meeting agenda” and “Add topics to Monday standup” are actually the same task split into two.
Part 2 · Post-Click

Output Quality

Evaluates whether the AI output delivered after the user clicks a Journey card fulfills the promise, is correct, and is useful. 3 categories, 6 metrics.

10

Promise-Delivery Fit

Mixed
6.1 Promise Fulfillment Does the output deliver the content and assistance goal promised by the card? Gate Threshold
Spec Details
Why It Matters

The card sets an expectation. If the output doesn’t match, user feels deceived regardless of output quality.

Threshold — Two Levels
🔴 Gate: Complete Mismatch

Output is about a different topic or task than what was promised on the card.

🟢 Quality Scale (1–5): Delivery Completeness

5 = output fully delivers everything the card promised; 4 = one minor element missing; 3 = right topic but notable gaps vs. promise; 2 = significant under-delivery; 1 = barely related to promise.

Tolerance & Target

Gate level: Zero Tolerance. Complete mismatch rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.

Failure Examples
  • GateCard: “Draft reply to vendor proposal.” Output: generic meeting prep notes.
  • QualityCard: “Summarize all action items from standup.” Output covers only 2 of 5 items.
11

Output Correctness

Mixed
6.2 Factual Accuracy Are the facts in the output correctly supported by source signals, with no errors or hallucination? Gate Threshold
Spec Details
Why It Matters

Hallucinated facts in outputs can lead to incorrect actions with real business consequences.

Threshold — Two Levels
🔴 Gate: Fabricated Fact

Output contains a factual claim (name, date, number, decision) with no basis in source data.

🟢 Quality Scale (1–5): Factual Precision

5 = every fact precisely matches source; 4 = one minor imprecision (rounded number, approximate time); 3 = noticeable inaccuracies but gist correct; 2 = multiple factual errors; 1 = output is largely inaccurate.

Tolerance & Target

Gate level: Zero Tolerance. Fabricated fact rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.

Failure Examples
  • GateOutput claims “Budget approved at $500K” — no such approval in source emails.
  • QualityOutput says “meeting at 2pm” — actually 2:30pm.
6.3 Completeness Does the output cover all key threads and context relevant to the task, with no important information omitted? Pure Quality
Spec Details
Why It Matters

Incomplete output forces the user to find and fill gaps, reducing time savings.

Threshold
🟢 Quality Scale (1–5)

5 = all key information covered; 4 = one minor gap; 3 = notable gaps; 2 = major omissions; 1 = barely started.

Tolerance & Target

Partial Tolerance. ≥ 75% of Journeys score ≥ 4.

Failure Examples
  • Quality“Summarize project status” but output only covers timeline, not blockers or risks.
  • QualityMeeting prep misses half the agenda topics from the invite.
12

Usefulness

Quality
6.4 Scenario Fit Does the output format and assistance type match the task scenario — e.g., meeting prep → agenda + talking points; email reply → draft; status update → structured brief? Pure Quality
Spec Details
Why It Matters

Wrong format adds conversion work. A draft email should look like an email; a meeting prep should be structured talking points.

Threshold
🟢 Quality Scale (1–5)

5 = perfect scenario match; 4 = acceptable format; 3 = workable but not ideal; 2 = awkward format; 1 = completely wrong format.

Tolerance & Target

Partial Tolerance. ≥ 85% of Journeys score ≥ 4.

Failure Examples
  • QualityTask is “draft reply email” but output is bullet-point analysis, not email format.
  • QualityTask is “compare 3 options” but output is paragraph prose, not comparison table.
6.5 Ready-to-Use Quality Does the output format, structure, language, and length meet ready-to-use standards — usable directly or with only minor edits? Pure Quality
Spec Details
Why It Matters

Output that requires significant rework defeats the purpose of proactive AI assistance.

Threshold
🟢 Quality Scale (1–5)

5 = directly usable as-is; 4 = minor edits needed; 3 = moderate rework; 2 = heavy rework; 1 = start over.

Tolerance & Target

Partial Tolerance. ≥ 75% of Journeys score ≥ 4.

Failure Examples
  • QualityDraft email is so generic it needs complete rewriting to send.
  • QualityMeeting prep has wrong tone — too casual for executive audience.
6.6 Task Advancement Can the user meaningfully advance the task solely with this output — with concrete actionable next steps, without needing to search back or start over? Pure Quality
Spec Details
Why It Matters

The ultimate measure of output value: did it actually move the user forward on their task?

Threshold
🟢 Quality Scale (1–5)

5 = task meaningfully advanced, clear next step; 4 = mostly advanced, minor gap; 3 = some progress; 2 = marginal help; 1 = no advancement, user still at square one.

Tolerance & Target

Partial Tolerance. ≥ 70% of Journeys score ≥ 4.

Failure Examples
  • QualityOutput is a single sentence the user could have written faster themselves.
  • QualitySummary is so generic the user still needs to re-read the full source material.