24
Total Metrics
12 categories across Recommendation & Output Quality.
6
Pure Gate
Pass/Fail per Journey. Tolerance varies per metric.
5
Has Gate Threshold
Two severity levels, each with its own tolerance.
13
Pure Quality
Graded 1–5. Target = % of Journeys scoring ≥ N.
Eval Framework
Every metric is evaluated along two independent dimensions. Both must be defined for each metric.
Dimension 1 — Metric Type
How do we judge a single Journey on this metric?
| Type | Judgment | Meaning |
| 🔴 Pure Gate |
Pass / Fail |
Binary. The Journey either meets the bar or it doesn’t. No partial credit. |
| 🟡 Has Gate Threshold |
Pass / Fail with two severity levels |
Same metric measures two kinds of failure: severe (gate) and mild (quality). Each level has its own tolerance. |
| 🟢 Pure Quality |
1–5 Score |
Graded on a spectrum. No removal — only quality improvement. |
Dimension 2 — Tolerance
Across a batch of Journeys, how many failures do we allow?
| Tolerance | Definition | When to use |
| Zero Tolerance |
100% must pass. A single failure blocks release. |
Safety, compliance, privacy — any failure is a trust catastrophe. |
| Partial Tolerance |
≤ X% may fail (or ≥ Y% must pass). Defined per metric. |
Most metrics — real-world signals are noisy. |
Part 1 · Pre-Click
Recommendation Quality
Evaluates whether the system recommends the right Journeys, in the right order, with clear and honest presentation.
4 categories, 13 metrics. Evaluate whether each individual Journey is compliant, safe, eligible, correctly understood, and clearly presented.
Evaluation order: Compliance gate → Should generate → Task understanding → Presentation & promise.
Safety & Boundary→Eligibility→Task Understanding→Presentation & Promise
1
Safety & Boundary
P0 Gate
1.1
Compliance Boundary Fit
Does this Journey only use data permitted by the Commercial Journeys pipeline, current surface, and user permission scope?
Pure Gate
Spec Details
Why It Matters
Using out-of-scope data (tenant boundary, retention, consent, permission) is a compliance violation that can expose Microsoft to legal liability.
Threshold
🔴 Gate Failure
Any Journey that references data outside the user’s permitted scope (wrong tenant, expired retention, no consent, higher permission tier).
Tolerance & Target
Zero Tolerance. 100% compliance rate. A single violation blocks release.
Failure Examples
- GateJourney surfaces content from a shared mailbox the user does not have permission to access.
- GateJourney uses email data beyond the consented retention window.
1.2
Sensitive Exposure
Does the recommendation layer (title, summary, reason, source preview) expose sensitive information that should not appear on NTP/card?
Pure Gate
Spec Details
Why It Matters
Exposing sensitive content on a visible card layer is a trust catastrophe and compliance incident.
Threshold
🔴 Gate Failure
Any instance where the card layer surfaces sensitive content (PII, health, financial, HR, legal, credentials).
Tolerance & Target
Zero Tolerance. 100% block rate on sensitive-tagged NEG test set.
Failure Examples
- Gate“Continue researching cancer treatment options” — health browsing data exposed.
- GateReason label: “Based on your salary review email” — compensation context exposed.
2
Eligibility / Should Generate
P0 Gate
2.1
Work Task Qualification
Does this Journey correspond to a real commercial work task, rather than FYI, newsletter, system notification, or background noise?
Pure Gate
Spec Details
Why It Matters
A Journey for non-actionable information has zero user value and trains the user to ignore the feature.
Threshold
🔴 Gate Failure
Journey is generated from a non-task signal: mass email, notification, FYI-only item, or background noise.
Tolerance & Target
Partial Tolerance. Non-task rate ≤ 2% of all generated Journeys.
Failure Examples
- Gate“Review weekly IT security newsletter” — FYI email, not a work task.
- Gate“Check system alert: password expiry reminder” — automated notification.
2.2
Active State
Is the task still active — not completed, not cancelled, not closed, and not delegated?
Pure Gate
Spec Details
Why It Matters
Showing a Journey for a completed/cancelled task signals the system is out of date and erodes trust.
Threshold
🔴 Gate Failure
Task has clear completion/cancellation/delegation signals yet Journey is still surfaced.
Tolerance & Target
Partial Tolerance. Stale-task rate ≤ 5%.
Failure Examples
- Gate“Prepare deck for Monday standup” — meeting already happened 2 days ago.
- Gate“Reply to vendor RFP” — user already sent the reply.
2.3
Meaningful Effort Threshold
Does the task have sufficient cognitive or execution cost to warrant proactive recommendation — not single-click, one-line reply, or other trivial action?
Pure Gate
Spec Details
Why It Matters
Recommending AI help for trivial tasks insults the user and erodes perceived value of the feature.
Threshold
🔴 Gate Failure
Task requires ≤ 1 step or ≤ 30 seconds to complete without AI. No synthesis, drafting, or research needed.
Tolerance & Target
Partial Tolerance. Trivial-task rate ≤ 5%.
Failure Examples
- Gate“Open the Teams meeting link” — single click, no AI value.
- Gate“Mark email as read” — trivial action.
2.4
AI Assistance Fit
Can current AI capabilities actually help advance this task — avoiding recommendations for work AI cannot deliver on?
Pure Gate
Spec Details
Why It Matters
Tasks requiring physical presence, emotional judgment, or actions AI cannot perform create false promises.
Threshold
🔴 Gate Failure
Task completion requires actions AI fundamentally cannot perform: physical action, real-time human interaction, or purely relational judgment.
Tolerance & Target
Partial Tolerance. AI-unfit rate ≤ 3%.
Failure Examples
- Gate“Attend team offsite dinner at 7pm” — physical presence required.
- Gate“Comfort upset team member about reorg” — purely emotional/relational.
3
Task Understanding
Mixed Gate + Quality
3.1
Task Accuracy
Does this Journey accurately describe the task goal, target, deadline, stakeholder, and expected action?
Gate Threshold
Spec Details
Why It Matters
A fabricated task wastes user time and destroys trust. A real task with minor errors is annoying but recoverable.
Threshold — Two Levels
🔴 Gate: Phantom Task
Journey describes a task that does not exist in the user’s actual work context.
🟢 Quality Scale (1–5): Description Accuracy
5 = all details (goal, deadline, stakeholder, action) perfectly accurate; 4 = one minor inaccuracy (e.g., off-by-one-day deadline); 3 = notable errors but task is recognizable; 2 = multiple major errors; 1 = barely resembles the real task.
Tolerance & Target
Gate level: Zero Tolerance. Phantom task rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Failure Examples
- Gate“Prepare for 1:1 with Sarah” — no such meeting exists on calendar.
- Quality“Submit budget report by Friday” — real task, but deadline is actually next Monday.
3.2
Groundedness Accuracy
Are all core claims in this Journey correctly supported by source signals — with no incorrect citations, incorrect inferences, or unsupported claims?
Gate Threshold
Spec Details
Why It Matters
Hallucinated details destroy user trust and can lead to embarrassing or incorrect actions.
Threshold — Two Levels
🔴 Gate: Full Hallucination
A core claim (person, event, deadline, document) is entirely fabricated with no source signal.
🟢 Quality Scale (1–5): Groundedness
5 = every claim precisely matches source; 4 = one minor paraphrasing drift; 3 = noticeable approximation gaps; 2 = multiple unsupported inferences; 1 = mostly ungrounded narrative.
Tolerance & Target
Gate level: Zero Tolerance. Full hallucination rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Failure Examples
- GateCard mentions “meeting with David Chen” — no such person in user’s contacts.
- QualityCard says “3 attachments” — source email actually has 2.
3.3
Task Granularity
Is the task granularity appropriate — neither too broad nor too narrow?
Pure Quality
Spec Details
Why It Matters
Too broad = user can’t act; too narrow = trivial sub-step that doesn’t warrant a Journey card.
Threshold
🟢 Quality Scale (1–5)
5 = perfect granularity; 4 = slightly too broad/narrow; 3 = noticeably off; 2 = significantly misscoped; 1 = unusable scope.
Tolerance & Target
Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Failure Examples
- Quality“Manage Q3 product launch” — too broad, covers dozens of tasks.
- Quality“Add comma to slide 3” — too narrow for a Journey.
3.4
Should-User-Act Confidence
Considering ownership, assignment, role, delegation, and stakeholder expectation — should this task be driven by the current user, rather than being a CC recipient, FYI receiver, optional attendee, or already delegated work?
Gate Threshold
Spec Details
Why It Matters
Recommending tasks the user isn’t responsible for wastes attention and signals poor understanding of role context.
Threshold — Two Levels
🔴 Gate: Wrong Owner
Task clearly belongs to someone else (user is CC, optional attendee, or task was explicitly delegated away).
🟢 Quality Scale (1–5): Ownership Confidence
5 = unambiguous ownership (direct assignee, sole recipient, explicit request); 4 = strong signals (primary on thread, named in action); 3 = reasonable but debatable; 2 = weak signals, likely wrong user; 1 = clearly someone else’s task.
Tolerance & Target
Gate level: Partial Tolerance. Wrong-owner rate ≤ 3%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Failure Examples
- GateUser is CC on email thread but Journey says “Reply to client request” — sender was asking someone else.
- QualityGroup email with no clear owner; system picks this user but could be anyone on the thread.
4
Presentation & Promise
Mixed Gate + Quality
4.1
Card Clarity
Can the user immediately understand what this Journey is and what they get after clicking — from the title, summary, and CTA alone?
Pure Quality
Spec Details
Why It Matters
If the user can’t understand the card in 3 seconds, they skip it.
Threshold
🟢 Quality Scale (1–5)
5 = instantly clear; 4 = clear with brief thought; 3 = requires re-reading; 2 = confusing; 1 = incomprehensible.
Tolerance & Target
Partial Tolerance. ≥ 85% of Journeys score ≥ 4.
Failure Examples
- Quality“Follow up on the thing discussed” — what thing? With whom?
- Quality“Action needed re: Q3” — too vague to act on.
4.2
Reason Label Accuracy
Is the “Why now” accurate — e.g., is due soon, requested by stakeholder, or before upcoming meeting supported by real evidence?
Gate Threshold
Spec Details
Why It Matters
Fabricated urgency signals destroy trust faster than missing signals. Users rely on reason labels to decide priority.
Threshold — Two Levels
🔴 Gate: Fabricated Reason
Reason label claims an urgency/trigger that has no basis in source data.
🟢 Quality Scale (1–5): Reason Precision
5 = reason label precisely matches evidence (correct trigger, correct timing); 4 = directionally correct with minor imprecision; 3 = loosely supported; 2 = misleading framing of real signal; 1 = reason contradicts source data.
Tolerance & Target
Gate level: Zero Tolerance. Fabricated reason rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Failure Examples
- Gate“Due tomorrow” — no deadline exists in any source signal.
- Quality“Before your 2pm meeting” — meeting is actually at 4pm.
4.3
Promise Feasibility
Can the result or capability promised on the card actually be delivered by the subsequent Ready-to-use Output — with no over-promising?
Gate Threshold
Spec Details
Why It Matters
Over-promising and under-delivering is the fastest way to kill repeat usage.
Threshold — Two Levels
🔴 Gate: Impossible Promise
Card promises something the system fundamentally cannot deliver (e.g., write access it doesn’t have).
🟢 Quality Scale (1–5): Promise Calibration
5 = output matches or exceeds card promise; 4 = slight under-delivery on one aspect; 3 = noticeable gap between promise and output; 2 = significant over-promise; 1 = card promise is completely unmet despite being technically possible.
Tolerance & Target
Gate level: Zero Tolerance. Impossible promise rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Failure Examples
- Gate“I’ll schedule the meeting for you” — system has no calendar write access.
- Quality“Full competitive analysis” — output is 3 bullet points from one source.
5 categories, 5 metrics. Evaluate the set of Journeys presented together as a slate — ranking, coverage, diversity, and deduplication.
5.1
Important Journey Miss Rate
Among high-value tasks that should be recommended in the user’s current work context (clearly active, user should drive, AI can help) — how many are ultimately missing from the visible slate? Loss scenarios include not generated, incorrectly filtered, incorrectly merged, or ranked too low to appear. Measures end-to-end “are expected Journeys missing?”
Pure Quality
Spec Details
Why It Matters
Missing important tasks is the most damaging failure for a proactive assistant — user loses trust that the system has their back.
Threshold
🟢 Quality Scale (1–5)
5 = all important tasks covered; 4 = one minor miss; 3 = notable gaps; 2 = major tasks missing; 1 = slate misses most important work.
Tolerance & Target
Partial Tolerance. ≥ 80% of slates score ≥ 4 on coverage.
Failure Examples
- QualityUser has a VP-requested deliverable due today — not in slate because ranking pushed it below fold.
- QualityCritical email reply was incorrectly merged into another Journey and lost its identity.
6
Global Prioritization
Quality
5.2
Global Ranking Quality
Is the overall ranking of all generated Journeys reasonable — aligned with deadline, stakeholder importance, ownership strength, business impact, recency, and other priority signals?
Pure Quality
Spec Details
Why It Matters
Users look at the top few items first. Poor ranking means the most important tasks are buried.
Threshold
🟢 Quality Scale (1–5)
5 = perfect priority order; 4 = minor swap needed; 3 = noticeably wrong order; 2 = important items buried; 1 = random/inverse order.
Tolerance & Target
Partial Tolerance. ≥ 80% of slates score ≥ 4.
Failure Examples
- QualityCEO request ranked #5 while newsletter-derived task ranked #1.
- QualityTask due in 1 hour ranked below task due next week.
5.3
Top-3 Importance Fit
Are the NTP Top 3 truly the most important, most urgent, and most worthwhile Journeys for the user to handle right now?
Pure Quality
Spec Details
Why It Matters
Top 3 is the “hero zone” — most users only engage with the first few items. Getting these wrong is the highest-impact ranking failure.
Threshold
🟢 Quality Scale (1–5)
5 = all 3 are the right picks; 4 = 2 of 3 correct; 3 = 1 of 3 correct; 2 = none correct but relevant; 1 = irrelevant items in top 3.
Tolerance & Target
Partial Tolerance. ≥ 75% of slates score ≥ 4.
Failure Examples
- QualityTop 3 contains a low-priority FYI task while a deadline-today task is at position #5.
- QualityAll top 3 are from same email thread; urgent cross-team request is buried.
8
Portfolio Balance
Quality
5.4
Useful Diversity
Without sacrificing value or priority, does the slate cover sufficiently diverse high-value tasks — rather than being dominated by a single trigger, topic, or type?
Pure Quality
Spec Details
Why It Matters
A slate dominated by one trigger (e.g., 5 Journeys from same email) feels broken and misses other important work.
Threshold
🟢 Quality Scale (1–5)
5 = well-balanced coverage; 4 = slightly concentrated; 3 = noticeably dominated by one source; 2 = heavily skewed; 1 = all from single trigger.
Tolerance & Target
Partial Tolerance. ≥ 80% of slates score ≥ 4.
Failure Examples
- Quality5 of 7 Journeys all derived from the same meeting invite thread.
- QualityAll Journeys are “email reply” type; no meeting prep or document tasks shown.
9
Set Hygiene
Mixed Gate + Quality
5.5
Duplicate / Split / Merge Quality
Does the slate contain duplicate Journeys, a single task split into multiple Journeys, or different tasks incorrectly merged into one Journey?
Gate Threshold
Spec Details
Why It Matters
Duplicates waste slots and feel broken. Bad splits confuse; bad merges lose task identity.
Threshold — Two Levels
🔴 Gate: Exact Duplicate
Two Journeys in the same slate describe the exact same task (same action, same object, same context).
🟢 Quality Scale (1–5): Boundary Correctness
5 = every Journey maps to exactly one distinct task, no fragmentation or merging; 4 = one borderline split/merge case; 3 = noticeable boundary issues (2+ cases); 2 = significant fragmentation or loss from merging; 1 = slate is riddled with split/merge problems.
Tolerance & Target
Gate level: Zero Tolerance. Exact duplicate rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of slates score ≥ 4.
Failure Examples
- GateTwo cards both say “Reply to Sarah’s budget question” from same email.
- Quality“Prepare meeting agenda” and “Add topics to Monday standup” are actually the same task split into two.
Part 2 · Post-Click
Output Quality
Evaluates whether the AI output delivered after the user clicks a Journey card fulfills the promise, is correct, and is useful. 3 categories, 6 metrics.
10
Promise-Delivery Fit
Mixed
6.1
Promise Fulfillment
Does the output deliver the content and assistance goal promised by the card?
Gate Threshold
Spec Details
Why It Matters
The card sets an expectation. If the output doesn’t match, user feels deceived regardless of output quality.
Threshold — Two Levels
🔴 Gate: Complete Mismatch
Output is about a different topic or task than what was promised on the card.
🟢 Quality Scale (1–5): Delivery Completeness
5 = output fully delivers everything the card promised; 4 = one minor element missing; 3 = right topic but notable gaps vs. promise; 2 = significant under-delivery; 1 = barely related to promise.
Tolerance & Target
Gate level: Zero Tolerance. Complete mismatch rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Failure Examples
- GateCard: “Draft reply to vendor proposal.” Output: generic meeting prep notes.
- QualityCard: “Summarize all action items from standup.” Output covers only 2 of 5 items.
11
Output Correctness
Mixed
6.2
Factual Accuracy
Are the facts in the output correctly supported by source signals, with no errors or hallucination?
Gate Threshold
Spec Details
Why It Matters
Hallucinated facts in outputs can lead to incorrect actions with real business consequences.
Threshold — Two Levels
🔴 Gate: Fabricated Fact
Output contains a factual claim (name, date, number, decision) with no basis in source data.
🟢 Quality Scale (1–5): Factual Precision
5 = every fact precisely matches source; 4 = one minor imprecision (rounded number, approximate time); 3 = noticeable inaccuracies but gist correct; 2 = multiple factual errors; 1 = output is largely inaccurate.
Tolerance & Target
Gate level: Zero Tolerance. Fabricated fact rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Failure Examples
- GateOutput claims “Budget approved at $500K” — no such approval in source emails.
- QualityOutput says “meeting at 2pm” — actually 2:30pm.
6.3
Completeness
Does the output cover all key threads and context relevant to the task, with no important information omitted?
Pure Quality
Spec Details
Why It Matters
Incomplete output forces the user to find and fill gaps, reducing time savings.
Threshold
🟢 Quality Scale (1–5)
5 = all key information covered; 4 = one minor gap; 3 = notable gaps; 2 = major omissions; 1 = barely started.
Tolerance & Target
Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Failure Examples
- Quality“Summarize project status” but output only covers timeline, not blockers or risks.
- QualityMeeting prep misses half the agenda topics from the invite.
6.4
Scenario Fit
Does the output format and assistance type match the task scenario — e.g., meeting prep → agenda + talking points; email reply → draft; status update → structured brief?
Pure Quality
Spec Details
Why It Matters
Wrong format adds conversion work. A draft email should look like an email; a meeting prep should be structured talking points.
Threshold
🟢 Quality Scale (1–5)
5 = perfect scenario match; 4 = acceptable format; 3 = workable but not ideal; 2 = awkward format; 1 = completely wrong format.
Tolerance & Target
Partial Tolerance. ≥ 85% of Journeys score ≥ 4.
Failure Examples
- QualityTask is “draft reply email” but output is bullet-point analysis, not email format.
- QualityTask is “compare 3 options” but output is paragraph prose, not comparison table.
6.5
Ready-to-Use Quality
Does the output format, structure, language, and length meet ready-to-use standards — usable directly or with only minor edits?
Pure Quality
Spec Details
Why It Matters
Output that requires significant rework defeats the purpose of proactive AI assistance.
Threshold
🟢 Quality Scale (1–5)
5 = directly usable as-is; 4 = minor edits needed; 3 = moderate rework; 2 = heavy rework; 1 = start over.
Tolerance & Target
Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Failure Examples
- QualityDraft email is so generic it needs complete rewriting to send.
- QualityMeeting prep has wrong tone — too casual for executive audience.
6.6
Task Advancement
Can the user meaningfully advance the task solely with this output — with concrete actionable next steps, without needing to search back or start over?
Pure Quality
Spec Details
Why It Matters
The ultimate measure of output value: did it actually move the user forward on their task?
Threshold
🟢 Quality Scale (1–5)
5 = task meaningfully advanced, clear next step; 4 = mostly advanced, minor gap; 3 = some progress; 2 = marginal help; 1 = no advancement, user still at square one.
Tolerance & Target
Partial Tolerance. ≥ 70% of Journeys score ≥ 4.
Failure Examples
- QualityOutput is a single sentence the user could have written faster themselves.
- QualitySummary is so generic the user still needs to re-read the full source material.