---
owner: owen@duel.tech
---

# Context Capsule Turing Test
## A Semantic Evaluation Framework for Capsule Quality, v1.1

> Core question: could a capable reader, human or AI, act correctly on this domain using only this document, with the fewest possible data hops, without asking a clarifying question, and without breaching Duel's AI governance contract?

If the answer is yes, the capsule passes. If the reader would need to reach for another source, open a Slack thread, make an assumption, query data the brand has not consented to, or carry out a use case they do not have rights to perform, the capsule fails regardless of how well it is formatted.

This framework defines the tests, the scoring system and the gate thresholds that determine whether a context capsule is production-ready inside a Duel governance posture. The framework is served by `governance.aida.duel.tech` alongside `AI_GOVERNANCE_CONTEXT.md`; the classifier reads both at evaluation time.

---

## What changed from v1.0

v1.0 measured whether a capsule was self-contained, coherent, bounded and actionable. That is necessary but no longer sufficient. The classifier now builds semantic graphs over the corpus, and four operational gaps surfaced:

1. **Data origin is under-specified.** Capsules describe a concept but do not say what data backs it, where the data lives, how fresh it is and what shape it takes when queried.
2. **Agent orchestration is missing.** Capsules describe the concept but not which service an AI agent should call to reach it, with what call signature, in what sequence, and which calls to avoid.
3. **Usage intent and access rights are not declared.** Without a stated Capability Layer and Tier eligibility the data RBAC layer has nothing to bind to and agents default to "everything is allowed".
4. **No ISO 42001 surface.** Capsules describe domains, not use cases. Without enumerated use cases there is no risk treatment and no compliance trail. A-risk use cases (Agentic, Artificial, or High in `§ Risk Categories`) need stronger treatment than Minimal or Limited ones; v1.0 had no way of expressing that.

v1.1 closes those gaps. It splits the old "data" concern into two distinct dimensions (D8 Data Origin and D11 Agent Orchestration), adds dimensions for usage intent (D9) and use-case risk treatment (D10) plus a sub-dimension for A-risk treatment quality (D10A), and introduces a hard Governance Gate. It also introduces a late-binding principle: the framework does not hard-code Duel's capability, tier or risk vocabulary. Those values live in named sections of `AI_GOVERNANCE_CONTEXT.md`, which is served by `governance.aida.duel.tech`, and the classifier reads them from the service at evaluation time. Governance can evolve without bumping the framework.

The framework does not change the capsule format. The new dimensions are satisfied by canonical table shapes inside the existing `## How` section, defined in `CAPSULE_SPEC.md`. Frontmatter remains `owner:` only.

---

## Philosophy

A capsule that passes the structural audit can still be useless if it does not convey what something means, where it ends, what data backs it, how an agent should reach it, who is allowed to act on it, or what risks are attached.

A v1.1 capsule operationalises five principles:

- **Precision over approximation.** Definitions outlast implementations.
- **Actionability.** Designed for correct execution, not just comprehension.
- **Minimum data hops.** A consumer (especially an AI) knows after one read which service to call, with what call signature, and which calls to avoid.
- **Governance by construction.** Access rights and risk treatment are properties of the capsule, not afterthoughts. The capsule is the agent's permission slip.
- **Late binding.** The framework references governance by named section, not by inlined values. Governance evolves; the framework holds.

A capsule is a self-contained, governed, semantic contract that AI consumers can act on without breaching Duel's AI commitments.

---

## The Twelve Evaluation Dimensions

| # | Dimension | Weight | Gate (minimum) | New in v1.1? |
|---|-----------|--------|----------------|---|
| D1 | Self-Containment | 12% | 70% | |
| D2 | Buildability | 12% | 60% | |
| D3 | Semantic Coherence | 8% | 70% | |
| D4 | Boundary Precision | 10% | 70% | |
| D5 | Outcome to Output Correlation | 8% | 60% | |
| D6 | Definitional Completeness | 5% | 70% | |
| D7 | Actionability and Anti-Patterns | 5% | 60% | revised |
| D8 | Data Origin | 7% | 70% | new |
| D9 | Usage Intent and Access Rights | 12% | 80% | new |
| D10 | Use Cases and Risk Treatment | 5% | 70% | new |
| D10A | A-Risk Use Case Treatment | 6% | 80% | new |
| D11 | Agent Orchestration | 10% | 70% | new |

Total weight: 100%.

**Overall gate:** weighted score at or above 80% AND no dimension below its gate threshold AND the Governance Gate is clear (see end of document).

Each test is scored Pass (2), Partial (1) or Fail (0). Dimension score = `(sum of test scores / max possible) x 100`.
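
The scoring rule and the overall gate can be sketched in a few lines of Python. This is an illustration only, not the classifier's implementation: the function and constant names are hypothetical, and the weights and gates are copied from the table above.

```python
# Illustrative sketch only; names are hypothetical, not the classifier's API.
PASS, PARTIAL, FAIL = 2, 1, 0

# (weight %, gate %) per dimension, copied from the dimensions table.
DIMENSIONS = {
    "D1": (12, 70), "D2": (12, 60), "D3": (8, 70), "D4": (10, 70),
    "D5": (8, 60), "D6": (5, 70), "D7": (5, 60), "D8": (7, 70),
    "D9": (12, 80), "D10": (5, 70), "D10A": (6, 80), "D11": (10, 70),
}

def dimension_score(test_scores):
    """Return (sum of test scores / max possible) x 100 for one dimension."""
    if not test_scores:
        raise ValueError("a dimension needs at least one scored test")
    return sum(test_scores) / (PASS * len(test_scores)) * 100

def overall(scores, governance_gate_clear):
    """Apply the overall gate to a dict of dimension id -> percentage score."""
    weighted = sum(scores[d] * w for d, (w, _) in DIMENSIONS.items()) / 100
    below_gate = [d for d, (_, gate) in DIMENSIONS.items() if scores[d] < gate]
    passed = bool(weighted >= 80 and not below_gate and governance_gate_clear)
    return weighted, below_gate, passed
```

For example, a dimension with four tests scored Pass, Pass, Partial, Partial gives `dimension_score([2, 2, 1, 1])`, which is 6/8 = 75%.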

### Universal scoring with soft N/A for D8 and D10A

Every capsule is scored against every dimension. No conditional skipping. Two dimensions, D8 (Data Origin) and D10A (A-Risk Use Case Treatment), use a "soft N/A" rule because some concepts genuinely have nothing to score against and forcing the issue would be theatre.

**Soft N/A rule:**
- If a capsule declares absence cleanly via D8.6 ("no warehouse-resident data backs this concept" in the Data Origin table), D8 scores against D8.6 alone. The other D8 tests are not scored. The dimension scores 2/2 = 100% when absence is declared correctly.
- If a capsule declares absence cleanly via D10A.6 ("no A-risk use cases" in the A-risk row), D10A scores against D10A.6 alone. The other D10A tests are not scored. The dimension scores 2/2 = 100%.
- Absence must be declared explicitly. A blank section or a missing table scores Fail, not N/A.
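
The soft N/A rule can be expressed as a small function. This is a hedged sketch: the function name and arguments are hypothetical, and it assumes that when absence is not declared, every test supplied for the dimension is scored in the usual way.

```python
# Illustrative sketch of the soft N/A rule for D8 and D10A.
# Names and argument shapes are assumptions, not the classifier's API.
def soft_na_score(section_present, absence_declared, test_scores=()):
    """Return the dimension score (0-100) under the soft N/A rule.

    section_present:  False for a blank section or a missing table.
    absence_declared: True when D8.6 / D10A.6 is a clean, explicit declaration.
    test_scores:      per-test scores (0, 1 or 2) when absence is not declared.
    """
    if not section_present:
        return 0.0        # blank or missing scores Fail, not N/A
    if absence_declared:
        return 100.0      # only the absence test is scored: 2/2
    if not test_scores:
        raise ValueError("no absence declared and no tests scored")
    return sum(test_scores) / (2 * len(test_scores)) * 100
```

A capsule with an explicit "no warehouse-resident data backs this concept" row would be scored with `soft_na_score(True, True)`, while a capsule with no Data Origin table at all would be scored with `soft_na_score(False, False)`.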

**Universal (no N/A):**
- D9 (Usage Intent and Access). Every capsule has a Capability Layer and consumer roles, even a pure business capsule (typically Automation or Augmented with internal roles).
- D11 (Agent Orchestration). Every capsule is reachable by an agent. A capsule with no other consumption path names `context.aida.duel.tech` as the Primary entry point.
- D10 (Use Cases). A capsule that enables no AI use cases passes by saying so explicitly and pointing at downstream capsules that do (when relevant).
- D1 through D7 apply to all capsules regardless of kind.

The asymmetry is deliberate. D8 and D10A describe properties of the concept (does data exist? do A-risk use cases exist?). D9 and D11 describe properties of the capsule (who can act on it, how is it reached?). Properties of the concept can be legitimately absent; properties of the capsule cannot.

### Late binding to governance

D9, D10, D10A and the Governance Gate reference values that live in named sections of `AI_GOVERNANCE_CONTEXT.md`. The governance capsule is served by `governance.aida.duel.tech`; the classifier reads it from that service at evaluation time, not from a repo path.

- `§ Capability Spectrum` defines the valid Capability Layer values.
- `§ Involvement Tiers` defines the valid Tier values.
- `§ Hard Boundaries` defines the rules that capsules must acknowledge.
- `§ Risk Categories` defines the valid risk category values for use cases.

If governance changes (a new layer is added, a tier is renamed, a hard boundary is added) the framework does not change. Capsules are revalidated against the new vocabulary on the next push.
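
A minimal sketch of late binding, assuming the classifier has already fetched the governance vocabulary and holds it as a dictionary keyed by section name. The function name, the dictionary shape and the placeholder values are assumptions for illustration; the real vocabulary lives in `AI_GOVERNANCE_CONTEXT.md`.

```python
# Illustrative sketch: validation logic holds no vocabulary of its own.
# The vocabulary argument stands in for whatever the governance service
# returned at evaluation time (shape assumed for this example).
def validate_against_governance(declared, vocabulary):
    """Return a list of problems; an empty list means the values are valid."""
    problems = []
    layer = declared.get("Capability Layer")
    if layer not in vocabulary["Capability Spectrum"]:
        problems.append(f"Capability Layer {layer!r} is not in § Capability Spectrum")
    for tier in declared.get("Tier eligibility", []):
        if tier not in vocabulary["Involvement Tiers"]:
            problems.append(f"Tier {tier!r} is not in § Involvement Tiers")
    return problems

# Placeholder vocabulary, not the real governance content.
vocabulary = {
    "Capability Spectrum": {"Algorithmic", "Augmented"},
    "Involvement Tiers": {"Internal only"},
}
declared = {"Capability Layer": "Algorithmic", "Tier eligibility": ["Internal only"]}
print(validate_against_governance(declared, vocabulary))  # prints []
```

If governance renames a tier, only the `vocabulary` value changes; the same capsule then reports a problem on the next evaluation without any change to the validation code.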

### Where the new content lives

Capsule frontmatter carries only `owner:`. Everything else lives in the prose body. The new dimensions are satisfied by canonical table shapes inside the existing `## How` section, set out in `CAPSULE_SPEC.md`:

- **Capability Layer and Tier eligibility** in a `### How, Usage Intent` table with rows `Capability Layer`, `Tier eligibility`, `Consumer roles`, `Hard boundaries`.
- **Data Origin** in a `### How, Data Origin` table with one row per source carrying System, Fully-qualified name, Provenance, Freshness, Grain, Known issues.
- **Agent Orchestration** in a `### How, Agent Orchestration` block with three labelled tables: Primary, Sequence (optional, required when the answer spans services), Avoid.
- **Use cases** in a `### How, Use Cases` table with columns Use case, Capability layer, Eligible tiers, Risk category, Treatment, Audit, plus a final row labelled "A-risk use cases" used by D10A.
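
Because the new dimensions are carried by tables rather than frontmatter, a consumer has to read them out of the prose body. The sketch below shows one way to parse a simple pipe table into rows. It is a hypothetical helper, not part of `CAPSULE_SPEC.md`, and it does not handle escaped pipes inside cells.

```python
# Illustrative helper for reading a canonical table out of a capsule body.
# Assumes a plain pipe table: header row, separator row, then data rows.
def parse_pipe_table(lines):
    """Parse markdown pipe-table lines into a list of dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in lines
        if line.strip().startswith("|")
    ]
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return [dict(zip(header, row)) for row in body]

usage_intent = [
    "| Row | Value |",
    "|---|---|",
    "| Capability Layer | Algorithmic |",
    "| Tier eligibility | Internal only |",
]
for row in parse_pipe_table(usage_intent):
    print(row["Row"], "->", row["Value"])
```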

---

## D1, Self-Containment (Weight: 12%)

> Can a reader understand this domain entirely from within this document, with zero external lookups?

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D1.1, Custom term coverage. Every domain-specific term is defined within the document or is unambiguous common English. | All terms defined | 1 or 2 terms undefined | 3 or more terms undefined, or a critical term undefined |
| D1.2, No naked references. The document does not direct the reader to an external source without first providing enough local context. | No naked references | 1 soft reference ("see also") with local context | A reference that is load-bearing for understanding |
| D1.3, Identifier legibility. Field names, enum values and identifiers that affect behaviour are described, not just named. | All identifiers described | Key identifiers described, minor ones bare | Core identifiers bare |
| D1.4, Acronym expansion. All acronyms and initialisms are expanded on first use. | All expanded | 1 or 2 minor unexpanded | A critical acronym unexpanded |

---

## D2, Buildability (Weight: 12%)

> Given only this document, could an engineer implement a correct first version?

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D2.1, Input identification | Inputs and shape clear | Inputs named but shape vague | No meaningful input description |
| D2.2, Output identification | Outputs and shape clear | Outputs named but shape vague | No meaningful output description |
| D2.3, Core logic legibility | Core logic clear and specific | General description | Opaque |
| D2.4, Dependency roles | All deps named and role described | Deps named, roles partial | Deps unnamed or roles absent |
| D2.5, Error and edge cases | Key failure modes covered | Some edge cases | No failure modes |

---

## D3, Semantic Coherence (Weight: 8%)

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D3.1, What and How alignment | Fully consistent | Minor inconsistency | Contradiction |
| D3.2, Why and How achievability | All purposes achievable | Most achievable | Why claims outcomes How cannot deliver |
| D3.3, Who and How consistency | Fully consistent | Minor gap | Contradiction |
| D3.4, Gotchas grounded in reality | All traceable | One orphaned | A Gotcha contradicts the How |
| D3.5, Scope and What/How consistency | Fully consistent | Minor additions | Scope includes unsupported items |

---

## D4, Boundary Precision (Weight: 10%)

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D4.1, Explicit exclusions | 2 or more specific exclusions | 1 specific or 2 vague | No meaningful exclusions |
| D4.2, Adjacent system routing | All exclusions forward to a named system | Most forwarded | Dead ends |
| D4.3, Routeability test | High confidence routing for 5 plausible features | Edge cases ambiguous | Ambiguous for obvious cases |
| D4.4, Correct categorisation | Bounds content in Bounds, not Gotchas | Minor bleed | Systematic confusion |

---

## D5, Outcome to Output Correlation (Weight: 8%)

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D5.1, Why to How traceability | Full traceability | Most outcomes traced | Outcomes with no mechanism |
| D5.2, How to Why coverage | All mechanisms purposeful | Most purposeful | Mechanisms with no stated purpose |
| D5.3, Implicit success criteria | Clear implicit criteria | Partially inferrable | Entirely opaque |

---

## D6, Definitional Completeness (Weight: 5%)

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D6.1, No orphan terms | No orphans | 1 or 2 minor orphans | A critical term is orphaned |
| D6.2, Ticket reference legibility | All explained locally | Most explained | Bare ticket references |
| D6.3, Constraint resolution | Constraints with specificity | Key constraints described | Constraints implied |

---

## D7, Actionability and Anti-Patterns (Weight: 5%, revised)

> Could a reader act on this domain immediately and correctly after reading this capsule, and do they know what not to do at the concept level?

D7 absorbs the v1.0 Actionability tests and adds a specific test for named anti-patterns. The "minimum data hops" routing concern lives in D11; D7 retains the high-level routeability and contributor-readiness tests and adds concept-level anti-patterns (not service-routing anti-patterns, which are tested by D11.4).

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D7.1, LLM routeability. An LLM could answer "is X part of this concept?" with high accuracy using only this capsule. | High accuracy expected | Edge cases likely wrong | Low accuracy |
| D7.2, New contributor readiness. A developer new to the domain could make a correct first change after reading. | Ready to contribute | Mostly ready | Significant verbal transfer needed |
| D7.3, Gotcha specificity. Gotchas are actionable warnings, not general cautions. | All specific and actionable | Most actionable | Vague |
| D7.4, Concept-level anti-patterns named. The capsule names at least two concrete wrong-uses of the concept itself (not service routing): common misuses, deprecated patterns, things people frequently confuse with this, with a one-line rationale each. | 2 or more named with rationale | 1 named | None named |

---

## D8, Data Origin (Weight: 7%, new, soft N/A)

> Does the capsule make explicit what data backs the concept, where it lives, how fresh it is and what shape it takes?

D8 is satisfied by a `### How, Data Origin` table with one row per source carrying System, Fully-qualified name, Provenance (canonical or legacy), Freshness, Grain, Known issues.

**Soft N/A:** a capsule whose concept has no warehouse-resident data satisfies D8 by declaring this in the same table (one row: "No warehouse-resident data backs this concept"). When absence is declared via D8.6, only D8.6 is scored; the dimension scores 2/2 = 100% on clean declaration. The other D8 tests are not applied. A blank section or a missing table scores Fail (not N/A).

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D8.1, Sources named. Every data source is named with system, database, schema and (where relevant) table, view or collection. | All sources fully qualified | Most qualified | Sources unnamed or only described in prose |
| D8.2, Source provenance. For each source, the upstream system and the legacy or canonical status are stated. | All sources have provenance | Most do | Provenance silent |
| D8.3, Freshness. Refresh cadence, lag tolerance, or "point-in-time" is stated. | Per-source freshness | Stated overall | No freshness statement |
| D8.4, Shape and granularity. The grain and the canonical join keys are described. | Grain plus keys per source | Grain partial | Grain implied, not stated |
| D8.5, Known data quality issues. At least one concrete caveat per source where one exists. | Caveats per source | One general caveat | No caveats |
| D8.6, Absence declared. A concept with no warehouse-resident data states that explicitly. | Explicit declaration | Implied | Section silent |

---

## D9, Usage Intent and Access Rights (Weight: 12%, gate 80%, new)

> Does the capsule declare who is allowed to act on this concept, at which Capability Layer, under which Involvement Tier, and does that declaration align with `AI_GOVERNANCE_CONTEXT.md`?

D9 is satisfied by a `### How, Usage Intent` table with rows for Capability Layer, Tier eligibility, Consumer roles, Hard boundaries. The values in those rows are validated by the classifier against the named sections of `AI_GOVERNANCE_CONTEXT.md` (served by `governance.aida.duel.tech`) at evaluation time.

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D9.1, Capability layer declared. Drawn from `§ Capability Spectrum`. | Declared and valid | Declared but ambiguous | Absent or not in the spectrum |
| D9.2, Tier eligibility declared. Drawn from `§ Involvement Tiers`. | Declared and valid | Mentioned in prose only | Silent or invalid |
| D9.3, Consumer roles enumerated, each with effective access. | Roles plus access enumerated | Roles named without access | Consumer roles unspecified |
| D9.4, Hard-boundary alignment. The capsule lists every boundary currently enumerated in `§ Hard Boundaries` and marks each as Honoured, Not Applicable, or Conflict. Any Conflict fails outright (see Governance Gate). | All boundaries addressed | Some addressed | Boundaries unaddressed |
| D9.5, RBAC binding hooks. For each declared Capability Layer and Tier combination, the dataset classes a consumer may read and may write are stated (or the capsule explicitly references the matrix in `AI_GOVERNANCE_CONTEXT.md`). | Per-combination rules or explicit reference | Partial rules | No machine-actionable binding |
| D9.6, Sensitivity matches data. The capsule's Sensitivity meta-field is appropriate for the data classes named under D8 and the consumer roles named under D9.3. | Sensitivity matches | Slightly conservative or slightly loose | Mismatched |

> Cross-reference rule: a capsule whose Capability Layer is anything other than "none declared" MUST reference `governance.aida.duel.tech` (or the slug `ai-governance`) in its References meta-field. Capsules that do not reference governance fail D9 by default.
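
The cross-reference rule lends itself to a mechanical check. The following is a sketch under assumptions: the function name is hypothetical, and the References meta-field is assumed to be available as a list of strings.

```python
# Illustrative check for the cross-reference rule; not the classifier's code.
GOVERNANCE_MARKERS = ("governance.aida.duel.tech", "ai-governance")

def references_governance(capability_layer, references):
    """Return True when the cross-reference rule is satisfied.

    references: entries from the capsule's References meta-field
                (assumed here to be a list of strings).
    """
    if capability_layer == "none declared":
        return True  # the rule only binds capsules that declare a layer
    return any(
        marker in entry
        for entry in references
        for marker in GOVERNANCE_MARKERS
    )
```

A capsule declaring a Capability Layer with an empty References list returns `False` here, which under the rule above means D9 fails by default.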

---

## D10, Use Cases and Risk Treatment (Weight: 5%, new)

> Does the capsule enumerate concrete use cases with risk category and applied controls, aligned to the ISO 42001 register pattern?

D10 is satisfied by a `### How, Use Cases` table with columns Use case, Capability layer, Eligible tiers, Risk category, Treatment, Audit. Risk category values are validated against `AI_GOVERNANCE_CONTEXT.md § Risk Categories`.

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D10.1, Use cases enumerated. The capsule lists concrete use cases (not capabilities). | 2 or more listed | 1 listed | No use cases |
| D10.2, Risk category assigned. Each use case is tagged with a value from `§ Risk Categories`. | All categorised | Most | None or invalid values |
| D10.3, Risk treatment described. For each non-Minimal use case, the capsule names the applied controls. | All non-Minimal treated | Some | Risks named, treatment silent |
| D10.4, Audit obligations stated. May reference the audit pattern in `AI_GOVERNANCE_CONTEXT.md`. | Stated per use case or referenced | Stated overall | Audit silent |
| D10.5, Prohibited use named. Where the concept could be misused at a higher capability layer than authorised, the forbidden use is named. | Named | Implied | Not addressed |

---

## D10A, A-Risk Use Case Treatment (Weight: 6%, gate 80%, new, soft N/A)

> For use cases at the Agentic or Artificial layer (per `§ Capability Spectrum`), or marked High in `§ Risk Categories`, is the applied risk treatment of sufficient depth?

D10A applies only to use cases qualifying as A-risk. The capsule satisfies D10A by listing each A-risk use case in the `### How, Use Cases` table and adding a final row labelled "A-risk use cases" with one cell per qualifying use case, each enumerating the applied controls.

**Soft N/A:** a capsule with no A-risk use cases passes D10A by stating "no A-risk use cases" in the A-risk row. When absence is declared via D10A.6, only D10A.6 is scored; the dimension scores 2/2 = 100% on clean declaration. The other D10A tests are not applied. A missing A-risk row scores Fail (not N/A).

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D10A.1, A-risk inventory complete. Every Agentic, Artificial or High-risk use case in the table is flagged as A-risk and appears in the A-risk row. | All flagged | One missed | Multiple missed or A-risk row absent |
| D10A.2, Named confidence threshold or decision boundary. For each A-risk use case, the threshold below which the AI must not act (or the decision boundary above which it may) is named. | Stated per A-risk use case | Stated overall | Not stated |
| D10A.3, Named human-in-the-loop or appeal route. For each A-risk use case, the human review point (synchronous gate, post-hoc review, advocate appeal) is named with the role accountable. | Stated per A-risk use case | Stated overall | Not stated |
| D10A.4, Named kill switch or rollback path. For each A-risk use case, the mechanism by which the action can be reversed, halted or rolled back is named. | Stated per A-risk use case | Stated overall | Not stated |
| D10A.5, Named monitoring signal. For each A-risk use case, the metric or alert that would surface drift, failure or misuse is named. | Stated per A-risk use case | Stated overall | Not stated |
| D10A.6, Absence declared. A capsule with no A-risk use cases states this explicitly. | Explicit declaration | Implied | Row silent |

D10A's strict gate (80%) reflects the asymmetry: an A-risk use case without named treatment is the kind of capsule that fails an ISO 42001 audit and the kind of capsule that produces real-world harm. Minimal and Limited use cases tolerate looser treatment; A-risk does not.
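
A rough sketch of how A-risk use cases might be identified and checked for the four named controls (D10A.2 to D10A.5). The field names and the dictionary shape are hypothetical. The layer and category names are taken from this document, but a real check would read them from the governance service rather than hard-coding them.

```python
# Illustrative sketch; field names are assumptions, and the vocabulary is
# hard-coded here only to keep the example self-contained.
A_RISK_LAYERS = {"Agentic", "Artificial"}
A_RISK_CATEGORIES = {"High"}
REQUIRED_CONTROLS = {
    "D10A.2": "threshold",      # confidence threshold or decision boundary
    "D10A.3": "human_review",   # human-in-the-loop or appeal route
    "D10A.4": "kill_switch",    # kill switch or rollback path
    "D10A.5": "monitoring",     # monitoring signal
}

def is_a_risk(use_case):
    """A use case is A-risk by capability layer or by risk category."""
    return (use_case["capability_layer"] in A_RISK_LAYERS
            or use_case["risk_category"] in A_RISK_CATEGORIES)

def missing_treatment(use_case):
    """Return the D10A test ids whose named control is absent."""
    controls = use_case.get("controls", {})
    return [test for test, key in REQUIRED_CONTROLS.items() if not controls.get(key)]
```

A use case at the Agentic layer that names only a threshold and a human review point would be reported as missing D10A.4 and D10A.5.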

---

## D11, Agent Orchestration (Weight: 10%, strict, universal, new)

> For the typical question this concept answers, which service should an agent call, with what call signature, in what sequence, and which calls must it avoid?

D11 is satisfied by a `### How, Agent Orchestration` block in the `## How` section with three labelled tables: Primary, Sequence (optional, required when the answer spans services), Avoid.

D11 references the three canonical Duel agent services. These are the entry points an AI agent should reach for. Protocol (REST or MCP) is orthogonal; the service URL is the contract.

| Service URL | Purpose |
|---|---|
| context.aida.duel.tech | Capsule lookup, semantic search across the corpus, graph navigation. Every concept is reachable here as a fallback entry point. |
| data.aida.duel.tech | Warehouse-resident metric queries and structured data. Applies brand RLS and exclusion sets. |
| governance.aida.duel.tech | Policy lookup, capability and tier validation, use-case register queries, hard-boundary checks. Serves `AI_GOVERNANCE_CONTEXT.md` and this framework. |

A capsule with no other consumption path passes D11 by naming `context.aida.duel.tech/context_get(slug=...)` as the Primary entry point. A metric capsule names `data.aida.duel.tech` as Primary with the actual call signature. A governance-policy-bearing capsule names `governance.aida.duel.tech`. Every capsule must name something.

| Test | Pass | Partial | Fail |
|------|------|---------|------|
| D11.1, Primary entry point named. The capsule names one of the three Duel services as the canonical entry point for the typical question this concept answers. | Service named and matches the concept's domain | Service named but mismatched to concept | Primary not named |
| D11.2, Call signature with example. The Primary table gives endpoint path or MCP tool name, the key arguments, the expected response shape, AND a worked example call. | Service URL plus example call | Service plus tool name only | Service only, no call shape |
| D11.3, Call sequence given where required. When answering the typical question spans more than one service, the capsule gives the ordered sequence of calls. Single-service concepts may omit. | Sequence present and ordered | Sequence partial or ambiguous | Multi-service concept with no sequence |
| D11.4, Avoid clause present. The capsule names at least one service or path an agent must not use for this concept, with a one-line rationale. This is service-routing anti-patterns; concept-level anti-patterns live in D7.4. | At least one avoid entry with rationale | Avoid entry without rationale | No avoid clause |
| D11.5, Anti-pattern alignment. The avoid entries in D11 are consistent with the anti-patterns named in D7.4 (no contradictions). | Fully aligned | Minor mismatch | Contradiction between D7.4 and D11.4 |
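
The structural parts of D11.1, D11.2 and D11.4 can be pre-checked mechanically before a reviewer looks at the capsule. The sketch below is an assumption about how that might look. The function name and dictionary keys are hypothetical, and whether the named service actually matches the concept's domain still needs judgement.

```python
# Illustrative pre-check for the Primary and Avoid tables; not the classifier.
AGENT_SERVICES = {
    "context.aida.duel.tech",
    "data.aida.duel.tech",
    "governance.aida.duel.tech",
}

def precheck_orchestration(primary, avoid):
    """Return structural scores (0, 1 or 2) for D11.1, D11.2 and D11.4.

    primary: dict with optional 'service', 'call' and 'example' keys.
    avoid:   list of dicts with 'path' and optional 'reason' keys.
    """
    scores = {}
    scores["D11.1"] = 2 if primary.get("service") in AGENT_SERVICES else 0
    if primary.get("call") and primary.get("example"):
        scores["D11.2"] = 2          # call shape plus worked example
    elif primary.get("call"):
        scores["D11.2"] = 1          # call shape only
    else:
        scores["D11.2"] = 0          # service only, no call shape
    if any(entry.get("reason") for entry in avoid):
        scores["D11.4"] = 2          # at least one avoid entry with rationale
    elif avoid:
        scores["D11.4"] = 1          # avoid entry without rationale
    else:
        scores["D11.4"] = 0          # no avoid clause
    return scores
```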

---

## D8 to D11 worked example

A `## How` section for a metric capsule (GAV) might contain:

```
### How, Usage Intent

| Row | Value |
|---|---|
| Capability Layer | Algorithmic |
| Tier eligibility | Internal only |
| Consumer roles | AIDA agent (read aggregate), brand operator dashboards (read aggregate via Augmented surface), advocates (no access) |
| Hard boundaries | HB1 Honoured, HB2 Honoured, HB3 Honoured (read-only), HB4 Honoured |

### How, Data Origin

| System | Name | Provenance | Freshness | Grain | Known issues |
|---|---|---|---|---|---|
| Snowflake | DUEL_INTELLIGENCE.CORTEX_SERVICES.GAV_SEMANTIC_VIEW | Canonical | 15 min lag | Per brand per day | Excludes brands listed in EXCLUDED_BRANDS reference table |

### How, Agent Orchestration

Primary:

| Service | Call | Example |
|---|---|---|
| data.aida.duel.tech | GET /metrics?slug=gav&brand=...&from=...&to=... | curl 'https://data.aida.duel.tech/metrics?slug=gav&brand=acme&from=2026-04-01&to=2026-04-30' |

Sequence (composite question, "what is GAV and why is it down"):

| Step | Service | Call |
|---|---|---|
| 1 | context.aida.duel.tech | context_get(slug=gav) |
| 2 | data.aida.duel.tech | GET /metrics?slug=gav&brand=... |
| 3 | data.aida.duel.tech | GET /metrics?slug=gav&brand=...&breakdown=tier |

Avoid:

| Service or path | Reason |
|---|---|
| Direct Snowflake to DUEL_RAW.MONGODB.* | Legacy Airbyte source, stale. data.aida.duel.tech is canonical. |
| governance.aida.duel.tech for GAV figures | Governance answers policy and capability, not metrics. |

### How, Use Cases

| Use case | Capability layer | Eligible tiers | Risk category | Treatment | Audit |
|---|---|---|---|---|---|
| AIDA GAV query | Algorithmic | Internal only | Limited | RLS via semantic view; confidence threshold on Analyst | AIDA Linear ticket per session |
| Brand dashboard GAV tile | Augmented | 1, 2, 3, 4 | Limited | Pre-aggregated, brand-scoped RLS | Standard platform audit |
| A-risk use cases | No A-risk use cases | n/a | n/a | n/a | n/a |
```

A non-data capsule (a Team capsule, say) satisfies the same dimensions with:

```
### How, Data Origin

| System | Name | Provenance | Freshness | Grain | Known issues |
|---|---|---|---|---|---|
| (none) | No warehouse-resident data backs this concept | n/a | n/a | n/a | n/a |

### How, Agent Orchestration

Primary:

| Service | Call | Example |
|---|---|---|
| context.aida.duel.tech | context_get(slug=...) | curl 'https://context.aida.duel.tech/api/context/get?slug=team-data-and-ai' |

Avoid:

| Service or path | Reason |
|---|---|
| data.aida.duel.tech | This concept is descriptive, not data-backed. data has nothing to return. |
```

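The framework treats both capsules as full passes for D8 and D11. The GAV capsule supplies the tables in full; the Team capsule declares the absence of warehouse-resident data explicitly and names `context.aida.duel.tech` as its Primary entry point, rather than skipping either section.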

---

## Framework behaviour across capsule kinds

The framework is intended to score fairly across capsule kinds. Two archetypes illustrate the contrast: a data-heavy metrics capsule (GAV) and a pure business capsule (Team Data and AI). Both can reach the 80% overall gate when well-authored. Neither gets a free pass.

### Side-by-side dimension expectations

| Dimension | Weight | GAV (data-heavy metric) | Team Data and AI (pure business) |
|---|---|---|---|
| D1 Self-Containment | 12% | High: defines GAV vs CAV vs NetRev, brand exclusions, all identifiers | High: defines team scope, roles, programme acronyms |
| D2 Buildability | 12% | High: an analyst can write the SQL after reading | Lower stakes: read as "could a new hire join this team after reading?" |
| D3 Semantic Coherence | 8% | Standard | Standard |
| D4 Boundary Precision | 10% | High: GAV is not CAV; CAV lives in CAV_CONTEXT.md | High: this team is not the data engineering team; that lives in TEAM_DATAENG_CONTEXT.md |
| D5 Outcome to Output | 8% | Standard | Standard |
| D6 Definitional Completeness | 5% | Standard | Standard |
| D7 Actionability and Anti-Patterns | 5% | "Don't use GAV without applying EXCLUDED_BRANDS"; "Don't sum across brands without RLS" | "Don't ask this team to own data engineering"; "Don't confuse 'AI strategy' with 'AI governance'" |
| D8 Data Origin | 7% | Full table: Snowflake semantic view, canonical, 15 min lag, per brand per day, known DQ exclusions | Soft N/A: "no warehouse-resident data backs this concept" |
| D9 Usage Intent and Access | 12% | Algorithmic, Internal only, AIDA + brand dashboards, HB1-4 all Honoured | Augmented (the capsule itself is the data), all internal roles read, no write, HB1-4 mostly Not Applicable |
| D10 Use Cases and Risk | 5% | AIDA GAV query (Limited), brand dashboard tile (Limited) | "This concept enables no direct AI use cases. Downstream capsules: ROADMAP_CONTEXT, DATA_STRATEGY_CONTEXT" |
| D10A A-Risk Treatment | 6% | Soft N/A: no A-risk use cases | Soft N/A: no A-risk use cases |
| D11 Agent Orchestration | 10% | Primary `data.aida.duel.tech` with example call; Sequence for composite questions; Avoid raw warehouse | Primary `context.aida.duel.tech/context_get(slug=team-data-and-ai)`; Avoid `data.aida.duel.tech` and `governance.aida.duel.tech` |

### Expected score distribution

**Well-authored GAV capsule** (illustrative):

```
D1   100%  D5    83%  D9   100%
D2    90%  D6   100%  D10  100%
D3   100%  D7    88%  D10A 100% (soft N/A)
D4   100%  D8   100%  D11  100%

Weighted overall: ~97%
Governance Gate: pass
```

**Well-authored Team capsule** (illustrative):

```
D1   100%  D5    83%  D9    92%
D2    80%  D6   100%  D10  100%
D3   100%  D7   100%  D10A 100% (soft N/A)
D4   100%  D8   100% (soft N/A)  D11  100%

Weighted overall: ~95%
Governance Gate: pass
```

Both archetypes can comfortably clear 80% without distortion. The soft N/A rule on D8 and D10A means the Team capsule isn't penalised for the absence of data or A-risk use cases, while D9, D10 and D11 remain hard, so the Team capsule still has to declare intent, point at downstream use cases, and provide an agent entry point.

### Where each archetype is hardest

- **GAV** is hardest on D8 (full source table with provenance and known DQ issues), D9.5 (RBAC binding for Algorithmic x Internal), D11.3 (composite call sequence for "what is GAV and why is it down?"), and D11.4 (avoiding the legacy warehouse paths).
- **Team Data and AI** is hardest on D4 (clearly excluding the data engineering team and analytics teams), D7.4 (concept-level anti-patterns: "don't confuse this team's remit with X"), and D9.3 (consumer roles for a concept that's about people).

### What this exposes about the framework

The framework treats data-heavy and business capsules symmetrically *for the dimensions that apply to both*, and applies the soft N/A rule to the two that may not (D8 and D10A). The risk that surfaces is leniency rather than unfairness: a business capsule has fewer dimensions where it can fail and may score higher than it deserves if the universally applicable dimensions (D1, D4, D7, D9, D11) aren't pressed hard. The mitigation is in the rubric for those dimensions: D9 demands a Capability Layer even when "Automation (descriptive)" is the only honest answer; D11 demands a worked example call even when it's a single `context_get`. The framework should be calibrated in the pilot to make sure those dimensions are strict enough to keep business capsules honest.

---

## Scoring example

```
D1   Self-Containment                     7/8   = 87.5%  pass
D2   Buildability                         8/10  = 80.0%  pass
D3   Semantic Coherence                   9/10  = 90.0%  pass
D4   Boundary Precision                   7/8   = 87.5%  pass
D5   Outcome to Output                    5/6   = 83.3%  pass
D6   Definitional Completeness            5/6   = 83.3%  pass
D7   Actionability and Anti-Patterns      6/8   = 75.0%  pass (gate 60%)
D8   Data Origin                          9/12  = 75.0%  pass (gate 70%)
D9   Usage Intent and Access              8/12  = 66.7%  fail (gate 80%)
D10  Use Cases and Risk                   6/10  = 60.0%  fail (gate 70%)
D10A A-Risk Use Case Treatment            4/12  = 33.3%  fail (gate 80%)
D11  Agent Orchestration                  7/10  = 70.0%  pass (gate 70%)

Weighted overall:                                75.4%   below 80%
Governance Gate:                                 fail (no reference to governance.aida.duel.tech)

Verdict: significant revision needed. Priorities:
  1. D10A: each A-risk use case is missing confidence threshold, kill switch and monitoring. A-risk has the strictest gate; treat first.
  2. D9: declare Capability Layer and Tier eligibility, reference governance.aida.duel.tech.
  3. D10: enumerate use cases with risk categories drawn from § Risk Categories.
```
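The per-dimension arithmetic and the ordering of priorities in the example can be reproduced with a few lines of Python. This is a minimal sketch, not the classifier's implementation: the `DimensionScore` type and `revision_priorities` function are hypothetical names, only the dimensions whose gates are printed in the example are included, and ranking by shortfall against the gate is an assumption that happens to reproduce the order shown. The weighted overall is not computed because the weights are not stated in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DimensionScore:
    code: str        # dimension label, e.g. "D9"
    earned: int      # points awarded
    possible: int    # points available
    gate_pct: float  # minimum percentage needed to pass the dimension

    @property
    def pct(self) -> float:
        return 100.0 * self.earned / self.possible

    @property
    def passed(self) -> bool:
        return self.pct >= self.gate_pct

    @property
    def shortfall(self) -> float:
        # Percentage points below the gate; 0.0 when the gate is met.
        return max(0.0, self.gate_pct - self.pct)


def revision_priorities(scores: List[DimensionScore]) -> List[DimensionScore]:
    """Return failing dimensions, largest shortfall first (assumed ranking)."""
    failing = [s for s in scores if not s.passed]
    return sorted(failing, key=lambda s: s.shortfall, reverse=True)


# Scores and gates copied from the example above (D7 to D11 only).
EXAMPLE = [
    DimensionScore("D7", 6, 8, 60.0),
    DimensionScore("D8", 9, 12, 70.0),
    DimensionScore("D9", 8, 12, 80.0),
    DimensionScore("D10", 6, 10, 70.0),
    DimensionScore("D10A", 4, 12, 80.0),
    DimensionScore("D11", 7, 10, 70.0),
]

for s in revision_priorities(EXAMPLE):
    print(f"{s.code}: {s.pct:.1f}% (gate {s.gate_pct:.0f}%)")
```

Run against the example scores, the failing dimensions come out as D10A, D9, D10, matching the listed priorities.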

---

## The Turing Test, final qualitative gate

After scoring, apply the following. It cannot be fully automated (Phase 4 of the automation strategy only approximates it with an uninformed LLM reader), but it is the most honest signal:

> Read the capsule in full. Then close it. Now answer:
>
> 1. What does this concept do? *(What)*
> 2. Who consumes it, at which Capability Layer, under which Tier, and why? *(Who plus Why plus D9)*
> 3. What are the three things most likely to go wrong when building or changing it? *(Gotchas)*
> 4. Name one thing that is explicitly NOT part of this concept, and where it lives instead. *(Bounds)*
> 5. For the typical question this concept answers, which exact downstream call should the agent make, and which calls must it avoid? *(D11)*
> 6. Name one concrete use case this concept enables, its risk category, and the applied control. *(D10)*
> 7. If this concept has any A-risk use cases, name one, and for it state the threshold, the human-in-the-loop, and the kill switch. If it has none, confirm that and say why. *(D10A)*
> 8. Could a Tier-1 brand's AI tools consume this concept? If not, why not? *(D9.2)*

A capsule that scores at or above 80% overall but fails questions 5, 6, 7 or 8 should be revised. Scoring is a guide; the eight-question Turing test is the verdict.

---

## The Governance Gate (hard)

Independent of dimension scores, a capsule fails outright if any of the following hold. The values referenced live in the named sections of `AI_GOVERNANCE_CONTEXT.md` (served by `governance.aida.duel.tech`) and are read by the classifier at evaluation time.

- The concept involves data access or AI action and the capsule does not reference `governance.aida.duel.tech` (or the slug `ai-governance`) in its References meta-field.
- The capsule names a use case without a risk category drawn from `§ Risk Categories`.
- The capsule names a use case at the Agentic or Artificial layer (per `§ Capability Spectrum`) or marked High (per `§ Risk Categories`) without naming all four of: confidence threshold or decision boundary, human-in-the-loop or appeal route, kill switch or rollback path, monitoring signal. This is the D10A failure restated as a hard gate.
- The capsule's Sensitivity meta-field is inconsistent with the data classes it names in `### How, Data Origin`.
- The capsule marks any row in `§ Hard Boundaries` as Conflict.

These are contractual conditions for the capsule to be served by `context.aida.duel.tech` and indexed for AI consumption.
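The five conditions can be read as a pure function from a parsed capsule to a list of failures. The sketch below is illustrative only: the `Capsule` and `UseCase` shapes, the control keys and the `sensitivity_consistent` flag are hypothetical, and the risk categories are passed in because the real values are late-bound from `AI_GOVERNANCE_CONTEXT.md`. The Sensitivity check is assumed to be computed elsewhere and is represented here as a boolean.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical names throughout; none of these come from the classifier.
GOVERNANCE_REFS = {"governance.aida.duel.tech", "ai-governance"}
A_RISK_LAYERS = {"Agentic", "Artificial"}
REQUIRED_CONTROLS = {"threshold", "human_in_the_loop", "kill_switch", "monitoring"}


@dataclass
class UseCase:
    name: str
    layer: str
    risk_category: Optional[str] = None
    controls: Set[str] = field(default_factory=set)


@dataclass
class Capsule:
    references: Set[str]
    touches_data_or_ai: bool
    use_cases: List[UseCase] = field(default_factory=list)
    sensitivity_consistent: bool = True  # assumed to be checked elsewhere
    boundary_statuses: List[str] = field(default_factory=list)


def governance_gate_failures(capsule: Capsule, risk_categories: Set[str]) -> List[str]:
    """Return one message per violated hard-gate condition; empty means pass."""
    failures = []
    if capsule.touches_data_or_ai and not (capsule.references & GOVERNANCE_REFS):
        failures.append("no governance reference in References")
    for uc in capsule.use_cases:
        if uc.risk_category not in risk_categories:
            failures.append(f"{uc.name}: no valid risk category")
        is_a_risk = uc.layer in A_RISK_LAYERS or uc.risk_category == "High"
        missing = REQUIRED_CONTROLS - uc.controls
        if is_a_risk and missing:
            failures.append(f"{uc.name}: A-risk controls missing: {sorted(missing)}")
    if not capsule.sensitivity_consistent:
        failures.append("Sensitivity inconsistent with named data classes")
    if "Conflict" in capsule.boundary_statuses:
        failures.append("Hard Boundaries row marked Conflict")
    return failures


# Placeholder categories; the live set is read from the governance service.
RISK_CATEGORIES = {"Minimal", "Limited", "High"}

draft = Capsule(
    references={"CAPSULE_SPEC.md"},
    touches_data_or_ai=True,
    use_cases=[UseCase("example-agentic-use-case", "Agentic", "High", {"threshold"})],
)
failures = governance_gate_failures(draft, RISK_CATEGORIES)
for failure in failures:
    print(failure)
```

Returning every failure rather than stopping at the first lets an author fix all gate violations in one revision.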

---

## Automation strategy

The classifier (`@dueltech/capsule-classifier`) and the context service (`context.aida.duel.tech`) implement this framework in four phases.

**Phase 1, structural and reference pre-check (deterministic).**
Verify presence of the canonical `### How,` tables in the capsule body, presence of a reference to `governance.aida.duel.tech` where required, and that the declared Capability Layer and Tier values are members of the live governance enumerations (fetched from `governance.aida.duel.tech` at evaluation time). Reject obvious failures before spending LLM tokens.
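A deterministic pre-check of this kind might look like the sketch below. It assumes the capsule body is available as a Markdown string and that the governance enumerations have already been fetched and are passed in as sets. The function name, its parameters and the sample enumeration values are hypothetical, and the check looks only for the hostname, not the `ai-governance` slug.

```python
import re
from typing import List, Set

HOW_HEADING = re.compile(r"^### How,", re.MULTILINE)
GOVERNANCE_HOST = "governance.aida.duel.tech"


def precheck(
    body: str,
    declared_layer: str,
    declared_tier: str,
    layers: Set[str],
    tiers: Set[str],
    needs_governance_ref: bool = True,
) -> List[str]:
    """Return structural problems found without any LLM call."""
    problems = []
    if not HOW_HEADING.search(body):
        problems.append("no '### How,' section found")
    if needs_governance_ref and GOVERNANCE_HOST not in body:
        problems.append(f"no reference to {GOVERNANCE_HOST}")
    if declared_layer not in layers:
        problems.append(f"unknown Capability Layer: {declared_layer!r}")
    if declared_tier not in tiers:
        problems.append(f"unknown Tier: {declared_tier!r}")
    return problems


# Placeholder enumerations; the live values come from the governance service.
LAYERS = {"Automation", "Algorithmic", "Agentic", "Artificial"}
TIERS = {"Internal", "Tier-1"}

sample_body = """### How, Data Origin

| Source | Freshness |
|--------|-----------|

References: governance.aida.duel.tech
"""

print(precheck(sample_body, "Algorithmic", "Internal", LAYERS, TIERS))
print(precheck("# Empty capsule", "Autonomous", "Tier-9", LAYERS, TIERS))
```

The first call returns an empty list; the second reports all four problems, which is the cheap rejection the phase is meant to provide before any tokens are spent.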

**Phase 2, semantic dimension scoring (Haiku LLM pass).**
A Haiku-driven evaluator fetches the live `AI_GOVERNANCE_CONTEXT.md` from `governance.aida.duel.tech`, extracts the four canonical sections as enumerations, and scores each test against the rubric. The classifier emits a JSON report.

**Phase 3, cross-capsule consistency (graph pass).**
Read the per-repo `<repo>.context-capsule` index. For each capsule, check that all References targets exist, that any capsule whose data classes fall within governance scope references the governance capsule, and that Capability and Tier declarations are consistent with the use-case register. Surface orphaned capsules for review.
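Part of the graph pass can be sketched over a simple in-memory index. The example below assumes the index has already been parsed into a mapping from capsule slug to the set of slugs it references; the on-disk format of `<repo>.context-capsule` is not described here, so parsing is left out. The function name and the sample slugs are hypothetical, and "orphaned" is taken to mean a capsule with no inbound and no outbound references, which is an assumption. The Capability and Tier consistency check is omitted.

```python
from typing import Dict, Set


def graph_findings(
    index: Dict[str, Set[str]],
    governed: Set[str],
    governance_slug: str = "ai-governance",
) -> dict:
    """Report dangling references, missing governance links and orphans."""
    known = set(index)
    referenced = set().union(*index.values())
    findings = {"dangling": {}, "missing_governance": [], "orphaned": []}
    for slug in sorted(index):
        refs = index[slug]
        missing = sorted(refs - known)
        if missing:
            # References target that is not a capsule in this index.
            findings["dangling"][slug] = missing
        if slug in governed and governance_slug not in refs:
            findings["missing_governance"].append(slug)
        if not refs and slug not in referenced:
            # Assumed definition of an orphan: no edges in either direction.
            findings["orphaned"].append(slug)
    return findings


# Hypothetical index: slug -> slugs named in its References meta-field.
index = {
    "ai-governance": set(),
    "gav": {"ai-governance", "orders"},
    "team-data-ai": {"ai-governance"},
    "glossary": set(),
}

report = graph_findings(index, governed={"gav"})
print(report)
```

In this sample, `gav` has a dangling reference to `orders`, and `glossary` is surfaced as an orphan because nothing links to it and it links to nothing.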

**Phase 4, uninformed-reader Turing pass.**
A second LLM call with no context other than the capsule attempts the eight Turing questions. Answers are scored against ground-truth Q&A prepared by the capsule owner. Required for any capsule scoring 70 to 85%.

---

## Version and Metadata

| Field | Value |
|-------|-------|
| Framework version | 1.1 |
| Status | For review |
| Owner | Owen Tribe (Head of Data and AI), owen@duel.tech |
| Updated | 13 May 2026 |
| Supersedes | v1.0 (13 April 2026) |
| Served by | `governance.aida.duel.tech` |
| Binds against | `AI_GOVERNANCE_CONTEXT.md § Capability Spectrum`, `§ Involvement Tiers`, `§ Hard Boundaries`, `§ Risk Categories`. Late binding: values are read from `governance.aida.duel.tech` at classifier evaluation time, not inlined here. |
| Based on | contextcapsules.com v1, Duel platform capsule format, ISO 42001 (planned 2026), EU AI Act risk tiering |
| References | `AI_GOVERNANCE_CONTEXT.md`, `CONTEXT_CAPSULES.md`, `CAPSULE_SPEC.md`, `@dueltech/capsule-classifier` v0.2+ |
| Next review | After first pilot run on the AIDA capsule corpus and `AI_GOVERNANCE_CONTEXT.md` |

### Lineage

- **v1.0 (13 April 2026).** Initial Duel framework, seven dimensions: Self-Containment, Buildability, Semantic Coherence, Boundary Precision, Outcome to Output Correlation, Definitional Completeness, Actionability. Based on the contextcapsules.com v1 spec extended with Duel platform capsule format. Served as the in-plugin verifier reference for the `context-capsule` plugin.
- **v1.1 (13 May 2026).** First release that ships with Duel's governance binding. Adds D8 (Data Origin) with soft N/A, D9 (Usage Intent and Access Rights, gate 80%), D10 (Use Cases and Risk Treatment), D10A (A-Risk Use Case Treatment, gate 80%, soft N/A), D11 (Agent Orchestration, universal). Adds the Governance Gate. Late-binds capability vocabulary, tier vocabulary, hard boundaries and risk categories to `AI_GOVERNANCE_CONTEXT.md` served by `governance.aida.duel.tech`. Framework now lives in the governance repo and is read by the classifier at evaluation time rather than bundled into the plugin.
