Why a separate process for GenAI features?
Conventional discovery assumes deterministic software: the same input always produces the same output, and "done" is binary. GenAI features break both assumptions. Output is probabilistic, quality exists on a spectrum, the system changes after deployment, and failure modes are often invisible. Applying a standard discovery process to GenAI features produces unclear acceptance criteria, cost overruns from token usage, and painful post-launch conversations about why the AI "isn't working."
This process activates when a proposed feature is identified as GenAI-powered. It runs alongside — and extends — Tkxel's standard discovery process. Non-GenAI features in the same engagement follow the standard path. GenAI features follow this process from Stage 1 through to a signed Feature Canvas.
Process at a Glance — What each stage produces
| Stage | Output |
|---|---|
| Stage 1 | Feature Classification Register |
| Stage 2 | Signed Expectation & IP Consent |
| Stage 3 | HITL Map + Failure Map + Prompt Brief |
| Stage 4 | 3-Layer AC Document + Rubric |
| Stage 5 | Cost Estimate + Scores + Roadmap |
| Stage 6 | Post-Launch Health Plan |
All outputs from Stages 2–6 are consolidated into the GenAI Feature Canvas — the primary sign-off artifact. One canvas per feature. Signed by both parties before any development begins on that feature.
Stage 1 — Feature Classification Gate
Determine whether a feature genuinely requires GenAI, and if so, whether it should automate or augment a user's task. This is the most important decision in the entire process — it prevents over-engineering and sets the right expectation before scoping begins.
The right question is not "Can we use AI to do this?" — it is "Would a simpler rule-based solution serve this need equally well?" A rule-based approach is easier to build, explain, debug, maintain, and sign off. GenAI adds value only where the task genuinely requires it.
✅ GenAI is probably the right choice when…
- The task involves natural language understanding or generation
- Output must be personalised per user or context at scale
- Patterns in data change over time and cannot be hard-coded
- The task requires reasoning across unstructured content
- The "right" answer cannot be fully specified in advance
- A conversational or agent-based experience is the core feature
❌ A rule-based approach is probably better when…
- Predictability is critical — users must get the same result every time
- Information is static or limited in variation
- The cost of errors is very high relative to the benefit
- Full transparency and auditability of every decision is required
- Speed to market matters more than tolerance for quality variance
- Users have explicitly said they do not want this task automated
Once GenAI is confirmed as the right approach, decide whether the feature should fully automate a task or augment a person's ability to do it better. This decision shapes the HITL design, the acceptance criteria, and the level of human oversight required.
🤖 Automate when…
- The task is repetitive, tedious, or high-volume
- Users are comfortable fully delegating it
- There is broad agreement on what "correct output" looks like
- Human oversight can be occasional rather than continuous
Typical examples
Meeting summaries · document classification · email triage · report generation · code documentation
🧑‍💻 Augment when…
- The user values doing the task themselves — AI assists, not replaces
- Personal responsibility or accountability for the output matters
- Stakes are high (legal, financial, medical, reputational)
- The user has a creative vision they want to execute
Typical examples
Proposal writing assistant · design co-pilot · code suggestion · sales call coaching · contract review aid
Users almost always underspecify what they want from an AI feature. They state a surface request but leave critical constraints, sub-goals, and edge conditions unstated — not because they are being vague, but because they assume the system will understand context the way a human would. Uncovering this before design begins prevents the two most common GenAI failures: a model that optimises for the wrong thing, and a model that confidently acts on unintended goals.
🎯 Primary Goal
What is the user's actual underlying goal — not the surface request?
User says: "Show me a variety of running trails"
Primary goal: Stay engaged with running so they don't quit — variety is the mechanism, not the goal itself.
🔗 Sub-goals & Dependencies
What must the user solve before or alongside the primary goal? These are often invisible to the user but critical to a safe, useful output.
Before choosing a trail, the user needs to: warm up correctly for the terrain type, ensure they have water and gear, and confirm the route is safe for their fitness level.
❓ Underspecification
What critical information does the user assume the AI already knows — but has not stated? If the model fills these gaps incorrectly, the output will be wrong even if it looks right.
Unstated for "show me running trails": physical limitations, fitness level, geographic location, preferred duration, personal safety concerns, access to routes.
⚖️ Optimisation Conflicts
Can optimising for one goal silently compromise another? This is where AI features cause harm without anyone noticing — the model achieves the stated goal at the expense of an unstated one.
Optimising for "variety and engagement" could lead the model to recommend increasingly challenging or dangerous trails — maximising the metric while undermining the user's safety and fitness goals.
Feature Classification Register
- All proposed features listed with GenAI / non-GenAI decision recorded
- Each confirmed GenAI feature labelled: Automate / Augment / Hybrid
- Features where GenAI is not justified returned to standard discovery backlog
- Confirmed list of GenAI features to proceed through Stages 2–6
Stage 2 — Expectation Alignment & Client Sign-off
Have the honest conversation before a line of code is written. Cover GenAI limitations, data governance, IP protection, and what "good output" looks like. Most delivery problems with GenAI features originate here — teams skip this conversation and have it in UAT instead.
Run this as a verbal workshop exercise, not a document review. Have the client complete the 4-line statement below out loud, in the room. If they cannot complete a line, it becomes a risk item to resolve before scoping proceeds. The exercise surfaces mismatched expectations before they become contract disputes.
The facilitator reads each prompt aloud. The client stakeholder completes it. Answers are documented and become the expectation baseline for this feature.
Walk through each item with the client. A check means the client understands and accepts the limitation. Any unchecked item must be discussed and resolved before the engagement proceeds.
- Hallucination: The model may generate confident, plausible, but factually incorrect output. Human review is required before any AI-generated content is used in a business context.
- Non-determinism: The same input may produce different outputs on different runs. A response that passes in testing does not guarantee identical results in production.
- Context window limits: The model can only process a limited amount of text at once. Very long documents may be truncated. In long conversations, early context may be lost.
- Latency: GenAI responses are slower than deterministic APIs — typically 2–15 seconds. The UI must be designed with loading states, progress indicators, and cancellation in mind.
- Model updates: The underlying model provider may update their model without notice. Output quality or style may change without any code change on our side. Ongoing monitoring is required.
- Content refusals: The model may decline to generate certain content based on its safety training. Edge cases must be discovered and handled in the test bank before go-live, not in UAT.
- Cost variability: Token-based pricing means costs scale with usage volume and prompt length. A spike in users or longer inputs will increase costs. Usage monitoring must be in place from day one.
This must be signed before any GenAI development begins. Client IP and sensitive data must never enter model training pipelines or be exposed in shared inference contexts. This consent agreement clarifies what data can and cannot be used with external model APIs.
✅ Permitted
- Using client-approved, non-sensitive content as retrieval context (RAG)
- Sending anonymised or synthetic data to model APIs for development and testing
- Using public documentation as context for code generation
- Processing data via zero-data-retention API tiers where the provider does not train on inputs
❌ Not Permitted (without explicit written approval)
- Sending PII, credentials, or sensitive business data to any external model API
- Fine-tuning any model on client IP or proprietary data
- Using client data in shared or multi-tenant inference environments
- Storing client conversation data on third-party AI platforms
Signed Expectation Alignment Document
- Completed 4-line Expectation Statement per feature — documented and client-confirmed
- Limitations Disclosure Checklist — all items checked and signed by client stakeholder
- IP & Data Governance Consent form signed before development begins
Stage 3 — Feature Design & Human-in-the-Loop Mapping
Design the feature using GenAI-specific artifacts. These extend or replace conventional wireframes and user stories for GenAI features. Three artifacts are required: a Human-in-the-Loop Map, a Failure Mode Map, and a Prompt Design Brief.
For every GenAI feature, map where and how humans interact with, override, and improve the AI across four zones. HITL design is not optional — it is an architectural and UX decision that affects build effort, cost, and the acceptance criteria defined in Stage 4.
Zone 1: First Use
How is the AI capability introduced? What can it do? What are its limits?
Decisions: onboarding copy, capability display, opt-in or opt-out mechanism, initial trust-building
Zone 2: During Use
How does the human steer, edit, approve, or override AI output in real time?
Decisions: inline editing, confidence indicators, regenerate button, streaming vs batch display
Zone 3: When Things Go Wrong
What happens when output is wrong, refused, or unusable?
Decisions: manual fallback path, error messages, feedback capture, escalation route
Zone 4: Over Time
How do humans improve the system after launch?
Decisions: feedback loop design, prompt tuning cadence, model version monitoring, quality review process
GenAI has failure types that do not exist in conventional software. All three must be mapped. Background errors are the most dangerous — neither the user nor the system notices them, and they can persist for weeks.
⚠️ Visible Failures (user notices)
- Hallucinated facts the user can spot
- Generation refused due to content policy
- Off-topic or nonsensical output
- Timeout or no response returned
Response required: Design error messages, fallback paths, and feedback capture for each type before build begins.
🔴 Background Errors (nobody notices)
- Subtly incorrect facts presented confidently
- Outdated information from stale retrieval context
- Systematic bias in outputs (e.g. always recommends the same approach)
- Silent quality drift after a model provider update
Response required: Active monitoring, automated eval runs against a golden dataset, and human sampling — not just user feedback.
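The background-error response can be sketched as a golden-dataset regression run: re-score a fixed set of prompts and compare against the scores recorded at launch. This is illustrative only; `score_output` stands in for whatever eval pipeline generates and scores a response on the agreed 1–4 rubric, and the drift tolerance shown here is an assumed value, not a prescribed one.

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    prompt: str
    baseline_score: float  # rubric score (1-4) recorded at launch

def detect_drift(cases, score_output, tolerance=0.3):
    """Re-run each golden prompt and flag cases whose score dropped
    by more than `tolerance` versus the launch baseline.

    `score_output(prompt)` is a hypothetical hook that generates a
    response and scores it on the agreed 1-4 rubric.
    """
    regressions = []
    for case in cases:
        current = score_output(case.prompt)
        if case.baseline_score - current > tolerance:
            regressions.append((case.prompt, case.baseline_score, current))
    return regressions

# Stubbed scorer standing in for a real eval pipeline:
golden = [GoldenCase("Summarise meeting X", 3.5), GoldenCase("Triage email Y", 3.0)]
stub_scores = {"Summarise meeting X": 3.4, "Triage email Y": 2.2}
drifted = detect_drift(golden, lambda p: stub_scores[p])
# "Triage email Y" dropped 0.8, past the 0.3 tolerance, so it is flagged
```

Run on a cadence (see Stage 6), a check like this catches silent quality drift that user feedback alone will miss.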
🔵 Context Errors (system works, user unhappy)
- Correct output, wrong timing or context
- Output technically accurate but misses intent
- Cultural or domain-specific mismatch
- Over-personalisation from stale or incorrect signals
Response required: User research and clear explanation of what signals the system uses to generate output.
Prompt design is a scoped deliverable, not an implementation detail. Prompts are the primary mechanism for controlling output quality and behaviour. A prompt change can fundamentally alter a feature. Prompts must be versioned, owned, and included in the change management process — treated the same as code.
Feature Design Package
- HITL Map — all four zones documented with named owners per zone
- Failure Mode Map — visible, background, and context errors with designed mitigations
- Prompt Design Brief — role, guardrails, context method, ownership, and change process defined
- AI Feature Inventory entry (one row per feature: model, data sources, HITL zones, risk level)
Stage 4 — Acceptance Criteria Definition & Sign-off Gate
Define what "done" and "good enough" mean for probabilistic output — in language a client can sign off on. GenAI acceptance criteria have three distinct layers and use tolerance bands rather than binary pass/fail. This is the single most common gap in GenAI delivery processes, and the most commercially important to get right.
Traditional ACs are binary — it works or it doesn't. GenAI ACs must cover three separate layers. All three must be defined and signed before development begins. Missing any layer creates ambiguity at UAT and a potential contract dispute at go-live.
Layer 1 — Functional ACs
The plumbing works. Deterministic. Binary pass/fail. Standard to test.
- Feature loads and is accessible
- API calls complete successfully
- Output renders in the correct format
- Latency is within defined threshold
- Fallback triggers correctly on failure
- Feedback mechanism records and stores correctly
Layer 2 — Quality ACs
Output meets a defined quality bar. Probabilistic. Expressed as tolerance bands.
- Measured against a pre-built offline test bank
- Scored on agreed dimensions: accuracy, relevance, tone, safety
- Format: "≥X% of outputs score ≥Y on rubric Z"
- Reviewers calibrated on scoring rubric before UAT begins
- Minimum test bank size: 30 prompts (50 recommended)
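A Layer-2 tolerance band of the form "≥X% of outputs score ≥Y" reduces to a simple check over reviewer scores. A minimal sketch, assuming scores come from calibrated reviewers using the 1–4 rubric defined below:

```python
def quality_ac_passes(scores, min_score=3, min_pass_rate=0.80):
    """Check a Layer-2 tolerance band: '>=X% of outputs score >=Y'.

    `scores` is the list of rubric scores (1-4) assigned across the
    offline test bank by calibrated reviewers.
    """
    if len(scores) < 30:  # minimum test bank size per the AC definition
        raise ValueError("Test bank must contain at least 30 scored prompts")
    pass_rate = sum(s >= min_score for s in scores) / len(scores)
    return pass_rate >= min_pass_rate

# A 30-prompt bank where 25 outputs scored Acceptable (3) or better:
bank = [4] * 10 + [3] * 15 + [2] * 4 + [1] * 1
quality_ac_passes(bank)  # 25/30 ≈ 83%, which clears an 80% threshold
```

The same function, with a different `min_pass_rate`, encodes any of the example ACs below; the point is that the band is computable, not a matter of UAT opinion.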
Layer 3 — Guardrail ACs
The feature never does X. Zero tolerance. Binary — any failure blocks go-live.
- Never outputs content in prohibited categories
- Never includes restricted information (e.g. live pricing, PII)
- Never bypasses mandatory human approval steps
- Never stores conversation data in prohibited locations
"When presented with [input type], the feature must produce output that scores [threshold] or above on [dimension] in [X out of Y] evaluations by [evaluator type], tested against a bank of [N] representative prompts."
- "80% of AI-generated RFP sections must be rated 'Acceptable' or better against the scoring rubric, measured across a test bank of 50 representative prompts."
- "0% of outputs may contain pricing information." (Guardrail — zero tolerance.)
- "Average generation time must be under 8 seconds for a standard section." (Functional — binary.)
- "Responses must be rated 'Helpful' or 'Partially helpful' in at least 85% of test cases by two independent reviewers."
- "Factual accuracy must be ≥90% against the verified knowledge base, confirmed by monthly automated eval run."
- "The feature must never provide medical or legal advice under any circumstances." (Guardrail — zero tolerance.)
Share this with reviewers during calibration before UAT begins. All reviewers must align on scoring definitions before evaluation starts.
| Score | Label | Description | Reviewer Action |
|---|---|---|---|
| 4 — Excellent | Accept | Accurate, relevant, and appropriately toned. Requires minimal or no editing before use. | Accept as-is |
| 3 — Acceptable | Accept with edits | Mostly accurate and relevant. Minor edits needed — wording or completeness — but structure is sound. | Edit then use |
| 2 — Marginal | Reject | Partially useful but significant issues — inaccuracy, off-topic sections, or wrong tone. Requires substantial rework. | Reject — regenerate or write manually |
| 1 — Unacceptable | Reject + Flag | Wrong, harmful, off-brief, or violates a guardrail. Cannot be used in any form. | Reject + flag for investigation |
Signed GenAI Acceptance Criteria Document
- Layer 1 (Functional), Layer 2 (Quality with tolerance bands), Layer 3 (Guardrails) all defined
- Scoring rubric agreed and reviewer calibration plan confirmed
- Test bank scope, ownership, and delivery timeline confirmed
- Go-live threshold defined in writing and signed by client Product Owner
Stage 5 — Cost Estimation, Prioritisation & Roadmap
Estimate token costs, score and prioritise GenAI features using Tkxel's extended scoring matrix, and produce a sequenced MVP roadmap. GenAI features require two additional scoring dimensions beyond the standard TXL criteria.
Token costs are usage-based and hard to predict precisely at discovery time. The goal is not precision — it is a defensible range that protects both Tkxel and the client from commercial surprises. Present the estimate as a low/high range with clearly stated assumptions.
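The low/high range can be produced from a handful of stated assumptions. A minimal sketch; the token counts and per-1K prices below are placeholder assumptions, not any provider's actual rates, and should be replaced with the figures agreed for the feature:

```python
def monthly_token_cost_range(
    requests_per_month,
    input_tokens=(500, 2_000),     # (low, high) tokens per request: assumptions
    output_tokens=(200, 800),      # (low, high) tokens per response: assumptions
    input_price_per_1k=0.003,      # illustrative price, not a real provider rate
    output_price_per_1k=0.015,     # illustrative price, not a real provider rate
):
    """Return a (low, high) monthly cost estimate in dollars."""
    def cost(inp, out):
        return requests_per_month * (
            inp / 1000 * input_price_per_1k + out / 1000 * output_price_per_1k
        )
    return (
        cost(input_tokens[0], output_tokens[0]),
        cost(input_tokens[1], output_tokens[1]),
    )

low, high = monthly_token_cost_range(10_000)
# low:  10,000 x (0.5 x 0.003 + 0.2 x 0.015) = $45/month
# high: 10,000 x (2.0 x 0.003 + 0.8 x 0.015) = $180/month
```

Presenting the $45–$180 spread with its assumptions written down is exactly the "defensible range" the stage calls for: when usage deviates, the conversation is about which assumption moved, not about a broken promise.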
Score each GenAI feature across five dimensions. The first three are standard Tkxel criteria. The last two are GenAI-specific additions required for all features passing through this process.
| Dimension | Score 1 — Low | Score 3 — Medium | Score 5 — High |
|---|---|---|---|
| T — Transformation Impact (business alignment & strategic impact) | Low strategic alignment, unclear value proposition | Moderate alignment, clear departmental value | Strong strategic fit, material revenue or efficiency impact |
| X — Experience Impact (user desirability) | Low user desirability, high anticipated change resistance | Users see value, moderate adoption expected | High user demand, strong pull, low resistance |
| L — Launch Feasibility (technical & operational readiness) | Data unavailable, governance unclear, high complexity | Data available, governance manageable, moderate complexity | Data ready, clear compliance path, low complexity |
| C — Cost Viability (GenAI-specific: token economics) | Token costs unacceptable at projected scale, or unknown | Costs within budget with usage controls in place | Low token cost per unit of value delivered, clear agreed cost model |
| R — Reusability (GenAI-specific: foundation for future features) | One-off feature, no reuse potential identified | Pattern partially reusable in 1–2 future features | Establishes a reusable foundation for multiple future features |
Calculating the 2×2 position: Strategic Business Impact = mean of the T and X scores. Execution Fit = mean of the L and C scores. Plot each feature on the 2×2 matrix: high impact, high fit = Accelerate to MVP · high impact, low fit = Incubate · low impact, high fit = Quick Win · low impact, low fit = Shelve. The R score acts as a tiebreaker between features that land in the same position.
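The placement rule above can be sketched directly. One assumption is made explicit here: the midpoint of 3 on the 1–5 scale is used as the high/low cutoff, since the process does not fix one.

```python
def plot_position(t, x, l, c):
    """Map T/X/L/C scores (each 1-5) to a 2x2 quadrant.

    A cutoff of 3 is assumed as the high/low boundary; adjust to
    whatever threshold the engagement agrees on.
    """
    strategic_impact = (t + x) / 2   # mean of T and X
    execution_fit = (l + c) / 2      # mean of L and C
    high_impact = strategic_impact >= 3
    high_fit = execution_fit >= 3
    if high_impact and high_fit:
        quadrant = "Accelerate to MVP"
    elif high_impact:
        quadrant = "Incubate"
    elif high_fit:
        quadrant = "Quick Win"
    else:
        quadrant = "Shelve"
    return strategic_impact, execution_fit, quadrant

plot_position(5, 4, 3, 4)  # → (4.5, 3.5, 'Accelerate to MVP')
```

The R score stays outside the function deliberately: it only breaks ties between features that land in the same position.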
Using the HITL Map from Stage 3, estimate the engineering and UX effort required for human-AI interaction design. This contributes to sprint estimates alongside standard build complexity.
XS — Minimal HITL
Read-only display of AI output. No editing, no explicit feedback mechanism. Human uses or ignores output.
Approx. 0.5 sprint additional effort
S — Basic HITL
Editable output field. Accept or reject action. Simple thumbs up/down feedback. No complex override flows.
Approx. 1 sprint additional effort
M — Standard HITL
Multi-step review flow. Confidence indicators. Audit trail. Regenerate with parameters. Feedback routed to review team.
Approx. 2–3 sprints additional effort
L — Complex HITL
Role-based approval workflow. Full audit log. Admin review dashboard. Integration with existing approval systems.
Approx. 4+ sprints additional effort
GenAI Feature Roadmap
- Extended 5-dimension scores (T, X, L, C, R) for all GenAI features
- 2×2 prioritisation matrix placement: Accelerate / Quick Win / Incubate / Shelve
- Token cost estimates per feature with agreed payment model
- HITL effort sizing (XS/S/M/L) per feature contributing to sprint estimate
- Sequenced MVP roadmap with phasing and accountable owners per feature
Stage 6 — Post-Launch Model (required for all GenAI features)
Define what happens after go-live — before go-live. GenAI features are never "done." Prompts drift, models update, usage patterns change, and quality degrades silently if not monitored. This stage defines the operational and commercial model for ongoing AI health.
The majority of AI features that fail do not fail at launch — they fail 3–6 months later. The root causes are almost always the same: no measurement, no defined ownership of quality, and no budget for ongoing improvement. This stage prevents that outcome. If it is skipped at discovery, the conversation will happen as an emergency 6 months into production.
📥 Implicit Feedback (collected automatically)
- Accept / reject rate on AI-generated outputs
- Edit distance — how much users change the output before using it
- Regeneration rate — how often users ask for a new output
- Time-to-accept — how long users spend reviewing before acting
- Feature abandonment rate — users who start but don't complete
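Several of these signals can be derived from a plain usage log. A sketch under assumptions: the event schema below (`accepted`, `generated`, `final`, `regenerated`) is hypothetical, and edit distance is approximated with Python's standard-library `difflib` rather than a dedicated metric.

```python
import difflib

def edit_distance_ratio(generated, final):
    """Fraction of the AI output the user changed before using it (0 = used as-is)."""
    return 1 - difflib.SequenceMatcher(None, generated, final).ratio()

def implicit_signals(events):
    """Aggregate implicit feedback for one feature.

    `events` is a hypothetical usage log: one dict per generation with
    keys accepted (bool), generated (str), final (str, the text the
    user actually kept), and regenerated (bool).
    """
    n = len(events)
    accepted = [e for e in events if e["accepted"]]
    return {
        "accept_rate": len(accepted) / n,
        "regeneration_rate": sum(e["regenerated"] for e in events) / n,
        "mean_edit_distance": (
            sum(edit_distance_ratio(e["generated"], e["final"]) for e in accepted)
            / len(accepted)
            if accepted else None
        ),
    }

events = [
    {"accepted": True, "generated": "Draft A", "final": "Draft A", "regenerated": False},
    {"accepted": True, "generated": "Draft B", "final": "Draft B v2", "regenerated": False},
    {"accepted": False, "generated": "Draft C", "final": "", "regenerated": True},
]
signals = implicit_signals(events)  # accept_rate 2/3, regeneration_rate 1/3
```

Because these metrics are computed, not solicited, they keep flowing even when users stop clicking thumbs up/down, which is why the process requires both implicit and explicit channels.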
📣 Explicit Feedback (user-initiated)
- Thumbs up / down on each output
- Optional free-text "what was wrong?" field on rejections
- Monthly in-product satisfaction pulse (CSAT or similar)
- Ability to flag specific outputs as problematic
Every GenAI feature must have at least three defined alert conditions before go-live. Use this format: "If [metric] for [feature] drops below / goes above [threshold], [named person] will take [specific action] within [timeframe]."
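The alert format maps cleanly onto a small data structure, which makes the "named person, specific action, timeframe" parts hard to omit. The values below are illustrative assumptions; real thresholds and owners come from the signed health plan.

```python
from dataclasses import dataclass

@dataclass
class AlertCondition:
    """One alert in the agreed format: 'If [metric] for [feature] drops
    below / goes above [threshold], [owner] takes [action] within
    [timeframe].'"""
    feature: str
    metric: str
    threshold: float
    direction: str   # "below" or "above"
    owner: str       # named person, not a team
    action: str
    timeframe: str

    def triggered(self, value):
        if self.direction == "below":
            return value < self.threshold
        return value > self.threshold

# Illustrative only: feature name, threshold, and owner are placeholders.
alert = AlertCondition(
    feature="proposal-drafts",
    metric="accept_rate",
    threshold=0.70,
    direction="below",
    owner="J. Doe",
    action="review the last 50 rejections and retune the prompt",
    timeframe="48h",
)
alert.triggered(0.64)  # → True: the named owner acts within the timeframe
```

Three or more such conditions per feature, evaluated on the agreed cadence, satisfy the pre-go-live requirement above.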
Prompt tuning and model monitoring after go-live are not bug fixes. They are a distinct, ongoing category of work requiring their own commercial scope. Clients must understand this before sign-off — post-launch tuning is how a GenAI feature continues to improve, and it must be funded deliberately.
Option A — Included in support retainer
Monthly AI health review and prompt updates included in the standard support retainer. Appropriate for low-complexity features with stable use cases and low feedback volume.
Best suited to: single feature, stable domain, monthly eval cadence is sufficient
Option B — Quarterly improvement sprint
Dedicated 1-sprint engagement per quarter: review feedback trends, update prompts, run regression, produce quality trend report. Scoped and contracted separately per quarter.
Best suited to: multiple features, evolving domain, active user feedback volume
Option C — Continuous AI operations
Dedicated AI engineer on monthly retainer with automated monitoring, proactive tuning, and monthly quality reports. For production features where output quality is business-critical.
Best suited to: high-stakes features, regulated environments, rapid user growth
Post-Launch AI Health Plan
- Feedback design: implicit signals defined + explicit feedback mechanism scoped
- At least 3 alert conditions defined with named owners and response timeframes
- Automated eval cadence and human review cadence agreed
- Tuning engagement model selected (A/B/C) and commercially scoped before go-live
- Client-side and Tkxel-side AI health owners named and confirmed
GenAI Feature Canvas — Primary Sign-off Artifact
One canvas per GenAI feature. This is the consolidated sign-off document for all GenAI features — replacing or extending the conventional feature spec. It captures the outputs of all six stages on a single page. The client Product Owner, the client Technical Lead, and the Tkxel Delivery Lead must all sign before development begins.
GenAI Feature Canvas — [Feature Name] · v1.0 · [Date]
Feature Identity
Feature name
Classification
Model tier & provider
HITL complexity
Outcome Statement
This feature should improve [measurable metric] by [threshold] within [timeframe] for [user group]…
Expectation Statement
It will help by… / It will not… / Over time… / Users can improve it by…
Human-in-the-Loop Design
Zone 1: first use · Zone 2: during use · Zone 3: when wrong · Zone 4: over time. Named owner per zone.
Failure Mode Map
Top 3 visible failures + mitigation. Background error detection plan. Highest-stakes scenario + control.
Acceptance Criteria — Layer 1 (Functional)
Binary pass/fail ACs. Latency threshold. Fallback trigger. All deterministic conditions listed.
Acceptance Criteria — Layer 2 (Quality)
Tolerance bands: "[X]% of outputs score [Y] on [dimension] across [N] test bank prompts." Rubric attached.
Acceptance Criteria — Layer 3 (Guardrails — zero tolerance)
This feature must NEVER: [list all prohibited output categories]. Tested against full test bank at launch and monitored continuously.
Token Cost Estimate
Low: $__ / month · High: $__ / month
Payment model: client / Tkxel / shared
Feature Score (T / X / L / C / R)
T: __ · X: __ · L: __ · C: __ · R: __
Strategic Impact: __ · Execution Fit: __
Quadrant: Accelerate / QW / Incubate / Shelve
Prompt Design
Owner: __ · Version control: __ · Context method: __ · Change approval: __
Post-Launch Model
Eval cadence: __ · Alert threshold: __ · Tuning model: A / B / C
Client owner: __ · Tkxel owner: __
Sign-off — Required from all three parties before development begins
Client Product Owner
Name · Date
Client Technical Lead
Name · Date
Tkxel Delivery Lead
Name · Date