Tkxel — Internal SOP · GenAI Discovery Workshop · v1.0 · 2026

GenAI Feature Discovery Workshop

A structured 6-stage process for discovering, scoping, and signing off on Generative AI features — with worksheets, checklists, and example outputs for every stage. To be followed by all Tkxel delivery teams when a client engagement includes GenAI features.


Why a separate process for GenAI features?

Conventional discovery assumes deterministic software: the same input always produces the same output, and "done" is binary. GenAI features break both assumptions. Output is probabilistic, quality exists on a spectrum, the system changes after deployment, and failure modes are often invisible. Applying a standard discovery process to GenAI features produces unclear acceptance criteria, cost overruns from token usage, and painful post-launch conversations about why the AI "isn't working."

This process activates when a proposed feature is identified as GenAI-powered. It runs alongside — and extends — Tkxel's standard discovery process. Non-GenAI features in the same engagement follow the standard path. GenAI features follow this process from Stage 1 through to a signed Feature Canvas.

Process at a Glance — What each stage produces

  • Stage 1: Feature Classification Register
  • Stage 2: Signed Expectation & IP Consent
  • Stage 3: HITL Map + Failure Map + Prompt Brief
  • Stage 4: 3-Layer AC Document + Rubric
  • Stage 5: Cost Estimate + Scores + Roadmap
  • Stage 6: Post-Launch Health Plan

All outputs from Stages 2–6 are consolidated into the GenAI Feature Canvas — the primary sign-off artifact. One canvas per feature. Signed by both parties before any development begins on that feature.

Stage 1: Feature Classification Gate

Determine whether a feature genuinely requires GenAI, and if so, whether it should automate or augment a user's task. This is the most important decision in the entire process — it prevents over-engineering and sets the right expectation before scoping begins.

Step 1A — Does this feature actually need GenAI?

The right question is not "Can we use AI to do this?" — it is "Would a simpler rule-based solution serve this need equally well?" A rule-based approach is easier to build, explain, debug, maintain, and sign off. GenAI adds value only where the task genuinely requires it.

✅ GenAI is probably the right choice when…

  • The task involves natural language understanding or generation
  • Output must be personalised per user or context at scale
  • Patterns in data change over time and cannot be hard-coded
  • The task requires reasoning across unstructured content
  • The "right" answer cannot be fully specified in advance
  • A conversational or agent-based experience is the core feature

❌ A rule-based approach is probably better when…

  • Predictability is critical — users must get the same result every time
  • Information is static or limited in variation
  • The cost of errors is very high relative to the benefit
  • Full transparency and auditability of every decision is required
  • Speed to market matters more than tolerance for quality variance
  • Users have explicitly said they do not want this task automated

Step 1B — Automation or Augmentation?

Once GenAI is confirmed as the right approach, decide whether the feature should fully automate a task or augment a person's ability to do it better. This decision shapes the HITL design, the acceptance criteria, and the level of human oversight required.

🤖 Automate when…

  • The task is repetitive, tedious, or high-volume
  • Users are comfortable fully delegating it
  • There is broad agreement on what "correct output" looks like
  • Human oversight can be occasional rather than continuous

Typical examples

Meeting summaries · document classification · email triage · report generation · code documentation

🧑‍💻 Augment when…

  • The user values doing the task themselves — AI assists, not replaces
  • Personal responsibility or accountability for the output matters
  • Stakes are high (legal, financial, medical, reputational)
  • The user has a creative vision they want to execute

Typical examples

Proposal writing assistant · design co-pilot · code suggestion · sales call coaching · contract review aid

Step 1C — User Intent & Problem Framing

Users almost always underspecify what they want from an AI feature. They state a surface request but leave critical constraints, sub-goals, and edge conditions unstated — not because they are being vague, but because they assume the system will understand context the way a human would. Uncovering this before design begins prevents the two most common GenAI failures: a model that optimises for the wrong thing, and a model that confidently acts on unintended goals.

🎯 Primary Goal

What is the user's actual underlying goal — not the surface request?

Example

User says: "Show me a variety of running trails"
Primary goal: Stay engaged with running so they don't quit — variety is the mechanism, not the goal itself.

🔗 Sub-goals & Dependencies

What must the user solve before or alongside the primary goal? These are often invisible to the user but critical to a safe, useful output.

Example

Before choosing a trail, the user needs to: warm up correctly for the terrain type, ensure they have water and gear, and confirm the route is safe for their fitness level.

❓ Underspecification

What critical information does the user assume the AI already knows — but has not stated? If the model fills these gaps incorrectly, the output will be wrong even if it looks right.

Example

Unstated for "show me running trails": physical limitations, fitness level, geographic location, preferred duration, personal safety concerns, access to routes.

⚖️ Optimisation Conflicts

Can optimising for one goal silently compromise another? This is where AI features cause harm without anyone noticing — the model achieves the stated goal at the expense of an unstated one.

Example

Optimising for "variety and engagement" could lead the model to recommend increasingly challenging or dangerous trails — maximising the metric while undermining the user's safety and fitness goals.

Worksheet 1C — User Intent & Problem Framing (complete per feature, run in workshop with client)

  • Primary goal: write the actual words a user would say or type. Avoid polished product language. Then go one level deeper than the surface request — what outcome makes the user's day better?
  • Sub-goals & dependencies: list the steps, knowledge, or conditions that the user needs but may not mention. These often become guardrails or context requirements.
  • Underspecification: these are the assumptions users make. Each one is a potential failure point if the AI guesses wrong. For each item, note how you will surface or handle it.
  • Optimisation conflicts: for each conflict, note the guardrail or design decision that prevents it. These feed directly into Layer 3 (Guardrail) Acceptance Criteria in Stage 4.

Worksheet 1 — Feature Classification (complete for every proposed GenAI feature)

Write the problem first. Do not start with "we want to use AI to…"

Stage 1 Output

Feature Classification Register

  • All proposed features listed with GenAI / non-GenAI decision recorded
  • Each confirmed GenAI feature labelled: Automate / Augment / Hybrid
  • Features where GenAI is not justified returned to standard discovery backlog
  • Confirmed list of GenAI features to proceed through Stages 2–6

Stage 2: Expectation Alignment & Client Sign-off

Have the honest conversation before a line of code is written. Cover GenAI limitations, data governance, IP protection, and what "good output" looks like. Most delivery problems with GenAI features originate here — teams skip this conversation and have it in UAT instead.

Step 2A — GenAI Limitations Briefing

Run this as a verbal workshop exercise, not a document review. Have the client complete the 4-line statement below out loud, in the room. If they cannot complete a line, it becomes a risk item to resolve before scoping proceeds. The exercise surfaces mismatched expectations before they become contract disputes.

Worksheet 2A — Expectation Statement (complete per feature, client-led, in workshop)

The facilitator reads each prompt aloud. The client stakeholder completes it. Answers are documented and become the expectation baseline for this feature.

Step 2B — Limitations Disclosure Checklist

Walk through each item with the client. A check means the client understands and accepts the limitation. Any unchecked item must be discussed and resolved before the engagement proceeds.

  • Hallucination: The model may generate confident, plausible, but factually incorrect output. Human review is required before any AI-generated content is used in a business context.
  • Non-determinism: The same input may produce different outputs on different runs. A response that passes in testing does not guarantee identical results in production.
  • Context window limits: The model can only process a limited amount of text at once. Very long documents may be truncated. In long conversations, early context may be lost.
  • Latency: GenAI responses are slower than deterministic APIs — typically 2–15 seconds. The UI must be designed with loading states, progress indicators, and cancellation in mind.
  • Model updates: The underlying model provider may update their model without notice. Output quality or style may change without any code change on our side. Ongoing monitoring is required.
  • Content refusals: The model may decline to generate certain content based on its safety training. Edge cases must be discovered and handled in the test bank before go-live, not in UAT.
  • Cost variability: Token-based pricing means costs scale with usage volume and prompt length. A spike in users or longer inputs will increase costs. Usage monitoring must be in place from day one.
Step 2C — IP & Data Governance Consent

This must be signed before any GenAI development begins. Client IP and sensitive data must never enter model training pipelines or be exposed in shared inference contexts. This consent agreement clarifies what data can and cannot be used with external model APIs.

✅ Permitted

  • Using client-approved, non-sensitive content as retrieval context (RAG)
  • Sending anonymised or synthetic data to model APIs for development and testing
  • Using public documentation as context for code generation
  • Processing data via zero-data-retention API tiers where the provider does not train on inputs

❌ Not Permitted (without explicit written approval)

  • Sending PII, credentials, or sensitive business data to any external model API
  • Fine-tuning any model on client IP or proprietary data
  • Using client data in shared or multi-tenant inference environments
  • Storing client conversation data on third-party AI platforms
Worksheet 2C — IP & Data Governance Consent

Stage 2 Output

Signed Expectation Alignment Document

  • Completed 4-line Expectation Statement per feature — documented and client-confirmed
  • Limitations Disclosure Checklist — all items checked and signed by client stakeholder
  • IP & Data Governance Consent form signed before development begins

Stage 3: Feature Design & Human-in-the-Loop Mapping

Design the feature using GenAI-specific artifacts. These extend or replace conventional wireframes and user stories for GenAI features. Three artifacts are required: a Human-in-the-Loop Map, a Failure Mode Map, and a Prompt Design Brief.

Step 3A — Human-in-the-Loop (HITL) Map

For every GenAI feature, map where and how humans interact with, override, and improve the AI across four zones. HITL design is not optional — it is an architectural and UX decision that affects build effort, cost, and the acceptance criteria defined in Stage 4.

Zone 1: First Use

How is the AI capability introduced? What can it do? What are its limits?

Decisions: onboarding copy, capability display, opt-in or opt-out mechanism, initial trust-building

Zone 2: During Use

How does the human steer, edit, approve, or override AI output in real time?

Decisions: inline editing, confidence indicators, regenerate button, streaming vs batch display

Zone 3: When Things Go Wrong

What happens when output is wrong, refused, or unusable?

Decisions: manual fallback path, error messages, feedback capture, escalation route

Zone 4: Over Time

How do humans improve the system after launch?

Decisions: feedback loop design, prompt tuning cadence, model version monitoring, quality review process

Worksheet 3A — HITL Map (complete per feature)

Step 3B — Failure Mode Map

GenAI has failure types that do not exist in conventional software. All three must be mapped. Background errors are the most dangerous — neither the user nor the system notices them, and they can persist for weeks.

⚠️ Visible Failures (user notices)

  • Hallucinated facts the user can spot
  • Generation refused due to content policy
  • Off-topic or nonsensical output
  • Timeout or no response returned

Response required: Design error messages, fallback paths, and feedback capture for each type before build begins.

🔴 Background Errors (nobody notices)

  • Subtly incorrect facts presented confidently
  • Outdated information from stale retrieval context
  • Systematic bias in outputs (e.g. always recommends the same approach)
  • Silent quality drift after a model provider update

Response required: Active monitoring, automated eval runs against a golden dataset, and human sampling — not just user feedback.

🔵 Context Errors (system works, user unhappy)

  • Correct output, wrong timing or context
  • Output technically accurate but misses intent
  • Cultural or domain-specific mismatch
  • Over-personalisation from stale or incorrect signals

Response required: User research and clear explanation of what signals the system uses to generate output.
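The "automated eval runs against a golden dataset" mitigation for background errors can be sketched in a few lines. This is a minimal illustration, not Tkxel tooling: `generate` and the golden-set fields are hypothetical stand-ins for the feature's real model call and test bank, and a production check would score with a rubric or judge model rather than substring matching.

```python
# Minimal sketch of a golden-dataset regression eval. `generate` stands in
# for whatever call produces the feature's output.

def run_golden_eval(golden_set, generate):
    """Return (pass_rate, failures) across the golden dataset.

    Each case is {"prompt": str, "must_contain": [facts that must appear
    in the output for the case to pass]}.
    """
    failures = []
    for case in golden_set:
        output = generate(case["prompt"]).lower()
        missing = [f for f in case["must_contain"] if f.lower() not in output]
        if missing:
            failures.append({"prompt": case["prompt"], "missing": missing})
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate, failures

# Stubbed model for illustration:
golden = [
    {"prompt": "What is the refund window?", "must_contain": ["30 days"]},
    {"prompt": "What are support hours?", "must_contain": ["9am", "5pm"]},
]
stub = lambda p: ("Refunds are accepted within 30 days."
                  if "refund" in p.lower() else "Support runs 9am to 5pm.")
rate, fails = run_golden_eval(golden, stub)
assert rate == 1.0 and fails == []
```

Run the same eval on every prompt change and on a schedule, so silent drift after a provider model update shows up as a falling pass rate rather than a user complaint.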

Worksheet 3B — Failure Mode Map

Step 3C — Prompt Design Brief

Prompt design is a scoped deliverable, not an implementation detail. Prompts are the primary mechanism for controlling output quality and behaviour. A prompt change can fundamentally alter a feature. Prompts must be versioned, owned, and included in the change management process — treated the same as code.

Worksheet 3C — Prompt Design Brief

Stage 3 Output

Feature Design Package

  • HITL Map — all four zones documented with named owners per zone
  • Failure Mode Map — visible, background, and context errors with designed mitigations
  • Prompt Design Brief — role, guardrails, context method, ownership, and change process defined
  • AI Feature Inventory entry (one row per feature: model, data sources, HITL zones, risk level)

Stage 4: Acceptance Criteria Definition & Sign-off Gate

Define what "done" and "good enough" mean for probabilistic output — in language a client can sign off on. GenAI acceptance criteria have three distinct layers and use tolerance bands rather than binary pass/fail. This is the single most common gap in GenAI delivery processes, and the most commercially important to get right.

The 3-Layer AC Model for GenAI Features

Traditional ACs are binary — it works or it doesn't. GenAI ACs must cover three separate layers. All three must be defined and signed before development begins. Missing any layer creates ambiguity at UAT and a potential contract dispute at go-live.

Layer 1 — Functional ACs

The plumbing works. Deterministic. Binary pass/fail. Standard to test.

  • Feature loads and is accessible
  • API calls complete successfully
  • Output renders in the correct format
  • Latency is within defined threshold
  • Fallback triggers correctly on failure
  • Feedback mechanism records and stores correctly

Layer 2 — Quality ACs

Output meets a defined quality bar. Probabilistic. Expressed as tolerance bands.

  • Measured against a pre-built offline test bank
  • Scored on agreed dimensions: accuracy, relevance, tone, safety
  • Format: "≥X% of outputs score ≥Y on rubric Z"
  • Reviewers calibrated on scoring rubric before UAT begins
  • Minimum test bank size: 30 prompts (50 recommended)

Layer 3 — Guardrail ACs

The feature never does X. Zero tolerance. Binary — any failure blocks go-live.

  • Never outputs content in prohibited categories
  • Never includes restricted information (e.g. live pricing, PII)
  • Never bypasses mandatory human approval steps
  • Never stores conversation data in prohibited locations
How to write a Quality AC — the tolerance band formula
Standard Format

"When presented with [input type], the feature must produce output that scores [threshold] or above on [dimension] in [X out of Y] evaluations by [evaluator type], tested against a bank of [N] representative prompts."

Example — RFP Response Generator
  • "80% of AI-generated RFP sections must be rated 'Acceptable' or better against the scoring rubric, measured across a test bank of 50 representative prompts."
  • "0% of outputs may contain pricing information." (Guardrail — zero tolerance.)
  • "Average generation time must be under 8 seconds for a standard section." (Functional — binary.)
Example — Customer Support Chatbot
  • "Responses must be rated 'Helpful' or 'Partially helpful' in at least 85% of test cases by two independent reviewers."
  • "Factual accuracy must be ≥90% against the verified knowledge base, confirmed by monthly automated eval run."
  • "The feature must never provide medical or legal advice under any circumstances." (Guardrail — zero tolerance.)
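A tolerance-band Quality AC of this form is mechanically checkable once the test bank is scored. The sketch below is illustrative (the function name and score values are hypothetical), assuming one 1–4 rubric score per test-bank prompt:

```python
# Illustrative check of a Layer 2 tolerance-band AC:
# ">= min_fraction of outputs must score >= min_score on the rubric".

def quality_ac_passes(rubric_scores, min_score=3, min_fraction=0.80):
    """True if at least `min_fraction` of outputs score `min_score` or above."""
    acceptable = sum(1 for s in rubric_scores if s >= min_score)
    return acceptable / len(rubric_scores) >= min_fraction

# A 50-prompt test bank where 42 outputs scored 3 ("Acceptable") or better:
scores = [4] * 20 + [3] * 22 + [2] * 6 + [1] * 2
assert quality_ac_passes(scores)                         # 84% >= 80%: AC met
assert not quality_ac_passes(scores, min_fraction=0.90)  # a 90% band would fail
```

Guardrail (Layer 3) ACs need no band: a single failing output in the test bank blocks go-live.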
Worksheet 4 — 3-Layer Acceptance Criteria (complete per feature)

Quality AC format: "[X]% of outputs must score [Y] or above on [dimension] across a test bank of [N] prompts"

Output Quality Scoring Rubric (adapt for your feature)

Share this with reviewers during calibration before UAT begins. All reviewers must align on scoring definitions before evaluation starts.

Score | Label | Description | Reviewer Action
4 — Excellent | Accept | Accurate, relevant, and appropriately toned. Requires minimal or no editing before use. | Accept as-is
3 — Acceptable | Accept with edits | Mostly accurate and relevant. Minor edits needed — wording or completeness — but structure is sound. | Edit, then use
2 — Marginal | Reject | Partially useful but significant issues — inaccuracy, off-topic sections, or wrong tone. Requires substantial rework. | Reject — regenerate or write manually
1 — Unacceptable | Reject + flag | Wrong, harmful, off-brief, or violates a guardrail. Cannot be used in any form. | Reject + flag for investigation
Stage 4 Output

Signed GenAI Acceptance Criteria Document

  • Layer 1 (Functional), Layer 2 (Quality with tolerance bands), Layer 3 (Guardrails) all defined
  • Scoring rubric agreed and reviewer calibration plan confirmed
  • Test bank scope, ownership, and delivery timeline confirmed
  • Go-live threshold defined in writing and signed by client Product Owner

Stage 5: Cost Estimation, Prioritisation & Roadmap

Estimate token costs, score and prioritise GenAI features using Tkxel's extended scoring matrix, and produce a sequenced MVP roadmap. GenAI features require two additional scoring dimensions beyond the standard TXL criteria.

Step 5A — Token Cost Estimation

Token costs are usage-based and hard to predict precisely at discovery time. The goal is not precision — it is a defensible range that protects both Tkxel and the client from commercial surprises. Present the estimate as a low/high range with clearly stated assumptions.
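A minimal sketch of producing such a range, assuming placeholder per-1k-token prices and usage figures. None of the numbers below are real provider quotes; substitute the client's projected volumes and the chosen model's current pricing.

```python
# Placeholder economics for a low/high monthly token cost range.

def monthly_cost_range(requests_per_month, in_tokens, out_tokens,
                       price_in_per_1k, price_out_per_1k, variance=0.5):
    """Return (low, high) USD/month, widened by +/- `variance` to cover
    usage spikes and prompt growth."""
    per_request = (in_tokens / 1000) * price_in_per_1k \
                + (out_tokens / 1000) * price_out_per_1k
    base = per_request * requests_per_month
    return base * (1 - variance), base * (1 + variance)

# Assumed: 10k requests/month, ~2,000 input and ~500 output tokens each.
low, high = monthly_cost_range(10_000, 2_000, 500,
                               price_in_per_1k=0.003, price_out_per_1k=0.015)
# per request: 2 * 0.003 + 0.5 * 0.015 = $0.0135 -> base $135/month,
# presented to the client as roughly $67.50 to $202.50 per month
```

State the assumptions (request volume, average prompt and output length, model tier) alongside the range so the estimate remains defensible when usage changes.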

Worksheet 5A — Token Cost Estimation

Step 5B — Extended Feature Scoring Matrix

Score each GenAI feature across five dimensions. The first three are standard Tkxel criteria. The last two are GenAI-specific additions required for all features passing through this process.

Dimension | Score 1 — Low | Score 3 — Medium | Score 5 — High
T — Transformation Impact (business alignment & strategic impact) | Low strategic alignment, unclear value proposition | Moderate alignment, clear departmental value | Strong strategic fit, material revenue or efficiency impact
X — Experience Impact (user desirability) | Low user desirability, high anticipated change resistance | Users see value, moderate adoption expected | High user demand, strong pull, low resistance
L — Launch Feasibility (technical & operational readiness) | Data unavailable, governance unclear, high complexity | Data available, governance manageable, moderate complexity | Data ready, clear compliance path, low complexity
C — Cost Viability (GenAI-specific: token economics) | Token costs unacceptable at projected scale, or unknown | Costs within budget with usage controls in place | Low token cost per unit of value delivered, clear agreed cost model
R — Reusability (GenAI-specific: foundation for future features) | One-off feature, no reuse potential identified | Pattern partially reusable in 1–2 future features | Establishes a reusable foundation for multiple future features

Calculating the 2×2 position: Strategic Business Impact = mean of the T and X scores. Execution Fit = mean of the L and C scores. Plot each feature on the 2×2 matrix: high impact + high fit = Accelerate to MVP · high impact + low fit = Incubate · low impact + high fit = Quick Win · low impact + low fit = Shelve. The R score acts as a tiebreaker between features in the same position.
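The placement rule above can be expressed directly. The 3.0 midpoint threshold used below (scores run 1–5) is an illustrative assumption, not a mandated cut-off:

```python
# The 2x2 placement rule: impact = mean(T, X), fit = mean(L, C).

def quadrant(t, x, l, c, threshold=3.0):
    impact = (t + x) / 2   # Strategic Business Impact
    fit = (l + c) / 2      # Execution Fit
    if impact >= threshold and fit >= threshold:
        return "Accelerate to MVP"
    if impact >= threshold:
        return "Incubate"   # high impact, low fit
    if fit >= threshold:
        return "Quick Win"  # low impact, high fit
    return "Shelve"

assert quadrant(t=5, x=4, l=4, c=3) == "Accelerate to MVP"
assert quadrant(t=5, x=5, l=1, c=2) == "Incubate"
assert quadrant(t=1, x=2, l=5, c=5) == "Quick Win"
```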

Step 5C — HITL Build Effort Sizing

Using the HITL Map from Stage 3, estimate the engineering and UX effort required for human-AI interaction design. This contributes to sprint estimates alongside standard build complexity.

XS — Minimal HITL

Read-only display of AI output. No editing, no explicit feedback mechanism. Human uses or ignores output.

Approx. 0.5 sprint additional effort

S — Basic HITL

Editable output field. Accept or reject action. Simple thumbs up/down feedback. No complex override flows.

Approx. 1 sprint additional effort

M — Standard HITL

Multi-step review flow. Confidence indicators. Audit trail. Regenerate with parameters. Feedback routed to review team.

Approx. 2–3 sprints additional effort

L — Complex HITL

Role-based approval workflow. Full audit log. Admin review dashboard. Integration with existing approval systems.

Approx. 4+ sprints additional effort

Stage 5 Output

GenAI Feature Roadmap

  • Extended 5-dimension scores (T, X, L, C, R) for all GenAI features
  • 2×2 prioritisation matrix placement: Accelerate / Quick Win / Incubate / Shelve
  • Token cost estimates per feature with agreed payment model
  • HITL effort sizing (XS/S/M/L) per feature contributing to sprint estimate
  • Sequenced MVP roadmap with phasing and accountable owners per feature

Stage 6: Post-Launch Model (required for all GenAI features)

Define what happens after go-live — before go-live. GenAI features are never "done." Prompts drift, models update, usage patterns change, and quality degrades silently if not monitored. This stage defines the operational and commercial model for ongoing AI health.

Why this stage is mandatory

The majority of AI features that fail do not fail at launch — they fail 3–6 months later. The root causes are always the same: no measurement, no defined ownership of quality, and no budget for ongoing improvement. This stage prevents that outcome. If it is skipped at discovery, the conversation will happen as an emergency 6 months into production.

Step 6A — Feedback & Monitoring Design

📥 Implicit Feedback (collected automatically)

  • Accept / reject rate on AI-generated outputs
  • Edit distance — how much users change the output before using it
  • Regeneration rate — how often users ask for a new output
  • Time-to-accept — how long users spend reviewing before acting
  • Feature abandonment rate — users who start but don't complete

📣 Explicit Feedback (user-initiated)

  • Thumbs up / down on each output
  • Optional free-text "what was wrong?" field on rejections
  • Monthly in-product satisfaction pulse (CSAT or similar)
  • Ability to flag specific outputs as problematic
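The implicit signals can be derived from an interaction log. The event schema below (field names and action values) is a hypothetical sketch for illustration, not a prescribed logging format:

```python
# Deriving implicit feedback metrics from a per-output interaction log.

def implicit_metrics(events):
    """One event per AI output shown to a user, e.g.
    {"action": "accept" | "reject" | "abandon",
     "regenerations": int,       # times the user asked for a new output
     "edit_distance": float}     # fraction of the output changed before use
    """
    n = len(events)
    accepts = [e for e in events if e["action"] == "accept"]
    return {
        "accept_rate": len(accepts) / n,
        "regeneration_rate": sum(e["regenerations"] > 0 for e in events) / n,
        "avg_edit_distance": (sum(e["edit_distance"] for e in accepts)
                              / len(accepts)) if accepts else None,
        "abandonment_rate": sum(e["action"] == "abandon" for e in events) / n,
    }

log = [
    {"action": "accept", "regenerations": 0, "edit_distance": 0.1},
    {"action": "accept", "regenerations": 2, "edit_distance": 0.4},
    {"action": "reject", "regenerations": 1, "edit_distance": 0.0},
    {"action": "abandon", "regenerations": 0, "edit_distance": 0.0},
]
m = implicit_metrics(log)
assert m["accept_rate"] == 0.5 and m["abandonment_rate"] == 0.25
```

These aggregates are the inputs to the Stage 6B alert thresholds: a falling accept rate or rising edit distance is often the first visible sign of background quality drift.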
Step 6B — Alert Thresholds

Every GenAI feature must have at least three defined alert conditions before go-live. Use this format: "If [metric] for [feature] drops below / goes above [threshold], [named person] will take [specific action] within [timeframe]."
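The alert format maps naturally onto a small rule structure. A hypothetical sketch follows; the metrics, thresholds, owners, and actions shown are placeholders to be replaced with the feature's agreed values:

```python
# One rule per agreed alert condition: "If [metric] for [feature] drops
# below / goes above [threshold], [owner] takes [action] within [timeframe]".
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    feature: str
    threshold: float
    direction: str   # "below" or "above"
    owner: str
    action: str
    timeframe: str

    def fires(self, value: float) -> bool:
        return (value < self.threshold if self.direction == "below"
                else value > self.threshold)

rules = [
    AlertRule("accept_rate", "RFP generator", 0.70, "below",
              "AI Lead", "review last 50 outputs and adjust prompt", "48h"),
    AlertRule("avg_latency_s", "Support chatbot", 10.0, "above",
              "Platform Eng", "check provider status and fallback", "4h"),
]

current = {"accept_rate": 0.64, "avg_latency_s": 7.2}
fired = [r for r in rules if r.fires(current[r.metric])]
assert [r.metric for r in fired] == ["accept_rate"]  # only the first rule fires
```

Keeping the owner, action, and timeframe inside the rule itself means the alert definition doubles as the runbook entry.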

Worksheet 6 — Post-Launch AI Health Plan

Step 6C — Tuning Engagement Model

Prompt tuning and model monitoring after go-live are not bug fixes. They are a distinct, ongoing category of work requiring their own commercial scope. Clients must understand this before sign-off — post-launch tuning is how a GenAI feature continues to improve, and it must be funded deliberately.

Option A — Included in support retainer

Monthly AI health review and prompt updates included in the standard support retainer. Appropriate for low-complexity features with stable use cases and low feedback volume.

Best suited to: single feature, stable domain, monthly eval cadence is sufficient

Option B — Quarterly improvement sprint

Dedicated 1-sprint engagement per quarter: review feedback trends, update prompts, run regression, produce quality trend report. Scoped and contracted separately per quarter.

Best suited to: multiple features, evolving domain, active user feedback volume

Option C — Continuous AI operations

Dedicated AI engineer on monthly retainer with automated monitoring, proactive tuning, and monthly quality reports. For production features where output quality is business-critical.

Best suited to: high-stakes features, regulated environments, rapid user growth

Stage 6 Output

Post-Launch AI Health Plan

  • Feedback design: implicit signals defined + explicit feedback mechanism scoped
  • At least 3 alert conditions defined with named owners and response timeframes
  • Automated eval cadence and human review cadence agreed
  • Tuning engagement model selected (A/B/C) and commercially scoped before go-live
  • Client-side and Tkxel-side AI health owners named and confirmed

GenAI Feature Canvas — Primary Sign-off Artifact

One canvas per GenAI feature. This is the consolidated sign-off document for all GenAI features — replacing or extending the conventional feature spec. It captures outputs from all 6 stages on a single page. Both the client Product Owner and the Tkxel Delivery Lead must sign before development begins.

GenAI Feature Canvas — [Feature Name] · v1.0 · [Date]

Feature Identity

Feature name

Classification

Automate / Augment / Hybrid

Model tier & provider

HITL complexity

XS / S / M / L

Outcome Statement

This feature should improve [measurable metric] by [threshold] within [timeframe] for [user group]…

Expectation Statement

It will help by… / It will not… / Over time… / Users can improve it by…

Human-in-the-Loop Design

Zone 1: first use · Zone 2: during use · Zone 3: when wrong · Zone 4: over time. Named owner per zone.

Failure Mode Map

Top 3 visible failures + mitigation. Background error detection plan. Highest-stakes scenario + control.

Acceptance Criteria — Layer 1 (Functional)

Binary pass/fail ACs. Latency threshold. Fallback trigger. All deterministic conditions listed.

Acceptance Criteria — Layer 2 (Quality)

Tolerance bands: "[X]% of outputs score [Y] on [dimension] across [N] test bank prompts." Rubric attached.

Acceptance Criteria — Layer 3 (Guardrails — zero tolerance)

This feature must NEVER: [list all prohibited output categories]. Tested against full test bank at launch and monitored continuously.

Token Cost Estimate

Low: $__ / month · High: $__ / month
Payment model: client / Tkxel / shared

Feature Score (T / X / L / C / R)

T: __ · X: __ · L: __ · C: __ · R: __
Strategic Impact: __ · Execution Fit: __
Quadrant: Accelerate / QW / Incubate / Shelve

Prompt Design

Owner: __ · Version control: __ · Context method: __ · Change approval: __

Post-Launch Model

Eval cadence: __ · Alert threshold: __ · Tuning model: A / B / C
Client owner: __ · Tkxel owner: __

Sign-off — Required from all three parties before development begins

Client Product Owner

Name · Date

Client Technical Lead

Name · Date

Tkxel Delivery Lead

Name · Date