Template fields to include
Use one row per evaluation case. Include the user question, user role, expected source document, expected section, acceptable answer shape, required citation, access rule, and expected behavior (answer, refuse, or escalate).
Add columns that reviewers fill in after each run: retrieval result, answer quality, citation accuracy, refusal behavior, escalation behavior, reviewer notes, and the release decision.
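As a rough illustration, the template could be captured as a record type with one attribute per column. This is a minimal sketch, not a prescribed schema: the class name EvalCase, the snake_case field names, and the to_csv helper are assumptions made for the example.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalCase:
    # Columns defined when the case is written (hypothetical names).
    user_question: str
    user_role: str
    expected_source_document: str
    expected_section: str
    acceptable_answer_shape: str
    required_citation: str
    access_rule: str
    expected_behavior: str  # for example "answer", "refuse", or "escalate"
    # Columns filled in by the reviewer after a run; empty until then.
    retrieval_result: str = ""
    answer_quality: str = ""
    citation_accuracy: str = ""
    refusal_behavior: str = ""
    escalation_behavior: str = ""
    reviewer_notes: str = ""
    release_decision: str = ""


def to_csv(cases):
    """Serialize cases to CSV text, one row per evaluation case."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalCase)])
    writer.writeheader()
    for case in cases:
        writer.writerow(asdict(case))
    return buffer.getvalue()
```

Keeping the reviewer columns optional lets the same row be created before a run and completed afterward.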
Positive and negative cases
A balanced evaluation set includes both questions the system should answer and questions it should not. Positive cases test whether the right source is found. Negative cases cover missing context, stale documents, restricted records, and conflicting policies.
For private document search, refusal and escalation examples are as important as successful answers because employees may act on the output.
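One way to confirm that a set covers both kinds of case is a small coverage check. The sketch below assumes each case carries a hypothetical "kind" label; the label strings are illustrative and simply mirror the negative categories named above.

```python
from collections import Counter

# Hypothetical labels: one for positive cases, one per negative category.
REQUIRED_KINDS = {
    "positive",
    "missing_context",
    "stale_document",
    "restricted_record",
    "conflicting_policy",
}


def coverage_gaps(cases):
    """Return the required case kinds that have no case in the set, sorted."""
    seen = Counter(case["kind"] for case in cases)
    return sorted(kind for kind in REQUIRED_KINDS if seen[kind] == 0)
```

An empty result means every category has at least one case. It says nothing about whether the cases within a category are good ones.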
Permission and source ownership
Every case should name the user role and whether that role is allowed to retrieve the source. If the answer would expose restricted content, the expected behavior should be refusal or escalation without leaking source details.
Source ownership should be explicit. A policy owner, analyst, clinical operator, or department lead should be able to update the expected answer when the underlying document changes.
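The permission rule and the ownership requirement can both be recorded on the source itself. The following is a minimal sketch under assumed names (SourceRecord, expected_behavior, leaks_source_details). The leak check is a deliberately crude heuristic that only looks for the document title in the response text, so it would miss paraphrased leaks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    title: str
    owner: str  # who updates the expected answer when the document changes
    allowed_roles: frozenset  # roles permitted to retrieve this source


def expected_behavior(user_role, source):
    """Derive the expected behavior for a case from the access rule."""
    if user_role in source.allowed_roles:
        return "answer"
    return "refuse_or_escalate"


def leaks_source_details(response_text, source):
    """Crude heuristic: flag a response that names the restricted document."""
    return source.title.lower() in response_text.lower()
```

A usage example with made-up values:

```python
policy = SourceRecord(
    title="Severance Policy",
    owner="hr-policy-owner",
    allowed_roles=frozenset({"hr_manager"}),
)
print(expected_behavior("contractor", policy))
print(leaks_source_details("Please contact HR for help with this request.", policy))
```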
Review cadence and release criteria
Review failures weekly during the pilot. Sort each failure into one category: retrieval miss, citation mistake, permission issue, weak answer, or a question that reveals a workflow gap.
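The weekly sort can be supported by a simple tally per category. This sketch assumes each logged failure is a dictionary with a hypothetical "category" key; the category strings are illustrative names for the five groups above.

```python
from collections import Counter

# Hypothetical category labels, in the order they are reported.
FAILURE_CATEGORIES = (
    "retrieval_miss",
    "citation_mistake",
    "permission_issue",
    "weak_answer",
    "workflow_gap",
)


def weekly_failure_summary(failures):
    """Count failures per category, rejecting labels outside the agreed list."""
    counts = Counter()
    for failure in failures:
        category = failure["category"]
        if category not in FAILURE_CATEGORIES:
            raise ValueError(f"unknown failure category: {category!r}")
        counts[category] += 1
    # Report every category, including those with zero failures.
    return {category: counts[category] for category in FAILURE_CATEGORIES}
```

Rejecting unknown labels keeps reviewers from inventing ad hoc categories that would blur the week-over-week comparison.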
Before rollout, define the minimum acceptable score for source retrieval, citation accuracy, refusal behavior, and reviewer acceptance. Set each threshold according to the risk of the workflow: the more costly a wrong answer, the higher the bar.
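A release gate along these lines could compare measured scores with the agreed minimums. The function name release_gate and the metric keys are assumptions for illustration. The numbers in the usage example are placeholders, not recommended thresholds.

```python
def release_gate(scores, thresholds):
    """Return (passed, failing), where failing maps metric -> (score, minimum).

    A metric that is missing from scores is treated as 0.0, so it fails
    any positive threshold instead of being silently skipped.
    """
    failing = {}
    for metric, minimum in thresholds.items():
        score = scores.get(metric, 0.0)
        if score < minimum:
            failing[metric] = (score, minimum)
    return (not failing, failing)
```

Usage with placeholder values:

```python
thresholds = {
    "source_retrieval": 0.90,
    "citation_accuracy": 0.95,
    "refusal_behavior": 0.99,
    "reviewer_acceptance": 0.85,
}
scores = {
    "source_retrieval": 0.93,
    "citation_accuracy": 0.91,
    "refusal_behavior": 0.99,
    "reviewer_acceptance": 0.88,
}
passed, failing = release_gate(scores, thresholds)
print(passed, failing)
```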