RAG Evaluation Set Template

RAG evaluation set template for private document search.

A useful RAG evaluation set is not just a list of questions. It is a structured review table that names the expected source, answer shape, access rules, refusal cases, and reviewer decision criteria for each example.

This template is for operations, healthcare, finance, support, legal, and internal tools teams preparing RAG over private documents.

Each test case should include the question, expected source, answer shape, access rule, and acceptance criteria.

Include positive answers, stale-source traps, missing-context cases, permission failures, and escalation examples.

Reviewer notes should state whether the fix belongs in retrieval, chunking, prompts, permissions, or the workflow itself.

Template fields to include

Use one row per evaluation case. Include the user question, user role, expected source document, expected section, acceptable answer shape, required citation, access rule, and expected behavior.

Add columns for retrieval result, answer quality, citation accuracy, refusal behavior, escalation behavior, reviewer notes, and the release decision.
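As a rough illustration, the fields above could be captured as one record per row and exported to a spreadsheet-friendly CSV. This is a minimal sketch, not a prescribed schema: the class name `EvalCase`, the field names, the helper `to_csv`, and the example values are all hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalCase:
    # Case definition, written before the system is run.
    question: str
    user_role: str
    expected_source: str
    expected_section: str
    answer_shape: str
    required_citation: str
    access_rule: str
    expected_behavior: str
    # Review columns, filled in by the reviewer after the run.
    retrieval_result: str = ""
    answer_quality: str = ""
    citation_accuracy: str = ""
    refusal_behavior: str = ""
    escalation_behavior: str = ""
    reviewer_notes: str = ""
    release_decision: str = ""


def to_csv(cases):
    """Render cases as CSV text with one row per evaluation case."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalCase)])
    writer.writeheader()
    for case in cases:
        writer.writerow(asdict(case))
    return buffer.getvalue()


# Hypothetical example row; every value here is invented for illustration.
example = EvalCase(
    question="How many days of parental leave do full-time staff get?",
    user_role="employee",
    expected_source="HR Leave Policy",
    expected_section="Parental leave",
    answer_shape="one sentence with a number of days",
    required_citation="document title and section",
    access_rule="all employees may read",
    expected_behavior="answer",
)

print(to_csv([example]))
```

Keeping the review columns empty by default lets the same file serve as both the test definition and the reviewer's worksheet.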

Positive and negative cases

A healthy set includes questions the system should answer and questions it should not answer. Positive cases test whether the right source is found. Negative cases test missing context, stale documents, restricted records, and conflicting policies.

For private document search, refusal and escalation examples are as important as successful answers because employees may act on the output.
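One way to keep the set balanced is to tag each case with a type and check that no type is absent. The sketch below assumes a hypothetical tag vocabulary derived from the case kinds named above. The tag strings and the function name `missing_case_types` are illustrative only.

```python
# Hypothetical tag vocabulary: one positive type plus the negative types
# described in the text.
REQUIRED_CASE_TYPES = {
    "positive",
    "missing_context",
    "stale_source",
    "restricted_record",
    "conflicting_policy",
}


def missing_case_types(case_types):
    """Return the required case types that do not appear in the set."""
    return REQUIRED_CASE_TYPES - set(case_types)


# A set containing only answerable questions and stale-source traps
# still lacks three kinds of negative case.
gaps = missing_case_types(["positive", "positive", "stale_source"])
print(sorted(gaps))
```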

Permission and source ownership

Every case should name the user role and whether that role is allowed to retrieve the source. If the answer would expose restricted content, the expected behavior should be refusal or escalation without leaking source details.
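The role check can be sketched as a small function that derives the expected behavior from the user role and the roles allowed to read the source. The names `expected_behavior` and `leaks_source_details` are hypothetical, and the leak check is a deliberately naive substring match that a reviewer would still need to confirm by reading the response.

```python
def expected_behavior(user_role, allowed_roles, on_denied="refuse"):
    """Return 'answer' if the role may read the source, else the denial behavior.

    on_denied is 'refuse' or 'escalate', chosen per case by the source owner.
    """
    if on_denied not in ("refuse", "escalate"):
        raise ValueError("on_denied must be 'refuse' or 'escalate'")
    return "answer" if user_role in allowed_roles else on_denied


def leaks_source_details(response_text, restricted_terms):
    """Naive check: does a refusal mention any restricted title or phrase?"""
    lowered = response_text.lower()
    return any(term.lower() in lowered for term in restricted_terms)


# Hypothetical roles and document title, for illustration only.
print(expected_behavior("contractor", {"hr_manager", "finance_lead"}))
print(leaks_source_details(
    "I can't share the Executive Compensation Plan.",
    ["Executive Compensation Plan"],
))
```

In this sketch a refusal that names the restricted document counts as a leak, even though no content from the document was quoted.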

Source ownership should be explicit. A policy owner, analyst, clinical operator, or department lead should be able to update the expected answer when the underlying document changes.

Review cadence and release criteria

Review failures weekly during the pilot. Separate retrieval misses, citation mistakes, permission issues, weak answers, and questions that reveal a workflow gap.
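For the weekly review, a simple tally per failure category is often enough to show where the pilot is struggling. The sketch below assumes reviewers label each failed case with exactly one category. The label strings and the function name `tally_failures` are hypothetical.

```python
from collections import Counter

# Hypothetical labels for the five failure categories named above.
FAILURE_CATEGORIES = (
    "retrieval_miss",
    "citation_mistake",
    "permission_issue",
    "weak_answer",
    "workflow_gap",
)


def tally_failures(labels):
    """Count failures per category, rejecting labels outside the agreed list."""
    counts = Counter(labels)
    unknown = set(counts) - set(FAILURE_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown failure categories: {sorted(unknown)}")
    # Report every category, including those with zero failures this week.
    return {category: counts.get(category, 0) for category in FAILURE_CATEGORIES}


week_one = ["retrieval_miss", "retrieval_miss", "weak_answer", "workflow_gap"]
print(tally_failures(week_one))
```

Rejecting unknown labels keeps the categories separate over time instead of letting ad hoc labels accumulate.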

Before rollout, define the minimum acceptable score for source retrieval, citation accuracy, refusal behavior, and reviewer acceptance. Set each threshold according to the risk of the workflow, with stricter minimums where a wrong answer is more costly.
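A release gate along these lines can be sketched as a comparison of measured scores against agreed minimums. The function name `release_gate`, the metric keys, and every number below are hypothetical. Actual thresholds are for the team to set.

```python
def release_gate(scores, thresholds):
    """Compare measured scores with minimum thresholds.

    Returns (passed, failing), where failing maps each metric below its
    minimum to a (score, minimum) pair. A missing score counts as 0.0.
    """
    failing = {}
    for metric, minimum in thresholds.items():
        score = scores.get(metric, 0.0)
        if score < minimum:
            failing[metric] = (score, minimum)
    return (not failing, failing)


# Illustrative numbers only, expressed as fractions of cases passed.
thresholds = {
    "source_retrieval": 0.90,
    "citation_accuracy": 0.95,
    "refusal_behavior": 1.00,
    "reviewer_acceptance": 0.80,
}
scores = {
    "source_retrieval": 0.92,
    "citation_accuracy": 0.88,
    "refusal_behavior": 1.00,
    "reviewer_acceptance": 0.85,
}

passed, failing = release_gate(scores, thresholds)
print(passed, failing)
```

Treating a missing score as zero means a metric that was never measured blocks the release rather than passing silently.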

Moonveil AI Inc.
