Build an evaluation set from real questions
Start with questions employees already ask in Slack, support queues, onboarding, compliance reviews, or analyst workflows. Include easy lookups, multi-document questions, ambiguous requests, and questions the system should refuse.
For every question, record the expected source document, the expected answer shape (for example, a single value, a short list, or a refusal), and the reason a reviewer would accept or reject the answer.
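One way to keep those fields together is a small record type per question. The sketch below is illustrative only: the class name `EvalCase`, the field names, the question categories, and the document IDs are all assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    # Categories mirror the mix of questions described above.
    EASY_LOOKUP = "easy_lookup"
    MULTI_DOCUMENT = "multi_document"
    AMBIGUOUS = "ambiguous"
    SHOULD_REFUSE = "should_refuse"


@dataclass(frozen=True)
class EvalCase:
    question: str
    question_type: QuestionType
    expected_sources: tuple[str, ...]  # hypothetical document IDs
    expected_answer_shape: str         # e.g. "single value", "short list", "refusal"
    acceptance_rationale: str          # why a reviewer would accept or reject


def validate(case: EvalCase) -> list[str]:
    """Return a list of problems; an empty list means the case is usable."""
    problems = []
    if not case.question.strip():
        problems.append("question is empty")
    # Refusal cases may legitimately have no source; every other case needs one.
    if case.question_type is not QuestionType.SHOULD_REFUSE and not case.expected_sources:
        problems.append("answerable question has no expected source")
    if not case.acceptance_rationale.strip():
        problems.append("missing acceptance rationale")
    return problems


# Made-up examples for illustration; real cases would come from actual employee questions.
example_cases = [
    EvalCase(
        question="How many days of parental leave do full-time employees get?",
        question_type=QuestionType.EASY_LOOKUP,
        expected_sources=("hr-leave-policy",),
        expected_answer_shape="single value",
        acceptance_rationale="Accept only if the number matches the current policy.",
    ),
    EvalCase(
        question="What is my manager's salary?",
        question_type=QuestionType.SHOULD_REFUSE,
        expected_sources=(),
        expected_answer_shape="refusal",
        acceptance_rationale="Accept only if the assistant declines and names an owner.",
    ),
]

for case in example_cases:
    print(case.question_type.value, validate(case))
```

Validating cases as they are added keeps reviewers from discovering, mid-evaluation, that a question has no agreed source or acceptance criterion.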
Measure retrieval before generation
If the retriever misses the right source, the generated answer cannot be trusted, no matter how fluent it reads. Evaluate whether the retriever finds the correct document, section, table, or record before judging the generated response.
Track failure modes such as outdated documents, near-duplicate policies, access-restricted sources, and chunks that split important context.
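As a rough sketch of what measuring retrieval on its own can look like, the code below computes recall at k for each question and tallies a reviewer-assigned failure label for the misses. The document IDs, the label strings, and the shape of the `runs` records are assumptions made for this example; recall at k is one common retrieval metric, not necessarily the one a given team should adopt.

```python
from collections import Counter


def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected sources that appear in the top k retrieved results."""
    if not expected:
        raise ValueError("expected must contain at least one source")
    top_k = set(retrieved[:k])
    return len(top_k & expected) / len(expected)


# Hypothetical retrieval runs: ranked document IDs, the expected sources,
# and a failure label a reviewer assigned after inspecting a miss.
runs = [
    {
        "retrieved": ["hr-policy-2024", "hr-policy-2021", "it-faq"],
        "expected": {"hr-policy-2024"},
        "failure_mode": None,
    },
    {
        "retrieved": ["travel-policy-v1", "it-faq", "travel-policy-v2"],
        "expected": {"travel-policy-v2"},
        "failure_mode": "near_duplicate_policy",
    },
    {
        "retrieved": ["it-faq", "hr-policy-2021"],
        "expected": {"finance-close-checklist", "finance-calendar"},
        "failure_mode": "access_restricted_source",
    },
]

K = 2
scores = [recall_at_k(r["retrieved"], r["expected"], K) for r in runs]
mean_recall = sum(scores) / len(scores)

# Count failure labels only for runs that missed at least one expected source.
failure_counts = Counter(
    r["failure_mode"]
    for r, score in zip(runs, scores)
    if score < 1.0 and r["failure_mode"] is not None
)

print(f"mean recall@{K}: {mean_recall:.2f}")
print(dict(failure_counts))
```

Reporting the failure labels next to the score shows whether a low number comes from stale content, duplicates, permissions, or chunking, which usually call for different fixes.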
Make uncertainty visible
A useful internal knowledge assistant should say when it cannot answer from approved sources. When it does answer, it should cite its sources; when it cannot, it should explain what context is missing and route the user to the next best source or owner.
This is especially important for healthcare, finance, legal, and operational workflows where users may act on the answer.
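The behavior described above can be sketched as a small decision step that sits between retrieval and the reply shown to the user. Everything here is an assumption for illustration: the `Passage` and `AssistantReply` types, the `min_score` threshold of 0.5, and the `knowledge-base-owners` fallback are hypothetical, and a real system would tune the threshold against its own evaluation set.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Passage:
    doc_id: str
    score: float    # retriever relevance score; the scale is an assumption
    approved: bool  # whether the source is on the approved list


@dataclass
class AssistantReply:
    answered: bool
    message: str
    citations: list[str] = field(default_factory=list)
    route_to: Optional[str] = None


def build_reply(
    draft_answer: str,
    passages: list[Passage],
    min_score: float = 0.5,                        # hypothetical threshold
    fallback_owner: str = "knowledge-base-owners",  # hypothetical owner or channel
) -> AssistantReply:
    """Answer with citations only when approved, sufficiently relevant sources exist."""
    usable = [p for p in passages if p.approved and p.score >= min_score]
    if not usable:
        # Abstain: say what is missing and point the user somewhere useful.
        return AssistantReply(
            answered=False,
            message="I could not find an approved source that answers this question.",
            route_to=fallback_owner,
        )
    return AssistantReply(
        answered=True,
        message=draft_answer,
        citations=[p.doc_id for p in usable],
    )


supported = build_reply(
    "Expense reports are due within 30 days.",
    [Passage("expense-policy", 0.82, True), Passage("old-wiki-page", 0.90, False)],
)
unsupported = build_reply(
    "Expense reports are due within 30 days.",
    [Passage("old-wiki-page", 0.90, False)],
)
print(supported.citations, unsupported.route_to)
```

Keeping the abstain decision outside the generation step makes it easy to test on its own, including the refusal cases in the evaluation set.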