Separate answer quality from workflow quality
A final answer can look polished even when the agent skipped a required source, used the wrong tool, or answered a case it should have escalated. Evaluation should therefore score the intermediate steps as well as the final output.
For business workflows, the evaluation set should include successful cases, missing-source cases, permission-boundary cases, and examples where the correct behavior is to refuse or route to a human.
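As a rough sketch of how such an evaluation set might be organized, the snippet below tags each case with a type and an expected behavior, then reports which case types have no examples yet. The names `CaseType`, `Behavior`, `EvalCase`, and `coverage_gaps`, along with the sample prompts, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CaseType(Enum):
    SUCCESS = "success"
    MISSING_SOURCE = "missing_source"
    PERMISSION_BOUNDARY = "permission_boundary"
    REFUSE_OR_ROUTE = "refuse_or_route"


class Behavior(Enum):
    ANSWER = "answer"
    REFUSE = "refuse"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    case_type: CaseType
    prompt: str
    expected_behavior: Behavior


def coverage_gaps(cases):
    """Return the case types that have no example in the evaluation set."""
    present = {case.case_type for case in cases}
    return [case_type for case_type in CaseType if case_type not in present]


# Hypothetical cases; a real set would be drawn from the workflow itself.
cases = [
    EvalCase("c1", CaseType.SUCCESS,
             "Summarize an invoice against its purchase order.", Behavior.ANSWER),
    EvalCase("c2", CaseType.MISSING_SOURCE,
             "Summarize an invoice whose purchase order is not on file.", Behavior.ESCALATE),
    EvalCase("c3", CaseType.PERMISSION_BOUNDARY,
             "Approve a payment on the requester's behalf.", Behavior.REFUSE),
]

gaps = coverage_gaps(cases)
print([gap.value for gap in gaps])
```

Keeping the expected behavior explicit on each case means a refusal or escalation can be scored as correct rather than as a failed answer.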
Score source use and tool behavior
Track whether the agent retrieved the right documents, cited the right source, called approved tools in the right order, and avoided actions outside the pilot scope.
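One way to turn these checks into scores is sketched below, assuming the agent's run is available as a trace of retrieved documents, cited documents, and tool calls. The `Trace` and `Expectation` structures, the tool names, and the document names are assumptions made for this example, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    retrieved_docs: tuple
    cited_docs: tuple
    tool_calls: tuple


@dataclass(frozen=True)
class Expectation:
    required_docs: frozenset
    approved_tools: frozenset
    expected_order: tuple


def is_subsequence(expected, actual):
    """True if every item of `expected` appears in `actual` in the same order."""
    remaining = iter(actual)
    return all(step in remaining for step in expected)


def score_trace(trace, exp):
    """Return one boolean per workflow check, independent of answer quality."""
    retrieved = set(trace.retrieved_docs)
    cited = set(trace.cited_docs)
    return {
        "retrieved_required": exp.required_docs <= retrieved,
        "cited_required": exp.required_docs <= cited,
        "cites_only_retrieved": cited <= retrieved,
        "tools_in_scope": all(t in exp.approved_tools for t in trace.tool_calls),
        "tools_in_order": is_subsequence(exp.expected_order, trace.tool_calls),
    }


# Hypothetical run: the right document was used, but an out-of-scope tool was called.
trace = Trace(
    retrieved_docs=("policy.pdf", "invoice.pdf"),
    cited_docs=("policy.pdf",),
    tool_calls=("search_docs", "send_email"),
)
exp = Expectation(
    required_docs=frozenset({"policy.pdf"}),
    approved_tools=frozenset({"search_docs", "draft_summary"}),
    expected_order=("search_docs", "draft_summary"),
)

scores = score_trace(trace, exp)
print(scores)
```

Reporting each check separately, rather than as one blended number, makes it easier to see whether a failure came from retrieval, citation, or tool use.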
For finance, healthcare, and operations work, a useful agent log should make it possible to inspect what the agent saw, what it produced, what it changed, and what a reviewer edited.
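A log record along these lines could be sketched as follows, with one field for each of the four questions. The `AgentLogRecord` name and its fields are hypothetical; a real deployment would likely add timestamps, model versions, and access controls on top.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentLogRecord:
    case_id: str
    inputs_seen: list    # what the agent saw (document IDs, tool results)
    output: dict         # what it produced
    changes_made: list   # what it changed in downstream systems
    reviewer_edits: dict = field(default_factory=dict)  # what a reviewer edited


def to_json_line(record):
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical record for illustration only.
record = AgentLogRecord(
    case_id="c1",
    inputs_seen=["policy.pdf", "invoice.pdf"],
    output={"summary": "Invoice matches the purchase order.", "amount": "100.00"},
    changes_made=[],
    reviewer_edits={"amount": {"before": "100.00", "after": "110.00"}},
)

line = to_json_line(record)
print(line)
```

Storing the reviewer's before and after values next to the agent's output lets later analysis compare the two without joining separate systems.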
Measure human review and business outcomes
Reviewer edits are one of the most useful early signals. Track which fields were changed, which outputs were rejected, which cases were escalated, and where the agent created more work than it saved.
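A minimal sketch of aggregating these reviewer signals, assuming each review is recorded as a dictionary with a `status` and a list of `edited_fields`. That record shape, the status values, and the field names are assumptions for this example.

```python
from collections import Counter


def summarize_reviews(reviews):
    """Count edits per field and compute acceptance, rejection, and escalation rates."""
    total = len(reviews)
    statuses = Counter(review["status"] for review in reviews)
    field_edits = Counter()
    for review in reviews:
        field_edits.update(review.get("edited_fields", []))

    def rate(status):
        # Guard against an empty review set.
        return statuses[status] / total if total else 0.0

    return {
        "field_edits": dict(field_edits),
        "acceptance_rate": rate("accepted"),
        "rejection_rate": rate("rejected"),
        "escalation_rate": rate("escalated"),
    }


# Hypothetical review records.
reviews = [
    {"status": "accepted", "edited_fields": []},
    {"status": "accepted", "edited_fields": ["amount", "due_date"]},
    {"status": "rejected", "edited_fields": ["amount"]},
    {"status": "escalated"},
]

summary = summarize_reviews(reviews)
print(summary)
```

A field that reviewers change repeatedly, such as `amount` in this sample, is a natural place to look first when deciding what to fix.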
The business metric should match the workflow: cycle time, queue aging, false positives, missed signals, acceptance rate, rework, or staff time saved.
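For two of these metrics, cycle time and queue aging, a simple calculation from open and close timestamps might look like the sketch below. The item shape (`opened_at`, `closed_at`) and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime
from statistics import median


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


def cycle_time_hours(items):
    """Hours from open to close for every closed item."""
    return [hours(i["closed_at"] - i["opened_at"]) for i in items if i["closed_at"]]


def queue_age_hours(items, now):
    """Hours each still-open item has been waiting as of `now`."""
    return [hours(now - i["opened_at"]) for i in items if not i["closed_at"]]


# Hypothetical work items.
now = datetime(2024, 1, 10, 12, 0)
items = [
    {"opened_at": datetime(2024, 1, 8, 9, 0), "closed_at": datetime(2024, 1, 8, 13, 0)},
    {"opened_at": datetime(2024, 1, 9, 9, 0), "closed_at": datetime(2024, 1, 9, 11, 0)},
    {"opened_at": datetime(2024, 1, 9, 10, 0), "closed_at": datetime(2024, 1, 9, 19, 0)},
    {"opened_at": datetime(2024, 1, 10, 6, 0), "closed_at": None},
]

median_cycle = median(cycle_time_hours(items))
oldest_open = max(queue_age_hours(items, now))
print(median_cycle, oldest_open)
```

Computing the same figures for a baseline period before the pilot gives the comparison that makes the metric meaningful.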