Atlas AI agent evaluations test how agents respond to specific prompts, like questions about equipment status or maintenance history, to help you identify where to improve agent performance.

How evaluations work

Evaluations use test cases to verify agent behavior. You create test cases by defining prompts and expected responses. For example, you can create a test case with the prompt, “What is the status of pump P-101?” and an expected response like “Pump P-101 is operational and running at 85% capacity with no active alerts.” The expected response specifies information like operational status, performance metrics, and alert status to define what a successful response includes.

This diagram shows the workflow for evaluating Atlas AI agents:

Test cases define the scenarios your agent should handle and the expected responses for each. When you run evaluations, you can view results to see how your agents perform when Atlas AI compares their responses to your expected responses. These results help you identify patterns, like which types of prompts your agents struggle to understand, so you can improve performance.
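
The prompt-and-expected-response pairing can be pictured as a small data structure plus a loop. The following is a minimal sketch, not the Atlas AI API: `EvalCase`, `run_evaluation`, and `stub_agent` are hypothetical names, and the exact-match comparison is only a stand-in for whatever comparison Atlas AI performs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical test case: one prompt and the response you expect."""
    prompt: str
    expected_response: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def run_evaluation(
    agent: Callable[[str], str], cases: list[EvalCase]
) -> list[tuple[EvalCase, str, bool]]:
    """Send each prompt to the agent and compare the reply with the expectation.

    Assumption: exact match after normalization. This is illustrative only
    and is not how Atlas AI scores responses.
    """
    results = []
    for case in cases:
        actual = agent(case.prompt)
        passed = normalize(actual) == normalize(case.expected_response)
        results.append((case, actual, passed))
    return results


def stub_agent(prompt: str) -> str:
    """Placeholder agent that returns a canned answer for one pump."""
    if "P-101" in prompt:
        return "Pump P-101 is operational and running at 85% capacity with no active alerts."
    return "I could not find that equipment."


cases = [
    EvalCase(
        prompt="What is the status of pump P-101?",
        expected_response="Pump P-101 is operational and running at 85% capacity with no active alerts.",
    ),
]
results = run_evaluation(stub_agent, cases)
for case, actual, passed in results:
    print("PASS" if passed else "FAIL", "-", case.prompt)
```

Passing the agent in as a plain callable keeps the sketch independent of any particular client library, so you can swap the stub for a real call if one is available to you.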

Why evaluations matter

When you build agents to solve business problems and automate workflows, evaluations help you verify that they provide the results you expect. You can run evaluations at different stages of your agent workflows to monitor performance:
  • Check for regression - Verify that agents still work after you update prompts, add tools, or change configuration.
  • Test before deployment - Verify agents work as expected before you publish them.
  • Record expected behavior - Use test cases to document how agents should respond.
  • Track performance over time - See how changes affect agent responses so you can identify where to improve.

Building effective test cases

The value of an evaluation depends on the quality of its test cases. Build your test case library before you start tuning the agent. This gives you a baseline to measure against so you can verify that a change improved the responses you intended to improve without affecting other question types. Follow these practices to build test cases that produce meaningful results.

Start with the questions users ask most

Identify the 10 to 20 prompts that represent your agent’s core use cases. Write specific expected responses for each one, including the data points, format, and level of detail you would consider a successful answer. Specific expectations make it easier to identify where the agent’s response does not meet the expected output.
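
One way to make an expected response specific is to list the data points it must contain and check a response against that list. This sketch assumes a simple case-insensitive phrase check; `missing_data_points` is a hypothetical helper, and the sample response is made up for illustration.

```python
def missing_data_points(response: str, required: list[str]) -> list[str]:
    """Return the required phrases that do not appear in the response.

    Assumption: a case-insensitive substring check is good enough for a
    first pass. It will miss paraphrases such as "85 percent".
    """
    lowered = response.lower()
    return [point for point in required if point.lower() not in lowered]


# Data points taken from the expected response for pump P-101.
required_points = ["operational", "85% capacity", "no active alerts"]

# Hypothetical agent response that omits the performance metric.
response = "Pump P-101 is operational with no active alerts."

gaps = missing_data_points(response, required_points)
print(gaps)  # ['85% capacity']
```

A list of missing data points tells you exactly which part of the answer fell short, which is harder to see from a single pass or fail result.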

Cover different question types

Include simple lookups (“What is the status of Pump P-101?”), multi-step queries (“What maintenance was done on Compressor C-205 in the last 6 months and are there any related open work orders?”), and edge cases such as equipment that does not exist or questions outside the agent’s defined scope.
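
If you tag each test case with its question type, a short script can flag types your library does not cover yet. This is a sketch under assumptions: the category names and the dictionary layout are hypothetical and are not an Atlas AI format.

```python
from collections import Counter

# Hypothetical category labels for the question types described above.
REQUIRED_CATEGORIES = {"simple_lookup", "multi_step", "edge_case"}

test_library = [
    {
        "category": "simple_lookup",
        "prompt": "What is the status of Pump P-101?",
    },
    {
        "category": "multi_step",
        "prompt": (
            "What maintenance was done on Compressor C-205 in the last 6 months "
            "and are there any related open work orders?"
        ),
    },
]


def coverage_report(library: list[dict[str, str]]) -> tuple[Counter, list[str]]:
    """Count test cases per category and list required categories with none."""
    counts = Counter(case["category"] for case in library)
    missing = sorted(REQUIRED_CATEGORIES - set(counts))
    return counts, missing


counts, missing = coverage_report(test_library)
print(dict(counts))
print("Missing categories:", missing)  # Missing categories: ['edge_case']
```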

Re-run after every configuration change

When you update skills, modify instructions, change tool configurations, or upgrade the language model, re-run your test cases before publishing. This confirms that changes work as intended and do not affect other question types.
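
A regression check comes down to comparing the results of two runs over the same test cases. The sketch below assumes you have recorded each run as a mapping from prompt to pass or fail; the result values are invented for illustration, and `find_regressions` and `find_fixes` are hypothetical helpers.

```python
STATUS_PROMPT = "What is the status of Pump P-101?"
MAINT_PROMPT = (
    "What maintenance was done on Compressor C-205 in the last 6 months "
    "and are there any related open work orders?"
)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Prompts that passed in the baseline run but fail, or are absent, in the candidate run."""
    return sorted(
        prompt
        for prompt, passed in baseline.items()
        if passed and not candidate.get(prompt, False)
    )


def find_fixes(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Prompts that failed in the baseline run and pass in the candidate run."""
    return sorted(
        prompt
        for prompt, passed in baseline.items()
        if not passed and candidate.get(prompt, False)
    )


# Invented results: before and after a configuration change.
baseline_run = {STATUS_PROMPT: True, MAINT_PROMPT: False}
candidate_run = {STATUS_PROMPT: False, MAINT_PROMPT: True}

regressions = find_regressions(baseline_run, candidate_run)
fixes = find_fixes(baseline_run, candidate_run)
print("Regressions:", regressions)
print("Fixes:", fixes)
```

In this made-up run, the change fixed the multi-step query but broke the simple lookup, which is the kind of side effect a re-run is meant to catch before you publish.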

Use results to guide iteration

When an agent’s response diverges from the expected output, use the gap to identify what to fix. Unexpected responses often point to missing vocabulary mappings, ambiguous instructions, or skill trigger conditions that need to be more specific.
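
To decide where to look first, it can help to group results by question type and rank the types by pass rate. This is a minimal sketch that assumes you keep results as (category, passed) pairs; the category names and sample results are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_category(results: list[tuple[str, bool]]) -> list[tuple[str, float]]:
    """Return (category, pass rate) pairs, weakest category first."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    rates = {category: passes[category] / totals[category] for category in totals}
    return sorted(rates.items(), key=lambda item: item[1])


# Invented results for illustration.
sample_results = [
    ("simple_lookup", True),
    ("simple_lookup", True),
    ("multi_step", False),
    ("multi_step", True),
    ("edge_case", False),
]

ranked = pass_rate_by_category(sample_results)
for category, rate in ranked:
    print(f"{category}: {rate:.0%}")
```

The categories at the top of the list are the ones where reviewing vocabulary mappings, instructions, or skill trigger conditions is most likely to pay off.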
Last modified on June 18, 2026