About AI agent evaluations

Atlas AI agent evaluations test how agents respond to specific prompts, like questions about equipment status or maintenance history, to help you identify where to improve agent performance.

How evaluations work

Evaluations use test cases to verify agent behavior. You create a test case by defining a prompt and an expected response. For example, you can create a test case with the prompt “What is the status of pump P-101?” and an expected response like “Pump P-101 is operational and running at 85% capacity with no active alerts.” The expected response specifies information like operational status, performance metrics, and alert status to define what a successful response includes.

The following diagram shows the workflow for evaluating Atlas AI agents:

Test cases are examples of scenarios that you want your agents to know how to respond to. When you run an evaluation, Atlas AI compares your agents' responses to your expected responses, and you can view the results to see how your agents perform. These results help you identify patterns, like which types of prompts your agents struggle to understand, so you can improve performance.
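As a rough illustration of the test case idea, the following Python sketch models a test case as a prompt paired with an expected response, then checks whether an agent response contains the key facts from the expected response. The names `EvalCase`, `required_facts`, and `evaluate` are hypothetical and are not part of Atlas AI. How Atlas AI actually compares responses is not described here, so the substring check is only a stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """Hypothetical test case: a prompt plus what a good response should contain."""
    prompt: str
    expected_response: str
    # Assumption: key facts picked out of the expected response by hand.
    required_facts: list[str] = field(default_factory=list)


def evaluate(case: EvalCase, agent_response: str) -> dict:
    """Naive stand-in comparison: case-insensitive substring check per fact."""
    text = agent_response.lower()
    missing = [fact for fact in case.required_facts if fact.lower() not in text]
    return {"prompt": case.prompt, "passed": not missing, "missing": missing}


case = EvalCase(
    prompt="What is the status of pump P-101?",
    expected_response=(
        "Pump P-101 is operational and running at 85% capacity "
        "with no active alerts."
    ),
    required_facts=["operational", "85%", "no active alerts"],
)

result = evaluate(
    case, "P-101 is operational at 85% capacity. There are no active alerts."
)
print(result["passed"], result["missing"])
```

A response that omits one of the required facts would come back with `passed` set to `False` and the omitted facts listed under `missing`, which mirrors the idea that the expected response defines what a successful response includes.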

Why evaluations matter

When you build agents to solve business problems and automate workflows, evaluations help you verify that they provide the results that you expect. You can run evaluations at different stages of your agent workflows to monitor performance:

Check for regression - Verify that agents still work after you update prompts, add tools, or change configuration.
Test before deployment - Verify that agents work as expected before you make them available to users.
Record expected behavior - Use test cases to document how agents should respond.
Track performance over time - Track how changes affect agent responses to identify where to improve.
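One way to picture the regression check and performance tracking described above is to compare pass/fail results from two evaluation runs. The sketch below assumes each run is available as a mapping from a test case ID to a pass/fail flag. The function names, the test case IDs, and the result values are made up for illustration and do not reflect an Atlas AI export format.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of test cases that passed in the baseline run but not in the candidate run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        # A case missing from the candidate run is treated as not passing.
        if passed and not candidate.get(case_id, False)
    )


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of test cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


# Illustrative results before and after an agent prompt update (made-up data).
baseline = {"pump-status": True, "maintenance-history": True, "missing-asset": False}
candidate = {"pump-status": True, "maintenance-history": False, "missing-asset": True}

regressions = find_regressions(baseline, candidate)
print(regressions)
print(pass_rate(baseline), pass_rate(candidate))
```

In this made-up example the overall pass rate is unchanged between runs even though one test case regressed, which is why it can help to look at per-test-case results and not only at an aggregate score.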

When you run evaluations at these stages, you can test different areas of performance to understand where your agents can improve:

Response accuracy - Test if agents provide the information you expect when prompted about industrial data.
Consistency - Test if agents respond the same way to similar questions.
Edge cases - Test how agents handle unclear questions, missing data, or unusual scenarios.
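The consistency area above can be sketched as a check that responses to differently worded versions of the same question stay close to each other. The example below uses character-level similarity from Python's standard `difflib` module as a crude stand-in. The function names are hypothetical, and this is not how Atlas AI measures consistency.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def min_pairwise_similarity(responses: list[str]) -> float:
    """Lowest similarity ratio (0.0 to 1.0) across all pairs of responses."""
    cleaned = [normalize(response) for response in responses]
    if len(cleaned) < 2:
        return 1.0  # Nothing to compare against.
    return min(
        SequenceMatcher(None, a, b).ratio() for a, b in combinations(cleaned, 2)
    )


# Responses an agent might give to paraphrased prompts (made-up data).
responses = [
    "Pump P-101 is operational.",
    "pump  P-101 is OPERATIONAL.",
]
score = min_pairwise_similarity(responses)
print(score)
```

Character-level similarity penalizes rewording even when the facts match, so a practical check would more likely compare extracted facts or use a semantic comparison. The sketch only shows the shape of the test.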
