Evaluations
When you're building an agent, the goal is to make it reliable—something you can trust to give the right output consistently. Evaluations help you figure out if your agent is doing a good job or if it needs improvement.
An evaluation pairs an input with an assertion made on the output. An assertion is a defined condition or rule used to check whether the agent's output meets the expected output.
Evaluation sets are logical groupings of evaluations.
Evaluation results are traces for completed evaluation runs that assess the performance of an agent. During these runs, the agent's accuracy, efficiency, and decision-making ability are measured and scored. The evaluation score indicates how well the agent performs against the assertions in a specific evaluation, on a scale from 0 to 100. Failed evaluation runs need to be debugged and re-run.
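To make these terms concrete, the sketch below shows one way an evaluation, an evaluation set, and an evaluation result relate to each other. It is a hypothetical Python representation for illustration only; the class and field names are assumptions and do not reflect the product's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical, illustrative representation of the concepts above;
# not the product's actual data model.

@dataclass
class Evaluation:
    input: dict[str, Any]            # input arguments passed to the agent
    expected_output: dict[str, Any]  # the output the agent should produce
    assertion_type: str = "LLM-as-a-Judge"  # or "Equals"

@dataclass
class EvaluationSet:
    name: str                        # logical grouping, e.g. "Edge cases"
    evaluations: list[Evaluation] = field(default_factory=list)

@dataclass
class EvaluationResult:
    evaluation: Evaluation
    actual_output: dict[str, Any]    # what the agent actually returned
    score: int                       # 0-100, how well the assertion was satisfied
```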
Before you create an evaluation, first test your agent to check whether its output is correct. If your agent executes correctly, you can create evaluations from the correct traces. If your agent does not execute correctly and its output is incorrect, you can create evaluations from scratch.
Creating evaluations from correct traces
- After you design your agent, in the Playground pane, add the necessary input for the test run and select Run.
- Once the run is complete, if the output is correct, select the Add to eval set button.
If the agent's output isn't correct, you can:
- Refine the prompt: Adjust the prompt and rerun the agent until the output is correct.
- Create evaluations from incorrect outputs: Generate evaluations based on the incorrect outputs and manually edit them to align with the expected outcome.
Alternatively, after the test run, go to the Traces tab to see the details of the run. Select View trace, then select Add to eval set.
- Select Create eval set and choose a name for this set. Confirm your action by selecting the check-mark icon.
The new set is now listed in the Select eval sets pane. Select it, then select Next to go to the Create evaluation window. Here you will create the first evaluation in the set.
- In the Create evaluation window, the Input and Expected output fields are pre-filled with the input and output arguments you created for the agent's prompt. If you use the default LLM-as-a-Judge assertion type, add an evaluation prompt, then select Create to finalize the evaluation.
Creating evaluations from scratch
- After you design your agent, go to the Evaluations tab and select Create set.
You can also select Import to use existing JSON data from evaluations of other agents.
- Choose a name for your new evaluation set and select Create.
The evaluation set is created and the Create evaluation window is displayed.
- Create the first evaluation in this set:
- Configure the Input fields. These fields are inherited from the input arguments you created for the prompts.
- Configure the Expected output. This is inherited from the output argument(s) you created.
- Under Evaluation settings, configure the following fields:
- Select the Target output field:
  - Root-level targeting (* All): Evaluates the entire output.
  - Field-specific targeting: Assesses specific first-level fields. Use the dropdown menu to select a field. The listed output fields are inherited from the output arguments you defined for the system prompt. For details, see Prompt arguments. A sketch of the two targeting modes follows this procedure.
  The following image shows the Target output field options: * (All), the default option, and Rewritten content, an output argument defined when designing the agent.
- Select the Assertion type. This represents the evaluation method:
  - LLM-as-a-Judge (default method)
    - Recommended as the default approach when targeting the root output.
    - Provides flexible evaluation of complex outputs.
    - Can assess quality and correctness beyond exact matching.
    - Best used when evaluating reasoning, natural language responses, or complex structured outputs.
  - Equals
    - Recommended when expecting exact matches.
    - Most effective when output requirements are strictly defined.
    - Works with complex objects, but is best used with:
      - Boolean responses (true/false)
      - Specific numerical values
      - Exact string matches
      - Arrays of primitives
- Select Create to save the new evaluation.
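The Target output field determines what the assertion is applied to: the whole output for root-level targeting, or a single first-level field for field-specific targeting. The sketch below illustrates that idea with a hypothetical helper; the helper function and the sample output are illustrative only, and the Rewritten content field is the example output argument mentioned above.

```python
from typing import Any

def select_target(output: dict[str, Any], target_field: str = "*") -> Any:
    """Hypothetical helper showing what the assertion would be applied to."""
    if target_field == "*":
        # Root-level targeting (* All): evaluate the entire output.
        return output
    # Field-specific targeting: evaluate a single first-level field.
    return output[target_field]

# Illustrative structured output with two first-level fields.
actual_output = {"Rewritten content": "Hello team, the report is ready.", "Tone": "friendly"}

print(select_target(actual_output))                       # the entire output object
print(select_target(actual_output, "Rewritten content"))  # only the selected field
```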
A well-structured output makes evaluations more reliable: structured outputs ensure consistency and make comparisons easier.
The following image shows an example of a predefined prompt that evaluates the entire output:
As an expert evaluator, analyze the semantic similarity of these JSON contents to determine a score from 0-100. Focus on comparing the meaning and contextual equivalence of corresponding fields, accounting for alternative valid expressions, synonyms, and reasonable variations in language while maintaining high standards for accuracy and completeness. Provide your score with justification, explaining briefly and concisely why you gave that score.
Expected Output: {{ExpectedOutput}}
Actual Output: {{ActualOutput}}
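For contrast, the sketch below shows the difference between the two assertion types in plain Python: a strict comparison for Equals, and filling the judge prompt above with the expected and actual outputs for LLM-as-a-Judge. The template string reuses the example prompt text; the helper functions, the call to the judging model, and the parsing of its 0-100 score are hypothetical and omitted.

```python
import json

# The example evaluation prompt from above, with Python placeholders for the two outputs.
JUDGE_PROMPT_TEMPLATE = (
    "As an expert evaluator, analyze the semantic similarity of these JSON contents "
    "to determine a score from 0-100. Focus on comparing the meaning and contextual "
    "equivalence of corresponding fields, accounting for alternative valid expressions, "
    "synonyms, and reasonable variations in language while maintaining high standards "
    "for accuracy and completeness. Provide your score with justification, explaining "
    "briefly and concisely why you gave that score.\n"
    "Expected Output: {expected}\n"
    "Actual Output: {actual}"
)

def equals_assertion(expected, actual) -> bool:
    # Equals: strict comparison; best for booleans, numbers, exact strings,
    # and arrays of primitives.
    return expected == actual

def build_judge_prompt(expected: dict, actual: dict) -> str:
    # LLM-as-a-Judge: embed both outputs in the evaluation prompt.
    # Sending the prompt to a model and reading back the score is not shown here.
    return JUDGE_PROMPT_TEMPLATE.format(
        expected=json.dumps(expected), actual=json.dumps(actual)
    )

print(equals_assertion([1, 2, 3], [1, 2, 3]))  # True: exact match on primitives
print(build_judge_prompt(
    {"Rewritten content": "Hello team, the report is ready."},
    {"Rewritten content": "Hi team, the report is now available."},
))
```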
For simple agents, aim for approximately 30 evaluations across 1-3 evaluation sets. For more complex agents, it is recommended to have at least double that amount or more.
The number of evaluations depends on:
- Agent complexity
  - Number of input parameters
  - Output structure complexity
  - Tool usage patterns
  - Decision branches
- Input
  - Range of possible inputs: data types, value ranges, optional fields
  - Edge cases
- Usage patterns
  - Common use cases
  - Different personas
  - Error scenarios
Grouping evaluations into sets helps organize them better. For example, you can have:
- One set for full output evaluation
- Another for edge cases
- Another for handling misspellings.
Coverage principles
- Logical coverage: Map out input combinations, edge cases, and boundary conditions.
- Redundancy management: Aim for 3-5 different evaluations per logically equivalent case.
- Quality over quantity: More evaluations don’t always mean better results. Focus on meaningful tests.
Create evaluations once the arguments are stable or complete. That also means your use case is established, and the prompt, tools, and Context Grounding indexes are finalized.
If you modify the arguments, you need to adjust your evaluations accordingly. To minimize additional work, it's best to start with stable agents that have well-defined use cases.
You can export and import evaluation sets between agents within the same organization or across different organizations. As long as your agent design is complete, you can move evaluations around as needed without having to recreate them from scratch.
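As a rough illustration of that workflow, the sketch below round-trips an evaluation set through a JSON file. The file name and field layout are hypothetical; the actual format comes from the product's Export action and may differ.

```python
import json
from pathlib import Path

# Hypothetical evaluation-set payload; the real exported schema may differ.
eval_set = {
    "name": "Edge cases",
    "evaluations": [
        {
            "input": {"content": "helo wrld"},
            "expectedOutput": {"Rewritten content": "Hello, world!"},
            "assertionType": "LLM-as-a-Judge",
        }
    ],
}

# Export from the source agent to a file...
Path("edge-cases.eval.json").write_text(json.dumps(eval_set, indent=2))

# ...then import the same file for the target agent, in the same or a different organization.
imported = json.loads(Path("edge-cases.eval.json").read_text())
print(f"Imported set '{imported['name']}' with {len(imported['evaluations'])} evaluation(s).")
```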