Agents user guide

Last updated Mar 7, 2025

Evaluations

Note: This feature is available in preview.

About evaluations

When you're building an agent, the goal is to make it reliable—something you can trust to give the right output consistently. Evaluations help you figure out if your agent is doing a good job or if it needs improvement.

Terminology

An evaluation pairs an input with an assertion made on the output. An assertion is a defined condition or rule used to evaluate whether the agent's output meets the expected output.

Evaluation sets are logical groupings of evaluations.

Evaluation results are traces of completed evaluation runs that assess an agent's performance. During these runs, the agent's accuracy, efficiency, and decision-making are measured against the assertions in each evaluation and scored on a scale from 0 to 100. Failed evaluation runs need to be debugged and re-run.
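
To make these terms concrete, the following Python sketch shows how an evaluation, an evaluation set, and a score fit together conceptually. It is an illustration only; the class and field names are hypothetical and do not reflect the product's internal data model.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical illustration of the terminology above; not the product's data model.
@dataclass
class Evaluation:
    input: dict[str, Any]                    # the input given to the agent
    expected_output: dict[str, Any]          # the output the agent should produce
    assertion: Callable[[Any, Any], float]   # rule that scores expected vs. actual, 0-100

@dataclass
class EvaluationSet:
    name: str
    evaluations: list[Evaluation]            # a logical grouping of evaluations

def run_evaluation(agent: Callable[[dict[str, Any]], dict[str, Any]],
                   evaluation: Evaluation) -> float:
    """Run the agent on the evaluation's input and score its actual output."""
    actual_output = agent(evaluation.input)
    return evaluation.assertion(evaluation.expected_output, actual_output)
```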

Creating evaluations

Before you create an evaluation, test your agent to check whether its output is correct. If the agent executes correctly, you can create evaluations from the correct traces. If it does not, you can create evaluations from scratch.

Creating evaluations from agent test runs and traces

  1. After you design your agent, in the Playground pane, add the necessary input for the test run and select Run.
  2. Once the run is complete, if the output is correct, select the Add to eval set button.

    If the agent's output isn't correct, you can:

    • Refine the prompt: Adjust the prompt and rerun the agent until the output is correct.
    • Create evaluations from incorrect outputs: Generate evaluations based on the incorrect outputs and manually edit them to align with the expected outcome.

    Alternatively, after the test run, go to the Traces tab to see the details of the run. Select View trace, then select Add to eval set.

  3. Select Create eval set and choose a name for this set. Confirm your action by selecting the check-mark icon.

    The new set is now listed in the Select eval sets pane. Select it, then select Next to go to the Create evaluation window. Here you will create the first evaluation in the set.

  4. In the Create evaluation window, the Input and Expected output fields are pre-filled with the input and output arguments you created for the agent's prompt. If using the default LLM-as-a-Judge assertion type, add an evaluation prompt, then select Create to finalize the evaluation.

Creating evaluations from scratch

  1. After you design your agent, go to the Evaluations tab and select Create set.

    You can also select Import to use existing JSON data from evaluations of other agents.

  2. Choose a name for your new evaluation set and select Create.

    The evaluation set is created and the Create evaluation window is displayed.

  3. Create the first evaluation in this set:
    1. Configure the Input fields. These fields are inherited from the input arguments you created for the prompts.
    2. Configure the Expected output. This is inherited from the output argument(s) you created.
    3. Under Evaluation settings, configure the following fields:
      • Select the Target output field:
        • Root-level targeting (* All): Evaluates the entire output.

        • Field-specific targeting: Assesses specific first-level fields. Use the dropdown menu to select a field. The listed output fields are inherited from the output arguments you defined for the system prompt. For details, see Prompt arguments.

          For example, the Target output field options include * (All), the default option, and Rewritten content, an output argument defined when designing the agent.

      • Select the Assertion type. This represents the evaluation method (a conceptual sketch of both methods follows these steps):
        • LLM-as-a-Judge (default method)
          • Recommended as the default approach when targeting the root output.
          • Provides flexible evaluation of complex outputs.
          • Can assess quality and correctness beyond exact matching.
          • Best used when evaluating reasoning, natural language responses, or complex structured outputs.
        • Equals
          • Recommended when expecting exact matches.
          • Most effective when output requirements are strictly defined.
          • Works with complex objects, but is best used with:
            • Boolean responses (true/false)
            • Specific numerical values
            • Exact string matches
            • Arrays of primitives
    4. Select Create to save the new evaluation.
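
The following sketch illustrates, conceptually, how the two assertion types and the two targeting modes differ. The function names, signatures, and field names are hypothetical, chosen only to mirror the options described above; they are not part of the product.

```python
from typing import Any, Callable

# Conceptual sketch of the two assertion types and the two targeting modes.
# Function names, signatures, and field names are hypothetical illustrations,
# not the product's implementation.

def select_target(output: dict[str, Any], target_field: str = "*") -> Any:
    """Target output field: '*' (All) evaluates the entire output;
    otherwise a single first-level field is assessed."""
    return output if target_field == "*" else output[target_field]

def equals_assertion(expected: Any, actual: Any) -> float:
    """Equals: exact match. Best for booleans, specific numbers,
    exact strings, and arrays of primitives."""
    return 100.0 if expected == actual else 0.0

def llm_judge_assertion(expected: Any, actual: Any,
                        judge: Callable[[Any, Any], float]) -> float:
    """LLM-as-a-Judge: delegate scoring to a model driven by an evaluation
    prompt (see the next section). `judge` stands in for whatever LLM call
    is available and is assumed to return a score from 0 to 100."""
    return judge(expected, actual)

# Example: compare only a hypothetical 'approved' field exactly,
# instead of judging the whole output.
expected = {"approved": True, "rewritten_content": "Hello world"}
actual = {"approved": True, "rewritten_content": "Hello, world!"}
print(equals_assertion(select_target(expected, "approved"),
                       select_target(actual, "approved")))   # 100.0
```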

Structure your evaluation prompt

A well-structured output makes evaluations more reliable: structured outputs ensure consistency and make comparisons easier.

The following is an example of a predefined prompt that evaluates the entire output:

Example prompt

As an expert evaluator, analyze the semantic similarity of these JSON contents to determine a score from 0-100. Focus on comparing the meaning and contextual equivalence of corresponding fields, accounting for alternative valid expressions, synonyms, and reasonable variations in language while maintaining high standards for accuracy and completeness. Provide your score with justification, explaining briefly and concisely why you gave that score.

Expected Output: {{ExpectedOutput}}

ActualOutput: {{ActualOutput}}
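
At run time, the {{ExpectedOutput}} and {{ActualOutput}} placeholders are filled with the evaluation's expected output and the agent's actual output. The sketch below shows one way such a substitution could work; the plain string replacement and the sample values are assumptions for illustration, not the product's templating mechanism.

```python
import json

# Hypothetical illustration of how the placeholders might be substituted at
# evaluation time; the real templating mechanism and prompt handling may differ.
EVALUATION_PROMPT = (
    "As an expert evaluator, analyze the semantic similarity of these JSON "
    "contents to determine a score from 0-100. Provide your score with "
    "justification.\n\n"
    "Expected Output: {{ExpectedOutput}}\n\n"
    "ActualOutput: {{ActualOutput}}"
)

def render_prompt(template: str, expected: dict, actual: dict) -> str:
    """Fill the {{ExpectedOutput}} and {{ActualOutput}} placeholders."""
    return (template
            .replace("{{ExpectedOutput}}", json.dumps(expected, indent=2))
            .replace("{{ActualOutput}}", json.dumps(actual, indent=2)))

# Example with a hypothetical 'rewritten_content' output field.
print(render_prompt(
    EVALUATION_PROMPT,
    expected={"rewritten_content": "Hello world"},
    actual={"rewritten_content": "Hello, world!"},
))
```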

Number of evaluations

For simple agents, aim for approximately 30 evaluations across 1-3 evaluation sets. For more complex agents, aim for at least double that number.

The number of evaluations depends on:

  • Agent complexity
    • Number of input parameters
    • Output structure complexity
    • Tool usage patterns
    • Decision branches
  • Input
    • Range of possible inputs: data types, value ranges, optional fields
    • Edge cases
  • Usage patterns
    • Common use cases
    • Different personas
    • Error scenarios

Evaluation sets

Grouping evaluations into sets helps organize them better. For example, you can have:

  • One set for full output evaluation
  • Another for edge cases
  • Another for handling misspellings.

Coverage principles

  • Logical coverage: Map out input combinations, edge cases, and boundary conditions.
  • Redundancy management: Aim for 3-5 different evaluations per logically equivalent case.
  • Quality over quantity: More evaluations don’t always mean better results. Focus on meaningful tests.

When to create evaluations

Create evaluations once the arguments are stable or complete. That also means your use case is established, and the prompt, tools, and Context Grounding indexes are finalized.

If you modify the arguments, you need to adjust your evaluations accordingly. To minimize additional work, it's best to start with stable agents that have well-defined use cases.

You can export and import evaluation sets between agents within the same organization or across different organizations. As long as your agent design is complete, you can move evaluations around as needed without having to recreate them from scratch.
