6 changes: 6 additions & 0 deletions build/agents/build-your-agent/evals.mdx
@@ -1,18 +1,24 @@
---
title: 'Evals'
sidebarTitle: 'Evals'
description: 'Test and evaluate your AI Agents with scenario-based evaluations and automated Evaluators'
---

<Info>
**Rollout Status**: Evals is currently being rolled out progressively, starting with Enterprise customers. If you're an Enterprise customer and don't see this feature in your account yet, reach out to your account manager to discuss access.

</Info>

The Evals section is your command center for testing and evaluating AI Agent performance. Located in the **Monitor** tab (next to the Run tab) in the Agent builder, Evals enables you to create Test Suites, define evaluation criteria (Evaluators), run automated evaluations, and monitor ongoing performance—all without manual testing.

![Evals section showing Test Suites, Evaluators, Runs, and Performance](/images/agent/agent-evals.png)

## Programmatic access

You can manage the full evaluation lifecycle programmatically using the Relevance AI MCP server. This covers creating test sets and test cases, configuring evaluator rules and tool simulations, triggering runs, and retrieving results — enabling CI/CD integration and automated testing workflows. See [Programmatic evals via MCP](/build/agents/build-your-agent/evals/programmatic-evals) for details.

---

## What you can do with Evals

<CardGroup cols={3}>
<Card title="Conduct Tests" icon="flask-vial">
@@ -28,11 +34,11 @@

---

## Evals sections

Evals is organized into five main sections, accessible from the left sidebar:

- **Test Suites** — Create and manage groups of Test scenarios for your Agent. Each Test Suite can contain multiple scenarios with different prompts and evaluation criteria.
- **Evaluators** — Configure global evaluation criteria that can be applied across any Test Suite or scenario without needing to set them up each time.
- **Runs** — View your evaluation run history and results. See average scores, number of conversations evaluated, progress status, credit spend, and creation dates for all past runs.
- **Publish Checks** — Configure which Test Suites must pass before your Agent can be published. Set a pass threshold and optionally block publishing if evaluations fail.
@@ -106,7 +112,7 @@
To create a global Evaluator:

<div style={{ width:"100%", position:"relative", paddingTop:"56.25%" }}>
<iframe src="https://app.supademo.com/embed/cmmmtwq7z1lsj9cvj5kwwifwi" frameBorder="0" title="Creating a global Evaluator" allow="clipboard-write; fullscreen" webkitAllowFullscreen="true" mozAllowFullscreen="true" allowFullscreen style={{ position:"absolute",top:0,left:0,width:"100%",height:"100%",border:"3px solid #5E43CE",borderRadius:"10px" }} />

</div>

1. Go to the **Monitor** tab and select **Evals**, then select **Evaluators**
@@ -125,7 +131,7 @@
## Creating a Test Suite with a scenario

<div style={{ width:"100%", position:"relative", paddingTop:"56.25%" }}>
<iframe src="https://app.supademo.com/embed/cmmmvldns1nlq9cvjzy4nkpe0" frameBorder="0" title="Creating a Test Suite" allow="clipboard-write; fullscreen" webkitAllowFullscreen="true" mozAllowFullscreen="true" allowFullscreen style={{ position:"absolute",top:0,left:0,width:"100%",height:"100%",border:"3px solid #5E43CE",borderRadius:"10px" }} />

</div>

Follow these steps to create your first evaluation Test Suite:
@@ -282,7 +288,7 @@

The Performance tab also includes:

- **Data points** for the overall score over time
- **Evaluator breakdown** showing individual scoring per Evaluator
- **Graphs** visualizing Evaluator performance trends
- **List of evaluation runs** with score, name, and the ability to view the full conversation
234 changes: 234 additions & 0 deletions build/agents/build-your-agent/evals/programmatic-evals.mdx
@@ -0,0 +1,234 @@
---
title: "Programmatic evals via MCP"

Check warning on line 2 in build/agents/build-your-agent/evals/programmatic-evals.mdx

View check run for this annotation

Mintlify / Mintlify Validation (relevanceai) - vale-spellcheck

build/agents/build-your-agent/evals/programmatic-evals.mdx#L2

Did you really mean 'evals'?
sidebarTitle: "Programmatic evals"
description: "Manage the full evaluation lifecycle programmatically using MCP tools from your AI coding assistant."
---

The Relevance AI MCP server includes 19 tools for managing evaluations programmatically. This covers the complete evaluation lifecycle: creating test sets and test cases, configuring evaluator rules and tool simulations, running evaluations, and monitoring batch results.

This enables CI/CD integration, automated testing frameworks, and bulk operations that would be impractical to do through the UI.

<Info>
This page covers the MCP tools for programmatic eval management. For the UI-based workflow, see [Evals](/build/agents/build-your-agent/evals).

</Info>

<Info>
**Rollout Status**: Evals is currently being rolled out progressively, starting with Enterprise customers. If you're an Enterprise customer and don't see this feature in your account yet, reach out to your account manager to discuss access.

</Info>

---

## Prerequisites

You need the Relevance AI MCP server connected to your AI coding assistant before using these tools. See the [MCP Server](/integrations/mcp/programmatic-gtm/mcp-server) page for setup instructions.

For better results, also clone the [agent skills](/integrations/mcp/programmatic-gtm/agent-skills) repository — it gives your assistant the knowledge to use MCP tools correctly.

---

## Managing test sets

Test sets (also called Test Suites in the UI) are containers for test cases that you run together as a group.

### What you can do

- Create a new test set for an agent
- List all test sets for an agent
- Get the details of a specific test set
- Update a test set's name or configuration
- Delete a test set

### Example prompts

```
Create a test set called "Customer Support Regression" for agent [agent-id]
```

```
List all test sets for my support agent
```

```
Delete the test set named "Draft Tests" from agent [agent-id]
```

---

## Managing test cases

Test cases are individual scenarios within a test set. Each test case defines a simulated user persona, an opening message, conversation limits, and its own evaluator rules.

### What you can do

- Create a test case within a test set
- List all test cases in a test set
- Get the details of a specific test case
- Update a test case's scenario, persona, or configuration
- Delete a test case

### Example prompts

```
Add a test case to the "Customer Support Regression" test set:
- Scenario name: Billing Dispute
- Persona: An upset customer who was charged twice for the same order
- First message: "I've been double charged and no one is helping me"
- Max turns: 8
```

```
List all test cases in test set [test-set-id]
```

```
Update the "Billing Dispute" test case to increase max turns to 12
```

---

## Configuring evaluator rules

Evaluator rules define the criteria used to assess whether an agent's response passes or fails a test case. You can add, update, and remove evaluator rules on individual test cases.

### Evaluator rule types

| Type | What it checks |
|------|---------------|
| LLM Judge | Evaluates the conversation against a prompt you write, using an LLM to score the result |
| String Contains | Checks whether the agent's response includes specific text |
| String Equals | Checks whether the agent's response exactly matches an expected value |
| Tool Usage | Checks whether a specific tool was used, and how many times or in what position |
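
To make the rule types concrete, here's a hedged sketch of how their configurations might be shaped. Every field name below is hypothetical — the actual schema comes from the MCP tools themselves, so describe the rule to your assistant rather than hand-writing these:

```python
# Illustrative shapes only — hypothetical field names, not the real MCP schema.
llm_judge = {
    "type": "llm_judge",
    "name": "Empathy Check",
    "prompt": (
        "Did the agent acknowledge the customer's frustration "
        "before offering a solution?"
    ),
}
string_contains = {"type": "string_contains", "value": "refund"}
string_equals = {"type": "string_equals", "value": "ESCALATED"}
tool_usage = {
    "type": "tool_usage",
    "tool": "escalate_to_human",
    "min_uses": 1,  # passes if the tool was used at least once
}
```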

### What you can do

- Add an evaluator rule to a test case
- Update an existing evaluator rule
- Remove an evaluator rule from a test case
- List all evaluator rules on a test case

### Example prompts

```
Add an LLM Judge evaluator to test case [test-case-id]:
- Name: Empathy Check
- Prompt: Did the agent acknowledge the customer's frustration before offering a solution?
```

```
Add a Tool Usage evaluator to test case [test-case-id]:
- Name: Escalation Tool Used
- Tool: escalate_to_human
- Check that it was used at least once
```

```
Remove the "String Contains" evaluator from test case [test-case-id]
```

---

## Configuring tool simulation

Tool simulation lets you emulate tool responses during evaluations without actually calling the real tools. This is useful for testing how your agent handles specific tool outputs without incurring real API calls or side effects.

Tool simulations are configured at the test case level. You specify the tool to simulate and a prompt describing the fake response the tool should return.
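
As a rough sketch, a simulation pairs the intercepted tool with a prompt describing its fake output — the field names here are hypothetical, not the actual MCP schema:

```python
# Hypothetical shape of a tool simulation — illustrative field names only.
simulation = {
    "tool": "get_customer_account",  # the real tool to intercept
    "simulation_prompt": (
        "Return a customer account showing two identical charges "
        "of $49.99 on the same date"
    ),
}
```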

### Example prompts

```
Add a tool simulation to test case [test-case-id]:
- Tool: get_customer_account
- Simulation prompt: Return a customer account showing two identical charges of $49.99 on the same date
```

```
Update the tool simulation for "get_order_status" in test case [test-case-id] to return a delayed shipment scenario
```

```
Remove the tool simulation for "send_email" from test case [test-case-id]
```

---

## Running evaluations

You can trigger evaluation runs programmatically against a test set. This is the same operation as clicking **Run** in the UI, but callable from scripts, CI pipelines, and automated workflows.

### What you can do

- Run a test set (runs all test cases in the set)
- Run an individual test case
- Include or exclude global evaluators from a run

### Example prompts

```
Run the "Customer Support Regression" test set for agent [agent-id]
```

```
Run test case [test-case-id] and include the "Professional Tone" global evaluator
```

```
Trigger an evaluation run on test set [test-set-id] and name it "v2.3 release check"
```

---

## Monitoring batch results

After triggering a run, you can retrieve the results programmatically — including per-test-case scores, evaluator verdicts, and conversation logs.

### What you can do

- List all evaluation runs for a test set
- Get the detailed results for a specific run, including scores and evaluator verdicts
- Check whether a run is still in progress or complete

### Example prompts

```
List all evaluation runs for test set [test-set-id]
```

```
Get the results for evaluation run [run-id] — show me which test cases passed and which failed
```

```
Check if the latest evaluation run for the "Customer Support Regression" test set has completed
```

---

## CI/CD integration

Because evaluation runs are fully programmable via MCP, you can integrate them into automated pipelines:

- Trigger a test set run as part of a pre-deployment check
- Poll for completion and parse pass/fail status
- Block deployment if scores fall below a threshold (a minimal gate is sketched below)
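
Here's a minimal sketch of such a gate, assuming a hypothetical `EvalClient` Python wrapper around the MCP eval tools — every module, method, and field name below is illustrative, not the actual API:

```python
"""CI gate sketch. `EvalClient`, its methods, and the env var names
are hypothetical placeholders, not the real Relevance AI API."""
import os
import sys
import time

from eval_client import EvalClient  # hypothetical wrapper module

PASS_THRESHOLD = 0.80

client = EvalClient(api_key=os.environ["RAI_API_KEY"])
run = client.trigger_run(
    agent_id=os.environ["AGENT_ID"],
    test_set_id=os.environ["TEST_SET_ID"],
    name="pre-deploy check",
)

# Poll every 10 seconds until the run completes
result = client.get_run(run.id)
while not result.complete:
    time.sleep(10)
    result = client.get_run(run.id)

# Fail the pipeline if any test case scores below the threshold
failing = [c for c in result.test_cases if c.score < PASS_THRESHOLD]
for case in failing:
    print(f"FAIL {case.name}: {case.score:.0%}")

sys.exit(1 if failing else 0)  # non-zero exit blocks deployment
```

The example below shows the same workflow driven through an AI coding assistant instead of a script.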

<Accordion title="Example CI/CD workflow using an AI coding assistant">
Ask your AI coding assistant:

```
1. Trigger an evaluation run for test set [test-set-id] on agent [agent-id]
2. Poll every 10 seconds until the run is complete
3. Check whether all test cases passed
4. If any test case scored below 80%, list the failing cases with their evaluator verdicts
5. Return a summary with overall pass rate
```

Your assistant will use the MCP eval tools to carry out each step and return a structured report you can act on.

</Accordion>

---

## Learn more

- [Evals (UI workflow)](/build/agents/build-your-agent/evals) — create and manage evaluations through the Relevance AI interface
- [MCP Server](/integrations/mcp/programmatic-gtm/mcp-server) — connect your AI coding assistant to Relevance AI
- [Agent Skills](/integrations/mcp/programmatic-gtm/agent-skills) — give your assistant built-in knowledge of Relevance AI tools
8 changes: 7 additions & 1 deletion docs.json
Original file line number Diff line number Diff line change
@@ -98,7 +98,13 @@
"build/agents/build-your-agent/alerts",
"build/agents/build-your-agent/memory",
"build/agents/build-your-agent/variables",
"build/agents/build-your-agent/evals",
{
"group": "Evals",
"pages": [
"build/agents/build-your-agent/evals",
"build/agents/build-your-agent/evals/programmatic-evals"
]
},
{
"group": "Trigger Types",
"pages": [
4 changes: 2 additions & 2 deletions integrations/mcp/programmatic-gtm/agent-skills.mdx
@@ -4,7 +4,7 @@
description: "Give your AI coding assistant built-in knowledge of Relevance AI by cloning the agent skills repository."
---

The [Relevance AI agent skills](https://github.com/RelevanceAI/agent-skills) repository is a local reference that teaches your AI coding assistant how to work with Relevance AI. Clone it once and your assistant gets detailed context on agents, tools, workforces, knowledge, analytics, and evals — without needing to figure things out from scratch.

<Info>Agent skills work alongside the [MCP server](/integrations/mcp/programmatic-gtm/mcp-server). The MCP server gives your assistant the **ability** to call Relevance AI tools. Agent skills give it the **knowledge** to use them well.</Info>

@@ -54,10 +54,10 @@

<CardGroup cols={2}>
<Card title="SKILL.md" icon="book">
The main skill definition — covers all 46 MCP tools, critical usage rules, and workflow patterns your assistant should follow.
The main skill definition — covers all 65 MCP tools, critical usage rules, and workflow patterns your assistant should follow.
</Card>
<Card title="Reference docs" icon="folder-open">
Detailed guides for agents, tools, workforces, knowledge, analytics, and evals that the assistant reads when working on specific tasks.

</Card>
</CardGroup>

@@ -80,7 +80,7 @@
Usage metrics and reporting capabilities.
</Card>
<Card title="Evals" icon="check-double">
Testing agent behaviour with evaluation cases.
Full evaluation lifecycle — creating test sets and test cases, configuring evaluator rules and tool simulations, running evaluations, and monitoring batch results.
</Card>
</CardGroup>
