Why Evals Matter
AI agents are non-deterministic: the same input can yield different results. Evals help you:

- Validate outputs: Ensure agents produce correct responses
- Check tool usage: Verify agents use the right tools with correct inputs
- Monitor performance: Track execution time and token usage
- Catch regressions: Prevent breaking changes during development
- Test execution patterns: Validate sequential and parallel tool execution
How Evals Work
Timbal’s eval system uses a YAML-based test definition format with a powerful validator system:

Core Components
- Validators: 20+ validators for checking outputs, patterns, types, and more (see the sketch after this list)
- Flow Validators: Validate execution sequences and parallel tool calls
- LLM Validators: AI-powered semantic validation for natural language
- CLI: Command-line interface for running and discovering evals
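To make these components concrete, here is a hedged fragment of an eval's validator section. The validator names (`contains`, `llm`, `lt`) and the shape of `seq!` are illustrative assumptions, not the library's confirmed syntax; check the validator reference for the real names.

```yaml
# Illustrative fragment only; validator names and shapes are assumptions.
output:
  contains: "order confirmed"                    # deterministic check on the final output
  llm: "The reply politely confirms the order"   # LLM validator: semantic check in natural language
elapsed:
  lt: 5000          # validator on total execution time (units assumed to be milliseconds)
seq!:               # flow validator: assumed ordered list of expected tool calls
  - lookup_order
  - send_confirmation
```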
Quick Start
1. Create a test file
Create a file named `eval_greeting.yaml`:
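A minimal sketch of what the file could contain, assuming an agent exported as `agent` from `agent.py`; the top-level list layout and the `contains` validator name are assumptions, so verify the exact schema against the eval reference.

```yaml
# eval_greeting.yaml (hypothetical content; confirm the exact schema in the docs)
- name: greeting_mentions_user
  description: The agent should greet the user by name
  runnable: agent.py::agent        # path to the runnable, file.py::name
  params:
    prompt: "Hi, my name is Ada!"
  output:
    contains: "Ada"                # assumed deterministic output validator
```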
2. Run your evals
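The exact subcommand and flags are not shown here, so treat the commands below as a guess at the typical shape rather than a confirmed CLI reference.

```bash
# Hypothetical invocations; run the Timbal CLI's help to confirm the real subcommand and flags.
timbal eval eval_greeting.yaml   # run a single eval file
timbal eval .                    # discover and run every eval file under the current directory
```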
3. View results
The CLI displays pytest-style output with pass/fail status, duration, and detailed failure information.

Eval Structure
Each eval consists of:

| Field | Description | Required |
|---|---|---|
| `name` | Unique identifier for the eval | Yes |
| `runnable` | Path to the runnable (`file.py::name`) | Yes |
| `params` | Input parameters for the runnable | No |
| `description` | Human-readable description | No |
| `tags` | List of tags for filtering | No |
| `timeout` | Maximum execution time in milliseconds | No |
| `output` | Validators for the final output | No |
| `elapsed` | Validators for total execution time | No |
| `seq!` | Sequence flow validator | No |
Eval names must be unique across all eval files. The CLI will error if duplicate names are found.
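For reference, here is a hedged sketch that combines most of the fields above; the tool name, the validator names (`contains`, `lt`), and the shape of `seq!` are assumptions introduced for illustration.

```yaml
# Hypothetical eval using the optional fields; validator names and shapes are assumptions.
- name: weather_agent_calls_tool
  description: The agent should call the weather tool before answering
  runnable: agent.py::agent
  tags: [weather, tools]
  timeout: 30000                   # maximum execution time, in milliseconds
  params:
    prompt: "What's the weather in Paris right now?"
  output:
    contains: "Paris"              # assumed output validator
  elapsed:
    lt: 10000                      # assumed "less than" validator on total execution time (ms)
  seq!:                            # assumed shape: ordered list of tool calls the agent must make
    - get_weather
```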