This guide covers the complete syntax for writing evals, from basic tests to complex multi-validator scenarios.
File Structure
Evals are defined in YAML files. Each file can contain multiple eval definitions:
# eval_search.yaml
- name: basic_search
  description: Test basic product search
  runnable: agents/search.py::search_agent
  params:
    prompt: "Find me a laptop"
  output:
    contains!: "laptop"
    type!: "string"

- name: search_with_filters
  description: Test search with price filters
  runnable: agents/search.py::search_agent
  params:
    prompt: "Find laptops under $1000"
  output:
    pattern!: "\\$\\d+"
Eval Definition
Required Fields
- name: my_eval_name
  runnable: path/to/agent.py::agent_name
Eval names must be unique across all eval files in your project. The CLI will error if duplicate names are found.
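For example, these two definitions would make the CLI error, even though they live in different files (a minimal sketch; the file names and runnables are hypothetical):

# eval_search.yaml
- name: basic_search
  runnable: agents/search.py::search_agent

# eval_regression.yaml
- name: basic_search   # duplicates the name above -- the CLI errors
  runnable: agents/search_v2.py::search_agent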
Optional Fields
- name: complete_example
  description: "A thorough test of greeting behavior"
  runnable: agent.py::agent
  tags:
    - greeting
    - smoke-test
  timeout: 30000  # Milliseconds
  env:
    API_KEY: "test-key"
  params:
    prompt: "Hello"
  output:
    semantic!: "Friendly greeting"
  elapsed:
    lt!: 5000
Params Structure
The params field contains input parameters passed to your runnable:
Simple Prompt
params:
  prompt: "What's the weather like?"
With Messages
Use messages to establish conversation history for multi-turn testing:
params:
  messages:
    - role: user
      content: "What's the weather like in New York?"
    - role: assistant
      content: "It's currently 15°C and raining in New York."
    - role: user
      content: "Should I bring an umbrella?"
The agent receives the full conversation history and responds to the last message. This is useful for testing context retention, memory, and whether the agent avoids redundant tool calls when context is already available.
When using messages instead of prompt, the agent’s input key will be messages rather than prompt.
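A minimal sketch of the two input shapes side by side (the comments note the key the agent receives; illustrative only):

# Agent input key: prompt
params:
  prompt: "Hello"

# Agent input key: messages
params:
  messages:
    - role: user
      content: "Hello"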
Additional Parameters
Pass any additional parameters your agent accepts:
params:
  prompt: "Search for products"
  max_results: 10
  include_reviews: true
  category: "electronics"
Validating Output
The output section validates the final response from your agent:
output:
  not_null!: true
  type!: "string"
  min_length!: 10
  contains!: "success"
  not_contains!: "error"
  pattern!: "Order #\\d{6}"
  semantic!: "A confirmation message with order details"
Multiple validators can be combined; all of them must pass.
Validating Timing
The elapsed section validates total execution time in milliseconds:
elapsed:
  lt!: 5000   # Less than 5 seconds
  gte!: 100   # At least 100ms (not instant)
Validating Tool Calls
Validate specific tools by their name:
get_datetime:
  input:
    timezone:
      eq!: "Europe/Madrid"
      starts_with!: "Europe/"
  output:
    type!: "string"
    pattern!: "^\\d{4}-\\d{2}-\\d{2}"
  elapsed:
    lt!: 1000

search_products:
  input:
    query:
      contains!: "laptop"
    limit:
      lte!: 100
  output:
    type!: "array"
    min_length!: 1
Validating Token Usage
Access usage metrics on LLM spans:
llm:
  usage:
    input_tokens:
      lte!: 500
    output_tokens:
      lte!: 1000
The token field names depend on the model provider:
- OpenAI: use input_text_tokens and output_text_tokens (see the example below)
- Anthropic: use input_tokens and output_tokens
For multi-model scenarios, tokens are automatically summed across models.
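For example, the usage check above written for an OpenAI model uses the text-token field names (same thresholds as before):

llm:
  usage:
    input_text_tokens:
      lte!: 500
    output_text_tokens:
      lte!: 1000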
Flow Validators
Sequence Validation
Use seq! to validate the order of tool execution:
seq!:
  - llm
  - search_products
  - llm
With wildcards for flexible matching:
seq!:
  - llm
  - ...          # Any number of spans
  - send_email
Parallel Validation
Use parallel! to validate concurrent execution:
parallel!:
  - get_weather
  - get_datetime
  - get_stock_price
Nested Flow Validation
Combine sequence and parallel:
seq!:
  - llm
  - parallel!:
      - fetch_user
      - fetch_orders
  - llm
Span Validation Within Sequence
Validate span inputs/outputs within the sequence:
seq!:
  - llm:
      elapsed:
        lt!: 3000
      usage:
        input_tokens:
          lte!: 500
  - get_datetime:
      input:
        timezone:
          eq!: "Europe/Madrid"
      output:
        type!: "string"
  - llm
Complete Example
- name: time_query
  description: "Test time query with validation"
  runnable: "agent.py::agent"
  tags: ["datetime", "smoke"]
  timeout: 30000
  params:
    prompt: "what time is it in madrid"

  # Validate final output
  output:
    type!: "string"
    min_length!: 10
    contains!: ":"
    pattern!: "\\d{1,2}:\\d{2}"
    semantic!: "a time response mentioning Madrid"

  # Validate timing
  elapsed:
    lt!: 6000

  # Validate execution sequence
  seq!:
    - llm:
        elapsed:
          lte!: 40000
        usage:
          input_tokens:
            lte!: 1000
    - get_datetime:
        input:
          timezone:
            eq!: "Europe/Madrid"
            starts_with!: "Europe/"
        output:
          type!: "string"
          pattern!: "^\\d{4}-\\d{2}-\\d{2}"
    - llm
Wildcard Patterns in Sequences
| Pattern | Description |
|---------|-------------|
| `..` | Exactly 1 span |
| `...` | Any number of spans (0 or more) |
| `n..m` | Between n and m spans |
| `n..` | At least n spans |
| `..m` | At most m spans |
seq!:
  - llm
  - 1..3         # 1 to 3 spans
  - validate_input
  - ...          # Any number
  - send_response
Tags
Use tags to organize and filter evals:
- name: smoke_test_greeting
  tags:
    - smoke
    - greeting
    - quick
  runnable: agent.py::agent
  params:
    prompt: "Hi"
  output:
    not_null!: true
Environment Variables
Set environment variables for specific evals:
- name: test_with_api_key
  runnable: agent.py::agent
  env:
    API_KEY: "test-key-123"
    DEBUG: "true"
  params:
    prompt: "Fetch data"
  output:
    not_null!: true
Best Practices
Use descriptive names
Names should clearly indicate what’s being tested:

# Good
- name: search_returns_relevant_products
- name: handles_empty_search_results
- name: validates_timezone_input

# Avoid
- name: test1
- name: search_test
Combine validators effectively
Use multiple validators to thoroughly test behavior:

output:
  not_null!: true
  type!: "string"
  min_length!: 50
  contains!: "product"
  not_contains!: "error"
  semantic!: "Product recommendations with prices"
Validate both output and execution
Don’t just check the output; validate how the agent got there:

output:
  semantic!: "Weather information for Madrid"
seq!:
  - llm
  - get_weather:
      input:
        city:
          eq!: "Madrid"
  - llm
Use semantic validators for natural language
For outputs that can vary in wording but should convey the same meaning:

# Instead of exact matching
output:
  eq!: "The current time in Madrid is 14:30."

# Use semantic validation
output:
  semantic!: "Indicates the current time in Madrid"
Test error conditions
Create specific evals for error conditions:

- name: handles_invalid_timezone
  runnable: agent.py::agent
  params:
    prompt: "what time is it in xyzland"
  output:
    semantic!: "Indicates the timezone or location is not recognized"
File Naming Conventions
Timbal discovers eval files matching these patterns:
- eval*.yaml (e.g., eval_search.yaml, evals.yaml)
- *eval.yaml (e.g., search_eval.yaml, my_eval.yaml)
Organize evals by feature or agent:
evals/
├── eval_search.yaml
├── eval_support.yaml
├── eval_checkout.yaml
└── regression/
├── eval_smoke.yaml
└── eval_full.yaml