This guide covers the complete syntax for writing evals, from basic tests to complex multi-validator scenarios.
File Structure
Evals are defined in YAML files. Each file can contain multiple eval definitions:
# eval_search.yaml
- name: basic_search
  description: Test basic product search
  runnable: agents/search.py::search_agent
  params:
    prompt: "Find me a laptop"
  output:
    contains!: "laptop"
    type!: "string"

- name: search_with_filters
  description: Test search with price filters
  runnable: agents/search.py::search_agent
  params:
    prompt: "Find laptops under $1000"
  output:
    pattern!: "\\$\\d+"
Eval Definition
Required Fields
- name: my_eval_name
  runnable: path/to/agent.py::agent_name
Eval names must be unique across all eval files in your project. The CLI will error if duplicate names are found.
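The runnable value follows a module-path::object-name format. As a sketch only (the file layout and signature below are hypothetical, not mandated by Timbal), the target could be as simple as:

# path/to/agent.py -- hypothetical runnable target
def agent_name(prompt: str) -> str:
    """Receives the eval's params as keyword arguments and returns
    the output that the validators will check."""
    return f"Echo: {prompt}"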
Optional Fields
- name: complete_example
  description: "A thorough test of greeting behavior"
  runnable: agent.py::agent
  tags:
    - greeting
    - smoke-test
  timeout: 30000  # Milliseconds
  env:
    API_KEY: "test-key"
  params:
    prompt: "Hello"
  output:
    semantic!: "Friendly greeting"
  elapsed:
    lt!: 5000
Params Structure
The params field contains input parameters passed to your runnable:
Simple Prompt
params:
  prompt: "What's the weather like?"
With Messages
params:
  messages:
    - role: user
      content: "What's the weather like?"
Additional Parameters
Pass any additional parameters your agent accepts:
params:
  prompt: "Search for products"
  max_results: 10
  include_reviews: true
  category: "electronics"
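On the agent side, each key under params is passed through to your runnable. Here is a hedged sketch of a signature that would accept the example above (the exact calling convention is an assumption for illustration, not a Timbal requirement):

# agents/search.py -- hypothetical agent accepting extra parameters
def search_agent(
    prompt: str,
    max_results: int = 10,
    include_reviews: bool = False,
    category: str | None = None,
) -> list[dict]:
    """Each params key maps onto a keyword argument."""
    results = [{"title": f"Match for {prompt!r}", "category": category}]
    return results[:max_results]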
Validating Output
The output section validates the final response from your agent:
output:
  not_null!: true
  type!: "string"
  min_length!: 10
  contains!: "success"
  not_contains!: "error"
  pattern!: "Order #\\d{6}"
  semantic!: "A confirmation message with order details"
Multiple validators can be combined; all of them must pass.
Validating Timing
The elapsed section validates total execution time in milliseconds:
elapsed:
  lt!: 5000   # Less than 5 seconds
  gte!: 100   # At least 100 ms (not instant)
Validating Tool Calls
Validate specific tools by their name:
get_datetime:
  input:
    timezone:
      eq!: "Europe/Madrid"
      starts_with!: "Europe/"
  output:
    type!: "string"
    pattern!: "^\\d{4}-\\d{2}-\\d{2}"
  elapsed:
    lt!: 1000

search_products:
  input:
    query:
      contains!: "laptop"
    limit:
      lte!: 100
  output:
    type!: "array"
    min_length!: 1
Validating Token Usage
Access usage metrics on LLM spans:
llm:
  usage:
    input_tokens:
      lte!: 500
    output_tokens:
      lte!: 1000
For multi-model scenarios, tokens are automatically summed across models.
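As a conceptual illustration of those summing semantics (this is not Timbal's code), a bound such as lte!: 500 on input_tokens is checked against the total across all LLM spans:

# Conceptual illustration of summed usage across models (not Timbal's code)
llm_spans = [
    {"model": "model-a", "input_tokens": 320},
    {"model": "model-b", "input_tokens": 150},
]
total_input = sum(span["input_tokens"] for span in llm_spans)
assert total_input <= 500  # corresponds to input_tokens: lte!: 500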
Flow Validators
Sequence Validation
Use seq! to validate the order of tool execution:
seq!:
  - llm
  - search_products
  - llm
With wildcards for flexible matching:
seq!:
  - llm
  - ...           # Any number of spans
  - send_email
Parallel Validation
Use parallel! to validate concurrent execution:
parallel!:
  - get_weather
  - get_datetime
  - get_stock_price
Nested Flow Validation
Combine sequence and parallel:
seq!:
  - llm
  - parallel!:
      - fetch_user
      - fetch_orders
  - llm
Span Validation Within Sequence
Validate span inputs/outputs within the sequence:
seq!:
  - llm:
      elapsed:
        lt!: 3000
      usage:
        input_tokens:
          lte!: 500
  - get_datetime:
      input:
        timezone:
          eq!: "Europe/Madrid"
      output:
        type!: "string"
  - llm
Complete Example
- name: time_query
  description: "Test time query with validation"
  runnable: "agent.py::agent"
  tags: ["datetime", "smoke"]
  timeout: 30000
  params:
    prompt: "what time is it in madrid"

  # Validate final output
  output:
    type!: "string"
    min_length!: 10
    contains!: ":"
    pattern!: "\\d{1,2}:\\d{2}"
    semantic!: "a time response mentioning Madrid"

  # Validate timing
  elapsed:
    lt!: 6000

  # Validate execution sequence
  seq!:
    - llm:
        elapsed:
          lte!: 40000
        usage:
          input_tokens:
            lte!: 1000
    - get_datetime:
        input:
          timezone:
            eq!: "Europe/Madrid"
            starts_with!: "Europe/"
        output:
          type!: "string"
          pattern!: "^\\d{4}-\\d{2}-\\d{2}"
    - llm
Wildcard Patterns in Sequences
Pattern   Description
..        Exactly 1 span
...       Any number of spans (0 or more)
n..m      Between n and m spans
n..       At least n spans
..m       At most m spans
seq!:
  - llm
  - 1..3            # 1 to 3 spans
  - validate_input
  - ...             # Any number of spans
  - send_response
Tags
Use tags to organize and filter evals:
- name: smoke_test_greeting
  tags:
    - smoke
    - greeting
    - quick
  runnable: agent.py::agent
  params:
    prompt: "Hi"
  output:
    not_null!: true
Environment Variables
Set environment variables for specific evals:
- name: test_with_api_key
  runnable: agent.py::agent
  env:
    API_KEY: "test-key-123"
    DEBUG: "true"
  params:
    prompt: "Fetch data"
  output:
    not_null!: true
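Inside the runnable, these presumably surface as ordinary process environment variables. A hedged illustration of an agent reading them (assuming standard os.environ access; the agent itself is hypothetical):

import os

def agent(prompt: str) -> str:
    """Reads the per-eval environment set in the env block."""
    api_key = os.environ.get("API_KEY", "")
    debug = os.environ.get("DEBUG") == "true"
    return f"prompt={prompt!r}, key_set={bool(api_key)}, debug={debug}"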
Best Practices
Use descriptive names
Names should clearly indicate what's being tested:

# Good
- name: search_returns_relevant_products
- name: handles_empty_search_results
- name: validates_timezone_input

# Avoid
- name: test1
- name: search_test
Combine validators effectively
Use multiple validators to thoroughly test behavior:

output:
  not_null!: true
  type!: "string"
  min_length!: 50
  contains!: "product"
  not_contains!: "error"
  semantic!: "Product recommendations with prices"
Validate both output and execution
Don't just check the output; validate how the agent got there:

output:
  semantic!: "Weather information for Madrid"
seq!:
  - llm
  - get_weather:
      input:
        city:
          eq!: "Madrid"
  - llm
Use semantic validators for natural language
For outputs that can vary in wording but should convey the same meaning:

# Instead of exact matching
output:
  eq!: "The current time in Madrid is 14:30."

# Use semantic validation
output:
  semantic!: "Indicates the current time in Madrid"
Test error conditions
Create specific evals for error conditions:

- name: handles_invalid_timezone
  runnable: agent.py::agent
  params:
    prompt: "what time is it in xyzland"
  output:
    semantic!: "Indicates the timezone or location is not recognized"
File Naming Conventions
Timbal discovers eval files matching these patterns:
eval*.yaml - e.g., eval_search.yaml, evals.yaml
*eval.yaml - e.g., search_eval.yaml, my_eval.yaml
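These discovery globs behave like standard shell-style patterns; a quick, illustrative way to sanity-check a filename (not Timbal's actual discovery code):

from fnmatch import fnmatch

def is_discovered(filename: str) -> bool:
    """True if the name matches either discovery pattern."""
    return fnmatch(filename, "eval*.yaml") or fnmatch(filename, "*eval.yaml")

print(is_discovered("eval_search.yaml"))  # True
print(is_discovered("search_eval.yaml"))  # True
print(is_discovered("config.yaml"))       # False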
Organize evals by feature or agent:
evals/
├── eval_search.yaml
├── eval_support.yaml
├── eval_checkout.yaml
└── regression/
├── eval_smoke.yaml
└── eval_full.yaml