This guide covers the complete syntax for writing evals, from basic tests to complex multi-validator scenarios.

File Structure

Evals are defined in YAML files. Each file can contain multiple eval definitions:
# eval_search.yaml
- name: basic_search
  description: Test basic product search
  runnable: agents/search.py::search_agent
  
  params:
    prompt: "Find me a laptop"
  
  output:
    contains!: "laptop"
    type!: "string"

- name: search_with_filters
  description: Test search with price filters
  runnable: agents/search.py::search_agent
  
  params:
    prompt: "Find laptops under $1000"
  
  output:
    pattern!: "\\$\\d+"

Eval Definition

Required Fields

- name: my_eval_name
  runnable: path/to/agent.py::agent_name
Eval names must be unique across all eval files in your project. The CLI will error if duplicate names are found.
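One convention that keeps names unique is prefixing them with the feature under test. A minimal sketch (the file names, eval names, and agent paths are illustrative):

# eval_search.yaml
- name: search_basic_query
  runnable: agents/search.py::search_agent

# eval_checkout.yaml
- name: checkout_basic_flow
  runnable: agents/checkout.py::checkout_agent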

Optional Fields

- name: complete_example
  description: "A thorough test of greeting behavior"
  runnable: agent.py::agent
  tags:
    - greeting
    - smoke-test
  timeout: 30000            # Milliseconds
  env:
    API_KEY: "test-key"
  
  params:
    prompt: "Hello"
  
  output:
    semantic!: "Friendly greeting"
  
  elapsed:
    lt!: 5000

Params Structure

The params field contains input parameters passed to your runnable:

Simple Prompt

params:
  prompt: "What's the weather like?"

With Messages

Use messages to establish conversation history for multi-turn testing:
params:
  messages:
    - role: user
      content: "What's the weather like in New York?"
    - role: assistant
      content: "It's currently 15°C and raining in New York."
    - role: user
      content: "Should I bring an umbrella?"
The agent receives the full conversation history and responds to the last message. This is useful for testing context retention, memory, and whether the agent avoids redundant tool calls when context is already available.
When using messages instead of prompt, the agent’s input key will be messages rather than prompt.
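For example, a context-retention eval can seed the history with a weather report and check that the follow-up answer builds on it rather than starting over. A minimal sketch (the eval name, agent path, and expected wording are illustrative):

- name: umbrella_followup
  description: "Answer a follow-up using the seeded conversation history"
  runnable: agent.py::agent
  
  params:
    messages:
      - role: user
        content: "What's the weather like in New York?"
      - role: assistant
        content: "It's currently 15°C and raining in New York."
      - role: user
        content: "Should I bring an umbrella?"
  
  output:
    semantic!: "Recommends bringing an umbrella because it is raining"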

Additional Parameters

Pass any additional parameters your agent accepts:
params:
  prompt: "Search for products"
  max_results: 10
  include_reviews: true
  category: "electronics"

Validating Output

The output section validates the final response from your agent:
output:
  not_null!: true
  type!: "string"
  min_length!: 10
  contains!: "success"
  not_contains!: "error"
  pattern!: "Order #\\d{6}"
  semantic!: "A confirmation message with order details"
Multiple validators can be combined; all of them must pass.

Validating Timing

The elapsed section validates total execution time in milliseconds:
elapsed:
  lt!: 5000       # Less than 5 seconds
  gte!: 100       # At least 100ms (not instant)

Validating Tool Spans

Validate specific tools by their name:
get_datetime:
  input:
    timezone:
      eq!: "Europe/Madrid"
      starts_with!: "Europe/"
  output:
    type!: "string"
    pattern!: "^\\d{4}-\\d{2}-\\d{2}"
  elapsed:
    lt!: 1000

search_products:
  input:
    query:
      contains!: "laptop"
    limit:
      lte!: 100
  output:
    type!: "array"
    min_length!: 1

Validating Token Usage

Access usage metrics on LLM spans:
llm:
  usage:
    input_tokens:
      lte!: 500
    output_tokens:
      lte!: 1000
The token field names depend on the model provider:
  • OpenAI: input_text_tokens and output_text_tokens (see the sketch below)
  • Anthropic: input_tokens and output_tokens (as in the example above)
For multi-model scenarios, tokens are automatically summed across models.
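With an OpenAI-backed agent, the same budget check might look like this (a sketch; the limits are illustrative):

llm:
  usage:
    input_text_tokens:
      lte!: 500
    output_text_tokens:
      lte!: 1000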

Flow Validators

Sequence Validation

Use seq! to validate the order of tool execution:
seq!:
  - llm
  - search_products
  - llm
With wildcards for flexible matching:
seq!:
  - llm
  - ...              # Any number of spans
  - send_email

Parallel Validation

Use parallel! to validate concurrent execution:
parallel!:
  - get_weather
  - get_datetime
  - get_stock_price

Nested Flow Validation

Combine sequence and parallel:
seq!:
  - llm
  - parallel!:
      - fetch_user
      - fetch_orders
  - llm

Span Validation Within Sequence

Validate span inputs/outputs within the sequence:
seq!:
  - llm:
      elapsed:
        lt!: 3000
      usage:
        input_tokens:
          lte!: 500
  
  - get_datetime:
      input:
        timezone:
          eq!: "Europe/Madrid"
      output:
        type!: "string"
  
  - llm

Complete Example

- name: time_query
  description: "Test time query with validation"
  runnable: "agent.py::agent"
  tags: ["datetime", "smoke"]
  timeout: 30000
  
  params:
    prompt: "what time is it in madrid"
  
  # Validate final output
  output:
    type!: "string"
    min_length!: 10
    contains!: ":"
    pattern!: "\\d{1,2}:\\d{2}"
    semantic!: "a time response mentioning Madrid"
  
  # Validate timing
  elapsed:
    lt!: 6000
  
  # Validate execution sequence
  seq!:
    - llm:
        elapsed:
          lte!: 4000
        usage:
          input_tokens:
            lte!: 1000
    
    - get_datetime:
        input:
          timezone:
            eq!: "Europe/Madrid"
            starts_with!: "Europe/"
        output:
          type!: "string"
          pattern!: "^\\d{4}-\\d{2}-\\d{2}"
    
    - llm

Wildcard Patterns in Sequences

Pattern   Description
..        Exactly 1 span
...       Any number of spans (0 or more)
n..m      Between n and m spans
n..       At least n spans
..m       At most m spans
seq!:
  - llm
  - 1..3              # 1 to 3 spans
  - validate_input
  - ...               # Any number
  - send_response
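The remaining forms compose the same way. A sketch using the exact, lower-bound, and upper-bound wildcards (the tool names are illustrative):

seq!:
  - ..                # Exactly 1 span, e.g. a single llm call
  - fetch_data
  - 2..               # At least 2 spans
  - llm
  - ..3               # At most 3 trailing spans
  - send_response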

Tags and Filtering

Use tags to organize and filter evals:
- name: smoke_test_greeting
  tags:
    - smoke
    - greeting
    - quick
  runnable: agent.py::agent
  params:
    prompt: "Hi"
  output:
    not_null!: true

Environment Variables

Set environment variables for specific evals:
- name: test_with_api_key
  runnable: agent.py::agent
  env:
    API_KEY: "test-key-123"
    DEBUG: "true"
  params:
    prompt: "Fetch data"
  output:
    not_null!: true

Best Practices

Names should clearly indicate what’s being tested:
# Good
- name: search_returns_relevant_products
- name: handles_empty_search_results
- name: validates_timezone_input

# Avoid
- name: test1
- name: search_test
Use multiple validators to thoroughly test behavior:
output:
  not_null!: true
  type!: "string"
  min_length!: 50
  contains!: "product"
  not_contains!: "error"
  semantic!: "Product recommendations with prices"
Don’t just check the output; validate how the agent got there:
output:
  semantic!: "Weather information for Madrid"

seq!:
  - llm
  - get_weather:
      input:
        city:
          eq!: "Madrid"
  - llm
For outputs that can vary in wording but should convey the same meaning, prefer semantic validation:
# Instead of exact matching
output:
  eq!: "The current time in Madrid is 14:30."

# Use semantic validation
output:
  semantic!: "Indicates the current time in Madrid"
Create specific evals for error conditions:
- name: handles_invalid_timezone
  runnable: agent.py::agent
  params:
    prompt: "what time is it in xyzland"
  output:
    semantic!: "Indicates the timezone or location is not recognized"

File Naming Conventions

Timbal discovers eval files matching these patterns:
  • eval*.yaml - e.g., eval_search.yaml, evals.yaml
  • *eval.yaml - e.g., search_eval.yaml, my_eval.yaml
Organize evals by feature or agent:
evals/
├── eval_search.yaml
├── eval_support.yaml
├── eval_checkout.yaml
└── regression/
    ├── eval_smoke.yaml
    └── eval_full.yaml