This guide covers the complete syntax for writing evals, from basic tests to complex multi-validator scenarios.

File Structure

Evals are defined in YAML files. Each file can contain multiple eval definitions:
# eval_search.yaml
- name: basic_search
  description: Test basic product search
  runnable: agents/search.py::search_agent
  
  params:
    prompt: "Find me a laptop"
  
  output:
    contains!: "laptop"
    type!: "string"

- name: search_with_filters
  description: Test search with price filters
  runnable: agents/search.py::search_agent
  
  params:
    prompt: "Find laptops under $1000"
  
  output:
    pattern!: "\\$\\d+"

Eval Definition

Required Fields

- name: my_eval_name
  runnable: path/to/agent.py::agent_name
Eval names must be unique across all eval files in your project. The CLI will error if duplicate names are found.
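For example, these two definitions collide even though they live in different files (the file names and runnables here are illustrative):
# eval_search.yaml
- name: basic_search
  runnable: agents/search.py::search_agent

# eval_checkout.yaml
- name: basic_search           # Duplicate name; the CLI rejects this
  runnable: agents/checkout.py::checkout_agent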

Optional Fields

- name: complete_example
  description: "A thorough test of greeting behavior"
  runnable: agent.py::agent
  tags:
    - greeting
    - smoke-test
  timeout: 30000            # Milliseconds
  env:
    API_KEY: "test-key"
  
  params:
    prompt: "Hello"
  
  output:
    semantic!: "Friendly greeting"
  
  elapsed:
    lt!: 5000

Params Structure

The params field contains input parameters passed to your runnable:

Simple Prompt

params:
  prompt: "What's the weather like?"

With Messages

params:
  messages:
    - role: user
      content: "What's the weather like?"

Additional Parameters

Pass any additional parameters your agent accepts:
params:
  prompt: "Search for products"
  max_results: 10
  include_reviews: true
  category: "electronics"

Validating Output

The output section validates the final response from your agent:
output:
  not_null!: true
  type!: "string"
  min_length!: 10
  contains!: "success"
  not_contains!: "error"
  pattern!: "Order #\\d{6}"
  semantic!: "A confirmation message with order details"
Multiple validators can be combined; all must pass.

Validating Timing

The elapsed section validates total execution time in milliseconds:
elapsed:
  lt!: 5000       # Less than 5 seconds
  gte!: 100       # At least 100ms (not instant)

Validating Tool Spans

Validate individual tool calls by keying a block on the tool's name. Each block can check the tool's input, output, and elapsed time:
get_datetime:
  input:
    timezone:
      eq!: "Europe/Madrid"
      starts_with!: "Europe/"
  output:
    type!: "string"
    pattern!: "^\\d{4}-\\d{2}-\\d{2}"
  elapsed:
    lt!: 1000

search_products:
  input:
    query:
      contains!: "laptop"
    limit:
      lte!: 100
  output:
    type!: "array"
    min_length!: 1
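The snippets above read as top-level keys of an eval definition, alongside output and elapsed. A minimal sketch assuming that placement (the eval name, prompt, and bounds are illustrative):
- name: datetime_tool_check
  runnable: agent.py::agent
  params:
    prompt: "what time is it in madrid"
  output:
    not_null!: true
  get_datetime:
    input:
      timezone:
        eq!: "Europe/Madrid"
    elapsed:
      lt!: 1000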

Validating Token Usage

Access usage metrics on LLM spans:
llm:
  usage:
    input_tokens:
      lte!: 500
    output_tokens:
      lte!: 1000
For multi-model scenarios, tokens are automatically summed across models.
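As an illustration of the summing (token counts hypothetical): a run that calls two models consuming 1200 and 800 input tokens is validated against the 2000-token sum:
llm:
  usage:
    input_tokens:
      lte!: 2500    # Passes: 1200 + 800 = 2000 <= 2500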

Flow Validators

Sequence Validation

Use seq! to validate the order of tool execution:
seq!:
  - llm
  - search_products
  - llm
Use wildcards for flexible matching:
seq!:
  - llm
  - ...              # Any number of spans
  - send_email

Parallel Validation

Use parallel! to validate concurrent execution:
parallel!:
  - get_weather
  - get_datetime
  - get_stock_price

Nested Flow Validation

Combine sequence and parallel:
seq!:
  - llm
  - parallel!:
      - fetch_user
      - fetch_orders
  - llm

Span Validation Within Sequence

Validate span inputs/outputs within the sequence:
seq!:
  - llm:
      elapsed:
        lt!: 3000
      usage:
        input_tokens:
          lte!: 500
  
  - get_datetime:
      input:
        timezone:
          eq!: "Europe/Madrid"
      output:
        type!: "string"
  
  - llm

Complete Example

- name: time_query
  description: "Test time query with validation"
  runnable: "agent.py::agent"
  tags: ["datetime", "smoke"]
  timeout: 30000
  
  params:
    prompt: "what time is it in madrid"
  
  # Validate final output
  output:
    type!: "string"
    min_length!: 10
    contains!: ":"
    pattern!: "\\d{1,2}:\\d{2}"
    semantic!: "a time response mentioning Madrid"
  
  # Validate timing
  elapsed:
    lt!: 6000
  
  # Validate execution sequence
  seq!:
    - llm:
        elapsed:
          lte!: 4000
        usage:
          input_tokens:
            lte!: 1000
    
    - get_datetime:
        input:
          timezone:
            eq!: "Europe/Madrid"
            starts_with!: "Europe/"
        output:
          type!: "string"
          pattern!: "^\\d{4}-\\d{2}-\\d{2}"
    
    - llm

Wildcard Patterns in Sequences

Pattern    Description
..         Exactly 1 span
...        Any number of spans (0 or more)
n..m       Between n and m spans
n..        At least n spans
..m        At most m spans
seq!:
  - llm
  - 1..3              # 1 to 3 spans
  - validate_input
  - ...               # Any number
  - send_response

Tags and Filtering

Use tags to organize and filter evals:
- name: smoke_test_greeting
  tags:
    - smoke
    - greeting
    - quick
  runnable: agent.py::agent
  params:
    prompt: "Hi"
  output:
    not_null!: true

Environment Variables

Set environment variables for specific evals:
- name: test_with_api_key
  runnable: agent.py::agent
  env:
    API_KEY: "test-key-123"
    DEBUG: "true"
  params:
    prompt: "Fetch data"
  output:
    not_null!: true

Best Practices

Names should clearly indicate what’s being tested:
# Good
- name: search_returns_relevant_products
- name: handles_empty_search_results
- name: validates_timezone_input

# Avoid
- name: test1
- name: search_test
Use multiple validators to thoroughly test behavior:
output:
  not_null!: true
  type!: "string"
  min_length!: 50
  contains!: "product"
  not_contains!: "error"
  semantic!: "Product recommendations with prices"
Don’t just check the output; validate how the agent got there:
output:
  semantic!: "Weather information for Madrid"

seq!:
  - llm
  - get_weather:
      input:
        city:
          eq!: "Madrid"
  - llm
For outputs that can vary in wording but should convey the same meaning, prefer semantic validation over exact matching:
# Instead of exact matching
output:
  eq!: "The current time in Madrid is 14:30."

# Use semantic validation
output:
  semantic!: "Indicates the current time in Madrid"
Create specific evals for error conditions:
- name: handles_invalid_timezone
  runnable: agent.py::agent
  params:
    prompt: "what time is it in xyzland"
  output:
    semantic!: "Indicates the timezone or location is not recognized"

File Naming Conventions

Timbal discovers eval files matching these patterns:
  • eval*.yaml (e.g., eval_search.yaml, evals.yaml)
  • *eval.yaml (e.g., search_eval.yaml, my_eval.yaml)
Organize evals by feature or agent:
evals/
├── eval_search.yaml
├── eval_support.yaml
├── eval_checkout.yaml
└── regression/
    ├── eval_smoke.yaml
    └── eval_full.yaml