Evals are automated tests that validate your agent’s behavior, outputs, and execution patterns. They help you ensure your agents perform correctly and consistently across different scenarios.

Why Evals Matter

AI agents are non-deterministic: the same input can yield different results. Evals help you:
  • Validate outputs: Ensure agents produce correct responses
  • Check tool usage: Verify agents use the right tools with correct inputs
  • Monitor performance: Track execution time and token usage
  • Catch regressions: Prevent breaking changes during development
  • Test execution patterns: Validate sequential and parallel tool execution

How Evals Work

Timbal evals are defined in YAML and use a validator syntax to assert on outputs, timing, and tool usage. The example below checks that the agent's answer contains a time (e.g. 14:30), that execution stays under the elapsed limit, and that the agent calls get_datetime between two llm steps:
- name: time_in_madrid
  description: Test that agent returns the time in Madrid
  runnable: agent.py::agent
  tags: ["datetime", "smoke"]
  timeout: 30000
  
  params:
    prompt: "what time is it in madrid"
  
  output:
    type!: "string"
    contains!: ":"
    pattern!: "\\d{1,2}:\\d{2}"
  
  elapsed:
    lt!: 6000
  
  seq!:
    - llm
    - get_datetime
    - llm

Quick Start

1. Create a test file

Create a file named eval_greeting.yaml:
- name: greeting_test
  description: Verify the agent greets users appropriately
  runnable: agent.py::my_agent
  tags: ["greeting", "basic"]
  
  params:
    prompt: "Hi there!"
  
  output:
    not_null!: true
    type!: "string"
    semantic!: "A polite greeting that acknowledges the user"
  
  elapsed:
    lt!: 5000
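
The semantic! validator judges meaning rather than exact wording, which suits an open-ended greeting. If you want a stricter, pattern-based check, the pattern! validator from the earlier example also works here; the regex below is purely illustrative:

  output:
    not_null!: true
    type!: "string"
    pattern!: "(?i)\\b(hi|hello|hey)\\b"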

2. Run your evals

python -m timbal.evals.cli path/to/eval_greeting.yaml

3. View results

The CLI displays pytest-style output with pass/fail status, duration, and detailed failure information:
========================= timbal evals =========================
collected 1 eval

eval_greeting.yaml
  greeting_test ......................................... PASSED (0.45s)
    tags: greeting, basic
    ├── output
    │   ├── not_null! ✓
    │   ├── type! ✓
    │   └── semantic! ✓
    └── elapsed
        └── lt! ✓

========================= 1 passed in 0.45s =========================

Eval Structure

Each eval consists of:
Field          Description                                Required
name           Unique identifier for the eval             Yes
runnable       Path to the runnable (file.py::name)       Yes
params         Input parameters for the runnable          No
description    Human-readable description                 No
tags           List of tags for filtering                 No
timeout        Maximum execution time in milliseconds     No
output         Validators for the final output            No
elapsed        Validators for total execution time        No
seq!           Sequence flow validator                    No

Eval names must be unique across all eval files. The CLI will error if duplicate names are found.
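
Only name and runnable are required, so a minimal eval can be as short as the sketch below (the name and runnable path are placeholders); with no validators declared, it would presumably only confirm that the runnable executes without raising an error:

- name: smoke_test
  runnable: agent.py::my_agent
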
See Writing Evals for the complete syntax reference.

Next Steps