Timbal provides a command-line interface for discovering and running evals. This guide covers all CLI options and integration patterns.

Basic Usage

Run evals from the command line:
# Run all evals in a directory
python -m timbal.evals.cli path/to/evals/

# Run a specific eval file
python -m timbal.evals.cli path/to/eval_search.yaml

# Run a single eval by name (pytest-style)
python -m timbal.evals.cli path/to/eval_search.yaml::my_eval_name

# Run with a specific runnable
python -m timbal.evals.cli --runnable agent.py::my_agent path/to/evals/

CLI Options

Option              Description
path                Path to eval file or directory (positional)
--runnable          Fully qualified name of the runnable (file.py::name)
--log-level         Logging level (DEBUG, INFO, WARNING, ERROR)
-s, --no-capture    Disable stdout/stderr capture (show output live)
-t, --tags          Filter evals by tags (comma-separated)
-V, --version       Show version information

Runnable Specification

The runnable can be specified in two ways:

1. Via CLI flag:
python -m timbal.evals.cli --runnable agents/search.py::search_agent evals/

2. Per-eval in YAML:
- name: test_search
  runnable: agents/search.py::search_agent
  params:
    prompt: "Find products"
  output:
    not_null!: true
Per-eval runnable overrides the CLI flag.
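For example, assuming a hypothetical default agent at agents/main.py::default_agent, the command below applies it to every eval that does not declare its own runnable, while test_search above still runs against agents/search.py::search_agent:
# Hypothetical default runnable; evals with an explicit runnable keep theirs
python -m timbal.evals.cli --runnable agents/main.py::default_agent evals/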

Tag Filtering

Filter evals by tags using -t or --tags:
# Run only evals with 'smoke' tag
python -m timbal.evals.cli -t smoke evals/

# Run evals matching ANY of the specified tags
python -m timbal.evals.cli -t smoke,fast evals/

# Combine with other options
python -m timbal.evals.cli --tags regression -s evals/
Evals matching any of the specified tags will run.
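Tags are declared on each eval in the YAML file (see also the tagged example under Best Practices). A minimal illustration, with hypothetical names:
- name: quick_check
  runnable: agents/search.py::search_agent
  tags:
    - smoke
  params:
    prompt: "Hi"
  output:
    not_null!: true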

Output Capture

By default, stdout/stderr from your agent are captured and shown only on failure. Use -s to see output live:
# Captured (default) - cleaner output
python -m timbal.evals.cli evals/

# Live output - useful for debugging
python -m timbal.evals.cli -s evals/

File Discovery

Naming Patterns

Timbal discovers eval files matching these patterns:
  • eval*.yaml - e.g., eval_search.yaml, evals.yaml
  • *eval.yaml - e.g., search_eval.yaml, my_eval.yaml
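As a rough sketch (not Timbal's actual implementation), discovery is equivalent to collecting YAML files whose names either start or end with "eval":
from pathlib import Path

# Illustrative sketch of the naming rule above, not Timbal's real discovery code.
def is_eval_file(path: Path) -> bool:
    return path.suffix == ".yaml" and (
        path.name.startswith("eval") or path.stem.endswith("eval")
    )

def discover(root: Path) -> list[Path]:
    # Recursively collect matching files under the given directory
    return sorted(p for p in root.rglob("*.yaml") if is_eval_file(p))

print(discover(Path("evals/")))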

Directory Structure

project/
├── agents/
│   ├── search.py
│   └── support.py
└── evals/
    ├── eval_search.yaml
    ├── eval_support.yaml
    └── regression/
        ├── eval_smoke.yaml
        └── eval_full.yaml
Run all evals:
python -m timbal.evals.cli evals/
Run specific subset:
python -m timbal.evals.cli evals/regression/

Configuration File

Create an evalconf.yaml in your project root for shared configuration:
# evalconf.yaml
runnable: agents/main.py::agent
log_level: INFO
Timbal walks up the directory tree looking for evalconf.yaml.
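The exact lookup is internal to Timbal, but conceptually it resembles the sketch below: check the starting directory (assumed here to be the working directory), then each parent, and stop at the first evalconf.yaml found.
from pathlib import Path

# Conceptual sketch of the upward search for evalconf.yaml; not Timbal's actual code.
def find_evalconf(start: Path = Path.cwd()) -> Path | None:
    for directory in (start, *start.parents):
        candidate = directory / "evalconf.yaml"
        if candidate.is_file():
            return candidate
    return None  # no shared configuration found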

Output Format

Successful Run

========================= timbal evals =========================
collected 5 evals

eval_search.yaml
  basic_search .......................................... PASSED (0.45s)
    tags: search, smoke
    ├── output
    │   ├── not_null! ✓
    │   └── contains! ✓
    └── elapsed
        └── lt! ✓

  advanced_search ....................................... PASSED (0.82s)
    tags: search, regression
    ├── output
    │   ├── semantic! ✓
    │   └── min_length! ✓
    └── seq!
        ├── llm ✓
        ├── search_products ✓
        └── llm ✓

========================= 5 passed in 2.34s =========================

Failed Run

========================= timbal evals =========================
collected 3 evals

eval_validation.yaml
  input_validation ...................................... FAILED (0.23s)
    tags: validation
    ├── output
    │   ├── not_null! ✓
    │   └── contains! ✗

========================= FAILURES =========================

eval_validation.yaml::input_validation
-----------------------------------------
Captured stdout:
  Processing input...
  
Captured stderr:
  Warning: deprecated function used

Failed validators:
  - output -> contains!
    Expected: "validated"
    Actual: "The input was processed successfully"
    Error: Value does not contain expected substring

========================= 1 failed, 2 passed in 1.56s =========================

Exit Codes

Code    Meaning
0       All evals passed
1       One or more evals failed
2       Configuration or setup error
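These codes make it straightforward to gate scripts or CI steps on the result; a fuller branching example appears under Best Practices.
# Propagate the CLI's exit code so the surrounding script or CI job fails too
python -m timbal.evals.cli evals/ || exit $?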

CI/CD Integration

GitHub Actions

# .github/workflows/evals.yaml
name: Run Evals

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: |
          pip install -e .
          pip install timbal
      
      - name: Run evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m timbal.evals.cli --runnable agent.py::agent evals/

GitLab CI

# .gitlab-ci.yml
evals:
  stage: test
  image: python:3.11
  script:
    - pip install -e .
    - pip install timbal
    - python -m timbal.evals.cli --runnable agent.py::agent evals/
  variables:
    OPENAI_API_KEY: $OPENAI_API_KEY

Pre-commit Hook

#!/bin/bash
# .git/hooks/pre-commit

echo "Running evals..."
python -m timbal.evals.cli evals/smoke/

if [ $? -ne 0 ]; then
    echo "Evals failed. Commit aborted."
    exit 1
fi
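Git only runs hooks that are executable, so make the script executable after creating it:
chmod +x .git/hooks/pre-commit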

Debugging

Verbose Output

# Show debug logging
python -m timbal.evals.cli --log-level DEBUG evals/

# Show live output
python -m timbal.evals.cli -s evals/

Run Single Eval

Test a specific eval during development:
# Run one file
python -m timbal.evals.cli evals/eval_search.yaml

# Run a specific eval by name (pytest-style syntax)
python -m timbal.evals.cli evals/eval_search.yaml::basic_search

Environment Variables

Set environment variables for testing:
# Via shell
export API_KEY="test-key"
python -m timbal.evals.cli evals/

# Or in eval file
- name: test_with_key
  runnable: agent.py::agent
  env:
    API_KEY: "test-key"
  params:
    prompt: "Fetch data"
  output:
    not_null!: true

Best Practices

Separate fast smoke tests from slow regression tests:
evals/
├── smoke/           # Fast tests, run on every commit
│   └── eval_basic.yaml
└── regression/      # Comprehensive tests, run nightly
    └── eval_full.yaml
# Quick check
python -m timbal.evals.cli evals/smoke/

# Full suite
python -m timbal.evals.cli evals/
Check exit codes in scripts:
python -m timbal.evals.cli evals/
status=$?

if [ $status -eq 0 ]; then
    echo "All evals passed!"
elif [ $status -eq 1 ]; then
    echo "Some evals failed"
    exit 1
else
    echo "Configuration error"
    exit 2
fi
Prevent hanging tests with timeouts (in milliseconds):
- name: slow_test
  runnable: agent.py::agent
  timeout: 60000  # 60 second timeout
  params:
    prompt: "Complex query"
  output:
    not_null!: true
Organize evals with tags:
- name: smoke_test
  runnable: agent.py::agent
  tags:
    - smoke
    - quick
  params:
    prompt: "Hi"
  output:
    not_null!: true