Automated Testing for Timbal Agents
Measure, test, and improve your agents with automated LLM-powered validation and comprehensive reporting.
AI outputs are non-deterministic — the same input can yield different results. Evals in Timbal help you measure, test, and improve your agents and flows with automated, LLM-powered checks.
What Are Evals in Timbal?
Evals are automated tests that assess your agent's outputs and tool usage. They use LLMs to compare your agent's process (the tools it used and how) and its result (the answer it gave) to the expected process and result. Evals can be run locally, in the cloud, or as part of your CI/CD pipeline.
Types of Evals
- Output Evals: Did the agent produce the correct answer, with proper content and formatting?
- Steps Evals: Did the agent use the right tools, in the right order, with the right inputs?
- Usage Evals: Did the agent's resource consumption (tokens, API calls) stay within expected bounds?
How Evals Work
Core Concepts
Test Suite
A collection of test cases defined in YAML format. A test suite can contain multiple tests that validate different aspects of your agent's behavior. Timbal automatically discovers all files matching the pattern `eval*.yaml` in your test directory.
Test
A single test case with multiple turns. Each test focuses on a specific scenario or functionality of your agent. Tests are defined within YAML files and can be run individually or as part of a suite.
Turn
One interaction between user and agent. A test can contain multiple turns to validate multi-step conversations. Each turn consists of:
- Input: What the user says or asks (text and optionally files) - required
- Output: What the agent should respond (validated against expected content) - optional
- Steps: Tools the agent should use (validated against expected tool calls) - optional
- Usage: Resource consumption limits (token counts) - optional
Validators
Programmatic checks that compare actual vs expected behavior. Validators use both exact matching and LLM-powered semantic evaluation to assess correctness.
Test Structure
Evals are defined in YAML files with the following structure:
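For orientation, here is a minimal sketch of a test file assembled from the fields reference below. Treat it as illustrative only: the top-level list form, the concrete values, and the usage limit key are assumptions rather than the definitive schema.

```yaml
# eval_greeting.yaml -- illustrative sketch; see the fields reference below
- name: greeting_test
  description: The agent greets the user politely without calling tools
  turns:
    - input:
        text: "Hello, who are you?"
      output:
        text: "Hi! I'm your assistant."      # stored in conversation memory
      validators:                            # validation rules for the output
        contains: ["assistant"]
        not_contains: ["error"]
        semantic: "Should greet the user politely"
      steps:                                 # expected tool usage
        validators:
          not_contains: [{"name": "delete_file"}]
      usage:
        max_output_tokens: 200               # limit key name is an assumption
```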
Fields Reference
Test Level

- `name`: Unique identifier for the test (required)
- `description`: Human-readable description of what the test validates (optional)
- `turns`: Array of user-agent interactions to test (required)

Turn Level

- `input`: The user's message or query (required)
  - `text`: Text content of the message
  - `files`: Array of file paths to attach (optional)

  Note: You can use shorthand syntax: `input: "Your message"` is equivalent to `input: { text: "Your message" }`.

- `output`: Expected agent response (optional)
  - `text`: Response text to store in conversation memory (optional)

  Note: You can use shorthand syntax: `output: "Response text"` is equivalent to `output: { text: "Response text" }`.

- `validators`: Validation rules for the output (optional)
- `steps`: Expected tool usage (optional)
  - `validators`: Validation rules for tool calls (optional)
- `usage`: Resource consumption limits (optional)
Validators
Timbal provides several types of validators for comprehensive testing:
Output Validators
| Validator | Description | Example Usage |
|---|---|---|
| `contains` | Checks if output includes specified substrings | `contains: ["hello", "world"]` |
| `not_contains` | Checks if output does NOT include specified substrings | `not_contains: ["error", "failed"]` |
| `regex` | Checks if output matches a regular expression pattern | `regex: "^Success: .+"` |
| `semantic` | Uses LLM to validate semantic correctness against reference | `semantic: "Should greet user politely"` |
Steps Validators
| Validator | Description | Example Usage |
|---|---|---|
| `contains` | Checks if steps include specified tool calls with inputs | `contains: [{"name": "search", "input": {"query": "test"}}]` |
| `not_contains` | Checks if steps do NOT include specified tool calls | `not_contains: [{"name": "delete_file"}]` |
| `semantic` | Uses LLM to validate tool usage against expected behavior | `semantic: "Should search before providing answer"` |
Usage Validators
Usage validators monitor resource consumption:
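As a sketch, a usage block might cap token counts for a turn. The key names below are assumptions, since the real keys depend on the usage metrics your agent reports:

```yaml
usage:
  # Hypothetical limit keys -- the actual names depend on the usage
  # metrics your agent reports (e.g. provider token counts).
  max_input_tokens: 2000
  max_output_tokens: 500
```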
Running Evals
Command Line Interface
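A typical invocation might look like the sketch below. The `timbal eval` subcommand name is an assumption; the `--fqn` and `--tests` options are described under Command Options.

```bash
# Run all eval*.yaml files in ./evals against the agent defined in agent.py
# ("timbal eval" is an assumed entry point; adjust to your installed CLI)
timbal eval --fqn agent.py::agent --tests evals/

# Run a single named test from a specific file
timbal eval --fqn agent.py::agent --tests evals/eval_search.yaml::product_search
```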
Command Options

- `--fqn`: Fully qualified name of your agent (format: `file.py::agent_name`)
- `--tests`: Path to a test file, directory, or specific test (format: `path/file.yaml::test_name`)
Examples
Example 1: Product Search with Complete Validation
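The YAML for this scenario is not reproduced here; the sketch below reconstructs it from the walkthrough that follows. The tool's input field name, the expected measurement values, and the usage limit key are assumptions.

```yaml
- name: product_search_measurements
  description: Retrieves the measurements of product H214E/1 via the search tool
  turns:
    - input: "What are the measurements of product H214E/1?"
      steps:
        validators:
          contains: [{"name": "search_by_reference", "input": {"reference": "H214E/1"}}]
          semantic: "Should search by the product reference before answering"
      validators:
        contains: ["H214E/1"]                 # plus the expected measurement values
        not_contains: ["error", "failed"]
        semantic: "Should provide comprehensive product information, including measurements"
      usage:
        max_output_tokens: 500                # limit key name is an assumption
```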
How this test works:
- Input: User asks for measurements of product H214E/1
- Steps Validation:
  - `contains`: Verifies the `search_by_reference` tool was called with the correct parameters
  - `semantic`: Uses an LLM to verify the search logic was appropriate
- Output Validation:
  - `contains`: Checks for specific measurement values and the product code
  - `not_contains`: Ensures no error messages appear
  - `semantic`: Validates that the response provides comprehensive product information
- Usage Validation: Monitors token consumption within specified limits
Example 2: Multi-turn Conversation with Memory
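Again, this is a reconstruction rather than the original file; the order number and exact wording are placeholders.

```yaml
- name: memory_recall
  description: The agent recalls details given earlier in the same conversation
  turns:
    # Turn 1: establish context only -- no validators needed
    - input: "My order number is 48213, please keep it handy."
      output: "Got it, your order number is 48213."
    # Turn 2: test recall from conversation history
    - input: "What was my order number again?"
      validators:
        contains: ["48213"]
        semantic: "Should recall the order number from the conversation history, not from a tool"
```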
How multi-turn tests work:
- Turn 1: Establishes context (input and output only) by providing information to remember - no validators needed
- Turn 2: Tests memory by asking for previously provided information, using `contains` validators to verify the behavior
- Memory validation: Ensures agent retrieves information from conversation history rather than external sources
Example 3: File Processing with Error Handling
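No walkthrough accompanies this example in the current text, so the sketch below only illustrates how the `files` field and error-focused validators could combine; the file path, expected amount format, and messages are hypothetical.

```yaml
- name: invoice_file_processing
  description: Extracts the total from an attached invoice and degrades gracefully on bad files
  turns:
    - input:
        text: "Extract the total amount from this invoice."
        files: ["fixtures/invoice_2024_001.pdf"]   # hypothetical path
      validators:
        regex: '\d+[.,]\d{2}'                      # expects a monetary amount in the reply
        not_contains: ["Traceback", "internal error"]
        semantic: "Should report the invoice total, or clearly explain why the file could not be read"
```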
Understanding Eval Results
Timbal generates comprehensive evaluation results in JSON format:
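The exact result schema is not shown on this page; the JSON below is only a sketch of the kind of information described in the analysis list that follows, with all field names assumed.

```json
{
  "summary": {
    "output_validators": { "passed": 11, "failed": 1 },
    "steps_validators": { "passed": 6, "failed": 0 },
    "usage_validators": { "passed": 4, "failed": 0 },
    "execution_errors": 0
  },
  "failures": [
    {
      "test": "product_search_measurements",
      "turn": 0,
      "validator": "contains",
      "expected": ["H214E/1"],
      "actual": "I could not find that product.",
      "explanation": "Output does not include the product code H214E/1."
    }
  ],
  "usage": { "input_tokens": 3420, "output_tokens": 910 }
}
```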
Result Analysis
- Summary metrics: Total counts of passed/failed validations across all test types
- Detailed failures: Complete information about each failed test, including:
  - Actual vs Expected: What the agent actually did vs what was expected
  - Explanations: Detailed reasons for failures from each validator
  - Execution errors: Runtime errors during test execution
- Usage monitoring: Resource consumption tracking for cost and performance optimization
Important: The eval system automatically overrides your agent's state saver configuration. During evaluation, your agent will use `JSONLSaver` regardless of its original configuration.
The eval system automatically sets: `agent.state_saver = JSONLSaver(path=Path("state.jsonl"))`
Summary
Timbal's evaluation framework provides:
- Comprehensive Testing: Validate outputs, tool usage, and resource consumption
- Flexible Validation: From exact string matching to semantic LLM-powered checks
- Multi-turn Support: Test complex conversational flows and memory retention
- Detailed Reporting: Rich failure analysis for debugging and improvement
- CI/CD Integration: Automated testing to prevent regressions
This evaluation system helps you build reliable, testable AI agents that consistently produce correct results and follow expected processes, giving you confidence in your agent's behavior across different scenarios and edge cases.