Multi-turn Conversation Testing

Multi-turn conversation testing ensures your agent maintains context across multiple interactions. Only the last turn is validated; previous turns establish conversation history and memory.

Weather Assistant Agent

agent.py

from timbal import Agent

def get_weather(location: str) -> str:
    """Get current weather information for a specific location."""
    weather_data = {
        "New York": "15°C and raining",
        "London": "12°C and cloudy",
        "Tokyo": "22°C and sunny"
    }
    return weather_data.get(location, f"Weather data not available for {location}")

agent = Agent(
    name="weather_agent",
    model="openai/gpt-4.1",
    system_prompt="You are a helpful weather assistant.",
    tools=[get_weather],    
)

Running Evaluations

python -m timbal.eval --fqn agent.py::agent --tests evals.yaml

Validation Configuration

evals.yaml

- name: eval_weather_advice
  description: Test agent uses weather data to provide appropriate advice
  turns:
    - input: "What's the weather like in New York?"
      output: "It's currently 15°C and raining in New York."
    - input: "Should I bring an umbrella?"
      steps:
        validators:
          not_contains:
            - name: get_weather
      output:
        content:
          validators:
            contains:
              - "yes"

How It Works

First Turn (Context): The first turn has no validators. It establishes conversation history by providing the user input and expected output. The agent receives this information but it’s not validated.
Second Turn (Validation): The last turn has validators for steps and output.
Memory: The agent receives the entire conversation history, so it can remember what was said in previous turns and answer accordingly.

Evaluation Results

Successful Validation

When the agent remembers context and provides the correct response:

summary.json

{
  "total_files": 1,
  "total_tests": 1,
  "total_validations": 2,
  "inputs_passed": 0,
  "inputs_failed": 0,
  "outputs_passed": 1,
  "outputs_failed": 0,
  "steps_passed": 1,
  "steps_failed": 0,
  "execution_errors": 0,
  "tests_failed": []
}

Failed Validation

When the agent doesn’t use context or provides an incorrect response:

summary.json

{
  "total_files": 1,
  "total_tests": 1,
  "total_validations": 2,
  "inputs_passed": 0,
  "inputs_failed": 0,
  "outputs_passed": 0,
  "outputs_failed": 1,
  "steps_passed": 1,
  "steps_failed": 0,
  "execution_errors": 0,
  "tests_failed": [
    {
      "test_name": "eval_weather_advice",
      "test_path": "evals.yaml::eval_weather_advice",
      "input": {
        "prompt": [
          "Should I bring an umbrella?"
        ]
      },
      "reason": [
        "output"
      ],
      "execution_error": null,
      "input_passed": null,
      "input_explanations": [],
      "output_passed": false,
      "output_explanations": [
        "Validator contains: Message does not contain 'yes'."
      ],
      "actual_output": {
        "text": "I can help you decide! Could you please tell me your location so I can check the weather report for you?",
        "files": []
      },
      "expected_output": {
        "content": {
          "validators": {
            "contains": [
              "yes"
            ]
          }
        }
      },
      "steps_passed": true,
      "steps_explanations": [],
      "actual_steps": [],
      "expected_steps": {
        "not_contains": [
          {
            "name": "get_weather"
          }
        ]
      }
    }
  ]
}

Key Features

Context Memory: Previous turns establish conversation history that agents can access
Last Turn Validation: Only the final turn is validated; earlier turns provide context
Tool Usage Control: Verify agents don’t call tools unnecessarily when context is available
Response Continuity: Ensure agents build logically on previous interactions

Guides

Agents

Workflows

Multi-turn Conversation Testing

Weather Assistant Agent

Running Evaluations

Validation Configuration

How It Works

Evaluation Results

Successful Validation

Failed Validation

Key Features

Guides

Agents

Workflows

​Weather Assistant Agent

​Running Evaluations

​Validation Configuration

​How It Works

​Evaluation Results

​Successful Validation

​Failed Validation

​Key Features

Weather Assistant Agent

Running Evaluations

Validation Configuration

How It Works

Evaluation Results

Successful Validation

Failed Validation

Key Features