Multi-turn conversation testing verifies that your agent maintains context across multiple interactions. Only the last turn is validated; the earlier turns establish conversation history and memory.

Weather Assistant Agent

agent.py
from timbal import Agent

def get_weather(location: str) -> str:
    """Get current weather information for a specific location."""
    weather_data = {
        "New York": "15°C and raining",
        "London": "12°C and cloudy",
        "Tokyo": "22°C and sunny"
    }
    return weather_data.get(location, f"Weather data not available for {location}")

agent = Agent(
    name="weather_agent",
    model="openai/gpt-4.1",
    system_prompt="You are a helpful weather assistant.",
    tools=[get_weather],
)

Running Evaluations

python -m timbal.eval --fqn agent.py::agent --tests evals.yaml

Validation Configuration

evals.yaml
- name: eval_weather_advice
  description: Test agent uses weather data to provide appropriate advice
  turns:
    - input: "What's the weather like in New York?"
      output: "It's currently 15°C and raining in New York."
    - input: "Should I bring an umbrella?"
      steps:
        validators:
          not_contains:
            - name: get_weather
      output:
        content:
          validators:
            contains:
              - "yes"

How It Works

  1. First Turn (Context): The first turn has no validators. It establishes conversation history by providing the user input and expected output. The agent receives this information but it’s not validated.
  2. Second Turn (Validation): The last turn carries the validators: the steps validator checks that the agent does not call get_weather again (the data is already in the conversation history), and the output validator checks that the response contains "yes". Validator values are lists, so a single turn can assert several items at once; see the sketch after this list.
  3. Memory: The agent receives the entire conversation history, so it can remember what was said in previous turns and answer accordingly.
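
Because validator values are lists, the final turn above could assert more than one substring; a minimal sketch (assuming, as the per-string failure message below suggests, that each listed string is checked):

    - input: "Should I bring an umbrella?"
      output:
        content:
          validators:
            contains:
              - "yes"
              - "umbrella"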

Evaluation Results

Successful Validation

When the agent remembers the context and responds correctly, both validations (the steps check and the output check) pass:
summary.json
{
  "total_files": 1,
  "total_tests": 1,
  "total_validations": 2,
  "inputs_passed": 0,
  "inputs_failed": 0,
  "outputs_passed": 1,
  "outputs_failed": 0,
  "steps_passed": 1,
  "steps_failed": 0,
  "execution_errors": 0,
  "tests_failed": []
}

Failed Validation

When the agent doesn’t use the context or responds incorrectly, the failing validators are detailed in tests_failed:
summary.json
{
  "total_files": 1,
  "total_tests": 1,
  "total_validations": 2,
  "inputs_passed": 0,
  "inputs_failed": 0,
  "outputs_passed": 0,
  "outputs_failed": 1,
  "steps_passed": 1,
  "steps_failed": 0,
  "execution_errors": 0,
  "tests_failed": [
    {
      "test_name": "eval_weather_advice",
      "test_path": "evals.yaml::eval_weather_advice",
      "input": {
        "prompt": [
          "Should I bring an umbrella?"
        ]
      },
      "reason": [
        "output"
      ],
      "execution_error": null,
      "input_passed": null,
      "input_explanations": [],
      "output_passed": false,
      "output_explanations": [
        "Validator contains: Message does not contain 'yes'."
      ],
      "actual_output": {
        "text": "I can help you decide! Could you please tell me your location so I can check the weather report for you?",
        "files": []
      },
      "expected_output": {
        "content": {
          "validators": {
            "contains": [
              "yes"
            ]
          }
        }
      },
      "steps_passed": true,
      "steps_explanations": [],
      "actual_steps": [],
      "expected_steps": {
        "not_contains": [
          {
            "name": "get_weather"
          }
        ]
      }
    }
  ]
}

Key Features

  • Context Memory: Previous turns establish conversation history that agents can access
  • Last Turn Validation: Only the final turn is validated; earlier turns provide context
  • Tool Usage Control: Verify agents don’t call tools unnecessarily when context is available (the inverse check is sketched below)
  • Response Continuity: Ensure agents build logically on previous interactions
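
For the inverse of the not_contains check above, a steps validator could assert that a tool is called. Note that the contains form for steps is an assumption here, mirroring the output validator syntax; this page only demonstrates not_contains:

    - input: "What's the weather like in London?"
      steps:
        validators:
          contains:
            - name: get_weather
      output:
        content:
          validators:
            contains:
              - "12°C"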