Transcribe audio files in a pre_hook before processing.
This example uses OpenAI’s transcription API, but you can use any transcription service (ElevenLabs, Google Cloud, or custom implementations).
The [Audio] prefix marks text that was transcribed from an audio file, so the agent knows where the content came from. This is useful for:
  • Context Awareness: The agent knows the text came from audio transcription
  • Mixed Content: When combining text and audio in the same prompt, the prefix distinguishes transcribed content
  • Traceability: Makes it easier to track which parts of the conversation came from audio vs. text input
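To make the mixed-content case concrete, here is a minimal sketch in plain Python, independent of Timbal. `AudioPart` and `render_prompt` are hypothetical names introduced only for illustration; they are not part of the Timbal API. The sketch assumes transcription has already happened and only shows how the prefix separates typed text from transcribed text.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class AudioPart:
    """Hypothetical container for text produced by a transcription step."""
    transcript: str


# A prompt part is either typed text or an already-transcribed audio clip.
Part = Union[str, AudioPart]


def render_prompt(parts: List[Part], prefix: str = "[Audio]") -> str:
    """Join prompt parts, labeling only the transcribed ones with the prefix."""
    lines = []
    for part in parts:
        if isinstance(part, AudioPart):
            lines.append(f"{prefix}: {part.transcript}")
        else:
            lines.append(part)
    return "\n".join(lines)


prompt = render_prompt(
    ["Summarize this voicemail:", AudioPart("Call me back at noon.")]
)
print(prompt)
# Summarize this voicemail:
# [Audio]: Call me back at noon.
```

Only the transcribed part carries the prefix, so the typed instruction and the spoken content stay distinguishable in the final prompt.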
import os

from openai import AsyncOpenAI
from timbal import Agent
from timbal.state import get_run_context
from timbal.types.content import content_factory
from timbal.types.file import File

async def stt(audio_file: File) -> str:
    """Transcribe an audio file."""
    client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    transcript = await client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
    return transcript.text

async def pre_hook():
    """Transcribe audio files before processing."""
    span = get_run_context().current_span()
    prompt = span.input.get("prompt")
    
    # Transcribe audio file and add prefix
    if (isinstance(prompt, File) and 
        prompt.__content_type__ and 
        prompt.__content_type__.startswith("audio/")):
        transcription = await stt(prompt)
        span.input["prompt"] = content_factory(f"[Audio]: {transcription}")

agent = Agent(
    name="AudioAgent",
    model="openai/gpt-4.1-mini",
    pre_hook=pre_hook
)

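async def main():
    # `await` is only valid inside a coroutine, so run the agent from main().
    audio_file = File.validate("/path/to/recording.wav")
    result = await agent(prompt=audio_file).collect()

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())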
This example calls OpenAI’s transcription API directly. For more advanced features such as language detection, timestamps, and more robust error handling, refer to the OpenAI Audio API documentation.
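
One way to approach the error handling mentioned above is to wrap the transcription call in a timeout with a few retries. The sketch below uses only the standard library and accepts any async transcription function, so it does not depend on a particular provider. `transcribe_safely` and its parameters are illustrative assumptions, not part of Timbal or the OpenAI SDK.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical alias: any async function that turns an audio object into text.
Transcriber = Callable[[object], Awaitable[str]]


async def transcribe_safely(
    transcribe: Transcriber,
    audio: object,
    *,
    attempts: int = 3,
    timeout_s: float = 30.0,
    backoff_s: float = 0.5,
) -> Optional[str]:
    """Return the transcript, or None if every attempt fails or times out."""
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(transcribe(audio), timeout=timeout_s)
        except Exception:
            # Broad on purpose for this sketch: covers timeouts and provider
            # errors. A real implementation would likely catch narrower types.
            if attempt == attempts:
                return None
            # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
            await asyncio.sleep(backoff_s * 2 ** (attempt - 1))
    return None
```

In the pre_hook above, this could be called as `await transcribe_safely(stt, prompt)`, leaving `span.input["prompt"]` untouched when the result is `None`. Whether to fall back silently or raise instead is a design choice that depends on the application.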

Key Features

  • Pre-hook Transcription: Audio is transcribed before the agent processes it
  • Any Model: Works with any text model, not just audio-capable ones
  • Flexible Providers: Use any transcription service (OpenAI, ElevenLabs, or custom)
  • Audio Prefix: The “[Audio]” prefix clearly indicates transcribed content
  • File Support: Works with local files, URLs, and base64 data
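
The Flexible Providers point can be sketched by treating the transcriber as an injected async function. The names below (`Transcriber`, `make_labeler`, `echo_provider`) are hypothetical and exist only for this illustration. `echo_provider` is a stand-in that makes no network call; a real implementation would call OpenAI, ElevenLabs, or another service in its place.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical alias: any async function that maps raw audio bytes to text.
Transcriber = Callable[[bytes], Awaitable[str]]


def make_labeler(
    transcribe: Transcriber, prefix: str = "[Audio]"
) -> Callable[[bytes], Awaitable[str]]:
    """Build a function that transcribes audio and prepends the prefix."""

    async def label(audio: bytes) -> str:
        text = await transcribe(audio)
        return f"{prefix}: {text.strip()}"

    return label


async def echo_provider(audio: bytes) -> str:
    """Stand-in provider for demonstration; it only reports the input size."""
    return f"{len(audio)} bytes of audio"


labeled = asyncio.run(make_labeler(echo_provider)(b"abcd"))
print(labeled)  # [Audio]: 4 bytes of audio
```

Because the labeling logic depends only on the function signature, swapping providers means passing a different async function and leaving the rest unchanged.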