Agents can have a voice interface. A fast voice model handles immediate user interaction and delegates complex tasks to a more powerful primary agent.

Architecture

  1. Voice Interface Agent - A lightweight front-end agent that uses a low-latency, real-time audio model to handle greetings, chitchat, and simple clarifications directly.
  2. Primary Agent - The main agent, with tools and the full capabilities of Autonomy agents. It handles complex questions, database lookups, and tool-based tasks.
When the voice agent receives a complex request, it says a filler phrase (like “Let me check on that”) and delegates to the primary agent. The primary agent processes the request, potentially calling tools, and returns a response that the voice agent speaks verbatim.

Create a Voice Agent

Add a voice configuration to any agent to enable voice capabilities:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="assistant",
    instructions="You are a helpful customer service agent.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "realtime_model": "gpt-4o-realtime-preview",
      "voice": "nova",
    },
  )


Node.start(main)
Once running, connect to your voice agent via WebSocket:
ws://localhost:8000/agents/assistant/voice
The agent also remains available via the standard HTTP API for text interactions.
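For a quick connectivity check, here is a minimal client sketch using the third-party websockets package. The exact message framing over this socket depends on the configured audio formats (pcm16 by default), so this sketch only opens the session and logs incoming frames; treat it as a starting point, not a full client.
import asyncio

import websockets  # third-party package: pip install websockets


async def main():
  uri = "ws://localhost:8000/agents/assistant/voice"
  async with websockets.connect(uri) as ws:
    # The session streams audio in the configured input/output formats
    # (pcm16 by default). We only log frame sizes here to confirm the
    # connection works; a real client would capture and play audio.
    async for message in ws:
      print(f"received frame of {len(message)} bytes")


asyncio.run(main())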

Voice Configuration

The voice parameter accepts a dictionary with the following options:
| Option | Description | Default |
| --- | --- | --- |
| realtime_model | Model for the voice agent (must support the realtime API) | gpt-4o-realtime-preview |
| voice | TTS voice ID (alloy, echo, fable, onyx, nova, shimmer) | echo |
| allowed_actions | Actions the voice agent handles without delegating | See below |
| instructions | Custom voice agent instructions (auto-generated if not set) | None |
| filler_phrases | Phrases to say before delegating to the primary agent | See below |
| input_audio_format | Audio format for input (pcm16, g711_ulaw, g711_alaw) | pcm16 |
| output_audio_format | Audio format for output (pcm16, g711_ulaw, g711_alaw) | pcm16 |
| vad_threshold | Voice Activity Detection sensitivity (0.0-1.0) | 0.5 |
| vad_prefix_padding_ms | Milliseconds of audio to include before detected speech | 300 |
| vad_silence_duration_ms | Milliseconds of silence that mark the end of speech | 500 |
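These options compose. As an illustrative sketch (the agent name and values here are hypothetical, not recommendations), the following configures an agent for a telephony integration, where G.711 mu-law audio is common:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="phone",
    instructions="You are a phone support agent.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "onyx",
      # Telephony trunks typically carry G.711 mu-law audio
      "input_audio_format": "g711_ulaw",
      "output_audio_format": "g711_ulaw",
      # Raise the threshold slightly to ignore line noise
      "vad_threshold": 0.6,
    },
  )


Node.start(main)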

Default Allowed Actions

By default, the voice agent handles these interactions directly:
  • Greetings
  • Chitchat
  • Collecting information
  • Clarifications

Default Filler Phrases

Before delegating complex requests, the voice agent says one of:
  • “Just a second.”
  • “Let me check.”
  • “One moment.”
  • “Let me look into that.”
  • “Give me a moment.”
  • “Let me see.”

Customizing Behavior

Specify What the Voice Agent Handles Directly

Control which interactions the voice agent handles without delegating:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="receptionist",
    instructions="You are a medical office receptionist.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "nova",
      "allowed_actions": [
        "greetings and introductions",
        "confirming appointment times",
        "asking for patient name",
        "basic office hour questions",
        "thanking the caller",
      ],
    },
  )


Node.start(main)

Custom Filler Phrases

Set context-appropriate filler phrases for your use case:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="support",
    instructions="You are a technical support agent.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "echo",
      "filler_phrases": [
        "Let me look up your account.",
        "One moment while I check that.",
        "Let me pull up that information.",
        "Just a second, I'm checking our system.",
      ],
    },
  )


Node.start(main)

VAD Settings for Responsive Interaction

Tune Voice Activity Detection for your environment:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="assistant",
    instructions="You are a helpful assistant.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "alloy",
      # More sensitive detection (lower threshold)
      "vad_threshold": 0.3,
      # Wait longer before considering speech ended
      "vad_silence_duration_ms": 700,
    },
  )


Node.start(main)

Voice Agents with Tools

Voice agents work seamlessly with tools. The primary agent has access to all tools and uses them when handling delegated requests:
images/main/main.py
from autonomy import Agent, Model, Node, Tool


async def lookup_order(order_id: str) -> dict:
  """Look up an order by ID."""
  # Your order lookup logic
  return {"order_id": order_id, "status": "shipped", "eta": "Tomorrow"}


async def main(node):
  await Agent.start(
    node=node,
    name="support",
    instructions="""You are a customer support agent.
    Use the lookup_order tool to find order information.""",
    model=Model("claude-sonnet-4-v1"),
    tools=[Tool(lookup_order)],
    voice={
      "voice": "nova",
      "filler_phrases": [
        "Let me look up your order.",
        "One moment, checking our system.",
      ],
    },
  )


Node.start(main)
When a user asks “Where is my order 12345?”, the flow is:
  1. Voice agent says “Let me look up your order.”
  2. Voice agent delegates to primary agent
  3. Primary agent calls lookup_order("12345")
  4. Primary agent returns “Your order has shipped and will arrive tomorrow.”
  5. Voice agent speaks the response verbatim

Voice Agents with Knowledge

Combine voice with knowledge search for intelligent Q&A:
images/main/main.py
from autonomy import Agent, Model, Node, Knowledge, KnowledgeTool, NaiveChunker


async def main(node):
  # Create knowledge base
  knowledge = Knowledge(
    name="product_docs",
    searchable=True,
    model=Model("embed-english-v3"),
    max_results=5,
    chunker=NaiveChunker(max_characters=1024),
  )

  # Add documents
  await knowledge.add_document(
    document_name="user-guide",
    document_url="https://example.com/docs/user-guide.md",
    content_type="text/markdown",
  )

  # Create agent with voice and knowledge
  await Agent.start(
    node=node,
    name="docs",
    instructions="""You are a product expert.
    Search the knowledge base to answer questions accurately.""",
    model=Model("claude-sonnet-4-v1"),
    tools=[KnowledgeTool(knowledge=knowledge, name="search_docs")],
    voice={
      "voice": "shimmer",
      "filler_phrases": [
        "Let me search the docs for that.",
        "One moment, I'll look that up.",
      ],
    },
  )


Node.start(main)

Memory Isolation

Voice sessions support the same memory isolation as text conversations. Pass scope and conversation parameters when connecting:
# WebSocket connection with scope and conversation
ws://localhost:8000/agents/assistant/voice?scope=user-123&conversation=session-456
This ensures each user’s voice conversation history is isolated.
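If you build that URL programmatically, standard query-string encoding is all that is required. A minimal sketch (the identifiers are placeholders):
from urllib.parse import urlencode

base = "ws://localhost:8000/agents/assistant/voice"
params = {"scope": "user-123", "conversation": "session-456"}
uri = f"{base}?{urlencode(params)}"
# -> ws://localhost:8000/agents/assistant/voice?scope=user-123&conversation=session-456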

Complete Example: Software Engineering Interviewer

This example demonstrates a voice agent that conducts first-round screening interviews for software engineering candidates. The agent assesses technical fundamentals, problem-solving ability, and communication skills.
from autonomy import Node, Agent, Model


INSTRUCTIONS = """
You are an experienced software engineering interviewer conducting first-round
screening interviews. Your goal is to assess candidates on technical fundamentals,
problem-solving ability, and communication skills.

Interview structure:
1. Brief introduction and put the candidate at ease
2. Ask about their background and experience (2-3 minutes)
3. Technical questions appropriate to their level (10-15 minutes)
4. Behavioral questions about teamwork and challenges (5 minutes)
5. Answer any questions they have about the role

Guidelines:
- Be warm and professional to help candidates perform their best
- Ask follow-up questions to understand their thought process
- Probe deeper if answers are surface-level
- Give hints if they're stuck, but note that you did
- Keep responses concise since this is a voice conversation
- Adapt difficulty based on their stated experience level

Technical topics to cover:
- Data structures and algorithms fundamentals
- System design basics (for senior candidates)
- Language-specific questions based on their background
- Problem-solving approach and debugging strategies

After the interview, provide a brief summary of strengths and areas for improvement.
"""


async def main(node: Node):
  await Agent.start(
    node=node,
    name="interviewer",
    instructions=INSTRUCTIONS,
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "alloy",
      "allowed_actions": [
        "greetings and introductions",
        "small talk to put candidate at ease",
        "clarifying questions",
        "acknowledging responses",
      ],
      "filler_phrases": [
        "Let me think about that.",
        "That's a good point.",
        "Interesting, let me follow up.",
        "One moment.",
      ],
    },
  )


Node.start(main)

Using the Interviewer

Connect via WebSocket for voice:
ws://localhost:8000/agents/interviewer/voice
Or use HTTP for text:
curl
curl --request POST \
  --header "Content-Type: application/json" \
  --data '{"message": "Hi, I am ready to start the interview."}' \
  "https://${CLUSTER}-${ZONE}.cluster.autonomy.computer/agents/interviewer"
The interviewer will:
  1. Greet the candidate and explain the interview format
  2. Ask about their background and experience
  3. Pose technical questions adapted to their level
  4. Explore behavioral scenarios
  5. Answer questions about the role
  6. Provide feedback on their performance