Agents can have a voice interface. A fast voice model handles immediate user interaction and delegates complex tasks to a more powerful primary agent.

Architecture

  1. Voice Interface Agent - A lightweight front-end agent that uses a low-latency, real-time audio model to handle greetings, chitchat, and simple clarifications directly.
  2. Primary Agent - The main agent, with tools and the full capabilities of Autonomy agents. It handles complex questions, database lookups, and tool-based tasks.
When the voice agent receives a complex request, it says a filler phrase (like “Let me check on that”) and delegates to the primary agent. The primary agent processes the request, potentially calling tools, and returns a response that the voice agent speaks verbatim.

Create a Voice Agent

Add a voice configuration to any agent to enable voice capabilities:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="assistant",
    instructions="You are a helpful customer service agent.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "realtime_model": "gpt-4o-realtime-preview",
      "voice": "nova",
    },
  )


Node.start(main)
Once running, connect to your voice agent via WebSocket:
ws://localhost:8000/agents/assistant/voice
The agent also remains available via the standard HTTP API for text interactions.
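For a quick connectivity check, here is a minimal client sketch using the third-party websockets package. The exact message framing over this socket depends on the configured audio formats (pcm16 by default), so this sketch only opens the session and logs incoming frames; treat it as a starting point, not a full client.
import asyncio

import websockets  # third-party package: pip install websockets


async def main():
  uri = "ws://localhost:8000/agents/assistant/voice"
  async with websockets.connect(uri) as ws:
    # The session streams audio in the configured input/output formats
    # (pcm16 by default). We only log frame sizes here to confirm the
    # connection works; a real client would capture and play audio.
    async for message in ws:
      print(f"received frame of {len(message)} bytes")


asyncio.run(main())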

Voice Configuration

The voice parameter accepts a dictionary with the following options:
| Option | Description | Default |
| --- | --- | --- |
| realtime_model | Model for the voice agent (must support the realtime API) | gpt-4o-realtime-preview |
| voice | TTS voice ID (alloy, echo, fable, onyx, nova, shimmer) | echo |
| allowed_actions | Actions the voice agent handles without delegating | See below |
| instructions | Custom voice agent instructions (auto-generated if not set) | None |
| filler_phrases | Phrases to say before delegating to the primary agent | See below |
| input_audio_format | Audio format for input (pcm16, g711_ulaw, g711_alaw) | pcm16 |
| output_audio_format | Audio format for output (pcm16, g711_ulaw, g711_alaw) | pcm16 |
| vad_threshold | Voice Activity Detection sensitivity (0.0-1.0) | 0.5 |
| vad_prefix_padding_ms | Milliseconds of audio to include before detected speech | 300 |
| vad_silence_duration_ms | Milliseconds of silence that mark the end of speech | 500 |
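These options compose. As an illustrative sketch (the agent name and values here are hypothetical, not recommendations), the following configures an agent for a telephony integration, where G.711 mu-law audio is common:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="phone",
    instructions="You are a phone support agent.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "onyx",
      # Telephony trunks typically carry G.711 mu-law audio
      "input_audio_format": "g711_ulaw",
      "output_audio_format": "g711_ulaw",
      # Raise the threshold slightly to ignore line noise
      "vad_threshold": 0.6,
    },
  )


Node.start(main)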

Default Allowed Actions

By default, the voice agent handles these interactions directly:
  • Greetings
  • Chitchat
  • Collecting information
  • Clarifications

Default Filler Phrases

Before delegating complex requests, the voice agent says one of:
  • “Just a second.”
  • “Let me check.”
  • “One moment.”
  • “Let me look into that.”
  • “Give me a moment.”
  • “Let me see.”

Customizing Behavior

Specify What the Voice Agent Handles Directly

Control which interactions the voice agent handles without delegating:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="receptionist",
    instructions="You are a medical office receptionist.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "nova",
      "allowed_actions": [
        "greetings and introductions",
        "confirming appointment times",
        "asking for patient name",
        "basic office hour questions",
        "thanking the caller",
      ],
    },
  )


Node.start(main)

Custom Filler Phrases

Set context-appropriate filler phrases for your use case:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="support",
    instructions="You are a technical support agent.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "echo",
      "filler_phrases": [
        "Let me look up your account.",
        "One moment while I check that.",
        "Let me pull up that information.",
        "Just a second, I'm checking our system.",
      ],
    },
  )


Node.start(main)

VAD Settings for Responsive Interaction

Tune Voice Activity Detection for your environment:
images/main/main.py
from autonomy import Agent, Model, Node


async def main(node):
  await Agent.start(
    node=node,
    name="assistant",
    instructions="You are a helpful assistant.",
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "alloy",
      # More sensitive detection (lower threshold)
      "vad_threshold": 0.3,
      # Wait longer before considering speech ended
      "vad_silence_duration_ms": 700,
    },
  )


Node.start(main)

Voice Agents with Tools

Voice agents work seamlessly with tools. The primary agent has access to all tools and uses them when handling delegated requests:
images/main/main.py
from autonomy import Agent, Model, Node, Tool


async def lookup_order(order_id: str) -> dict:
  """Look up an order by ID."""
  # Your order lookup logic
  return {"order_id": order_id, "status": "shipped", "eta": "Tomorrow"}


async def main(node):
  await Agent.start(
    node=node,
    name="support",
    instructions="""You are a customer support agent.
    Use the lookup_order tool to find order information.""",
    model=Model("claude-sonnet-4-v1"),
    tools=[Tool(lookup_order)],
    voice={
      "voice": "nova",
      "filler_phrases": [
        "Let me look up your order.",
        "One moment, checking our system.",
      ],
    },
  )


Node.start(main)
When a user asks “Where is my order 12345?”, the flow is:
  1. Voice agent says “Let me look up your order.”
  2. Voice agent delegates to primary agent
  3. Primary agent calls lookup_order("12345")
  4. Primary agent returns “Your order has shipped and will arrive tomorrow.”
  5. Voice agent speaks the response verbatim

Voice Agents with Knowledge

Combine voice with knowledge search for intelligent Q&A:
images/main/main.py
from autonomy import Agent, Model, Node, Knowledge, KnowledgeTool, NaiveChunker


async def main(node):
  # Create knowledge base
  knowledge = Knowledge(
    name="product_docs",
    searchable=True,
    model=Model("embed-english-v3"),
    max_results=5,
    chunker=NaiveChunker(max_characters=1024),
  )

  # Add documents
  await knowledge.add_document(
    document_name="user-guide",
    document_url="https://example.com/docs/user-guide.md",
    content_type="text/markdown",
  )

  # Create agent with voice and knowledge
  await Agent.start(
    node=node,
    name="docs",
    instructions="""You are a product expert.
    Search the knowledge base to answer questions accurately.""",
    model=Model("claude-sonnet-4-v1"),
    tools=[KnowledgeTool(knowledge=knowledge, name="search_docs")],
    voice={
      "voice": "shimmer",
      "filler_phrases": [
        "Let me search the docs for that.",
        "One moment, I'll look that up.",
      ],
    },
  )


Node.start(main)

Memory Isolation

Voice sessions support the same memory isolation as text conversations. Pass scope and conversation parameters when connecting:
# WebSocket connection with scope and conversation
ws://localhost:8000/agents/assistant/voice?scope=user-123&conversation=session-456
This ensures each user’s voice conversation history is isolated.
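If you build that URL programmatically, standard query-string encoding is all that is required. A minimal sketch (the identifiers are placeholders):
from urllib.parse import urlencode

base = "ws://localhost:8000/agents/assistant/voice"
params = {"scope": "user-123", "conversation": "session-456"}
uri = f"{base}?{urlencode(params)}"
# -> ws://localhost:8000/agents/assistant/voice?scope=user-123&conversation=session-456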

Complete Example: Software Engineering Interviewer

This example demonstrates a voice agent that conducts first-round screening interviews for software engineering candidates. The agent assesses technical fundamentals, problem-solving ability, and communication skills.
from autonomy import Node, Agent, Model


INSTRUCTIONS = """
You are an experienced software engineering interviewer conducting first-round
screening interviews. Your goal is to assess candidates on technical fundamentals,
problem-solving ability, and communication skills.

Interview structure:
1. Brief introduction and put the candidate at ease
2. Ask about their background and experience (2-3 minutes)
3. Technical questions appropriate to their level (10-15 minutes)
4. Behavioral questions about teamwork and challenges (5 minutes)
5. Answer any questions they have about the role

Guidelines:
- Be warm and professional to help candidates perform their best
- Ask follow-up questions to understand their thought process
- Probe deeper if answers are surface-level
- Give hints if they're stuck, but note that you did
- Keep responses concise since this is a voice conversation
- Adapt difficulty based on their stated experience level

Technical topics to cover:
- Data structures and algorithms fundamentals
- System design basics (for senior candidates)
- Language-specific questions based on their background
- Problem-solving approach and debugging strategies

After the interview, provide a brief summary of strengths and areas for improvement.
"""


async def main(node: Node):
  await Agent.start(
    node=node,
    name="interviewer",
    instructions=INSTRUCTIONS,
    model=Model("claude-sonnet-4-v1"),
    voice={
      "voice": "alloy",
      "allowed_actions": [
        "greetings and introductions",
        "small talk to put candidate at ease",
        "clarifying questions",
        "acknowledging responses",
      ],
      "filler_phrases": [
        "Let me think about that.",
        "That's a good point.",
        "Interesting, let me follow up.",
        "One moment.",
      ],
    },
  )


Node.start(main)

Using the Interviewer

Connect via WebSocket for voice:
ws://localhost:8000/agents/interviewer/voice
Or use HTTP for text:
curl
curl --request POST \
  --header "Content-Type: application/json" \
  --data '{"message": "Hi, I am ready to start the interview."}' \
  "https://${CLUSTER}-${ZONE}.cluster.autonomy.computer/agents/interviewer"
The interviewer will:
  1. Greet the candidate and explain the interview format
  2. Ask about their background and experience
  3. Pose technical questions adapted to their level
  4. Explore behavioral scenarios
  5. Answer questions about the role
  6. Provide feedback on their performance