To build a scalable video transcription product, you need to handle large file uploads,
manage socket connections for real-time progress, integrate speech-to-text and language models,
and work within provider rate limits. As you scale to many users, infrastructure becomes the hardest
part: you must orchestrate jobs, scale horizontally, balance load across model providers, pool
connections, and handle failures gracefully.

Most teams spend more time building infrastructure than focusing on their product.

This guide walks through building a highly scalable agentic product that automatically
transcribes videos and translates transcripts to English.
```
+----------------------+       +----------------------+       +----------------------+
|      Video File      |------>|    Extract Audio     |------>|      Transcribe      |
|     (.mp4, .mov)     |       |       (ffmpeg)       |       |     and Diarize      |
+----------------------+       +----------------------+       +-----------+----------+
                                                                          |
           +--------------------------------------------------------------+
           |
           v
+----------------------+       +----------------------+       +----------------------+
|    Match Speakers    |       | Translate to English |       |       Results        |
|    Across Chunks     |------>|       (Agent)        |------>|    (transcript +     |
|       (Agent)        |       |                      |       |     translation)     |
+----------------------+       +----------------------+       +----------------------+
```
The Autonomy Computer provides all of this infrastructure.
You get access to speech-to-text models like gpt-4o-transcribe-diarize
for transcription with speaker identification, language models for translation agents,
built-in HTTP and WebSocket servers for uploads with streaming progress,
and a runtime that handles deployment and scaling automatically.
Find the folder ID in the Box web UI: it is the ID in the URL when viewing a folder.
3. Redeploy

/dev/null/terminal.sh
```shell
autonomy
```
The service automatically creates a Box webhook when it starts. Videos uploaded to the configured folder are processed and results are uploaded back as markdown files.
The service uses gpt-4o-transcribe-diarize for transcription with automatic speaker identification:
images/main/transcribe.py
```python
async def transcribe_audio(audio_path: str, use_diarization: bool = True) -> tuple:
    """Transcribe audio file using GPT-4o with diarization."""
    if use_diarization:
        model = Model("gpt-4o-transcribe-diarize")
        with open(audio_path, "rb") as audio_file:
            result = await model.speech_to_text(
                audio_file=audio_file,
                language=None,  # Auto-detect language
                model="gpt-4o-transcribe-diarize",
                response_format="diarized_json",
                chunking_strategy="auto",
            )
        transcript = format_diarized_transcript(result)
        return transcript, result
```
The diarization model returns segments with speaker labels:
```
Speaker A: Welcome to today's interview. I'm here with Dr. Lopez.
Speaker B: Thank you for having me. I'm excited to discuss our research.
Speaker A: Let's start with the basics. What inspired this project?
```
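The source does not show how `format_diarized_transcript` is written. As a rough illustration of how diarized segments could become text like the sample above, here is a minimal sketch. It assumes each segment is a plain dict with `speaker` and `text` keys (the real response object may expose these differently), and the name `format_diarized_segments` is hypothetical.

```python
def format_diarized_segments(segments: list[dict]) -> str:
    """Render diarized segments as 'Speaker X: text' paragraphs.

    Assumption: each segment is a dict with "speaker" and "text" keys.
    """
    paragraphs: list[tuple[str, list[str]]] = []
    for segment in segments:
        speaker = segment["speaker"]
        text = segment["text"].strip()
        if not text:
            continue  # Skip empty or whitespace-only segments.
        # Merge consecutive segments from the same speaker into one paragraph.
        if paragraphs and paragraphs[-1][0] == speaker:
            paragraphs[-1][1].append(text)
        else:
            paragraphs.append((speaker, [text]))
    # Separate speaker turns with a blank line.
    return "\n\n".join(
        f"Speaker {speaker}: {' '.join(parts)}" for speaker, parts in paragraphs
    )
```

Separating turns with a blank line is a choice made for this sketch; it happens to match the double-newline speaker boundaries that the chunk splitter later in this guide relies on.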
When a video is split into chunks, the same speaker may get different labels in each chunk. An agent analyzes the transcript to unify speaker labels:
images/main/speakers.py
```python
SPEAKER_MATCHER_INSTRUCTIONS = """You are an expert at analyzing transcripts to identify and match speakers.

You will receive a transcript transcribed in multiple chunks. Each chunk has
its own speaker labels (e.g., "Speaker 1A", "Speaker 1B" for chunk 1).

Analyze the transcript and return ONLY a JSON mapping of speaker labels.

CLUES TO LOOK FOR:
- Self-introductions: "I'm Maria", "My name is Dr. Lopez"
- Being addressed by name: "Thank you, Maria"
- Role indicators: "As the host...", "In my research..."
- Speaking patterns: Who asks questions (host) vs who answers (guest)

OUTPUT FORMAT (JSON only):
{
  "mapping": {
    "Speaker 1A": "Maria (Host)",
    "Speaker 1B": "Dr. Lopez (Guest)",
    "Speaker 2A": "Maria (Host)"
  },
  "confidence": "high",
  "notes": "Maria identified as host from introduction."
}
"""
```
The model returns only the mapping; the labels are then replaced in code, so the model does not have to regenerate the full transcript:
images/main/speakers.py
```python
async def match_speakers_across_chunks(transcript: str) -> str:
    """Unify speaker labels across transcript chunks using LLM analysis."""
    model = Model("claude-sonnet-4-5")
    messages = [
        {"role": "system", "content": SPEAKER_MATCHER_INSTRUCTIONS},
        {
            "role": "user",
            "content": f"Analyze this transcript and return a JSON speaker mapping:\n\n{transcript}",
        },
    ]
    response = await model.complete_chat(messages, stream=False)
    result = response.choices[0].message.content.strip()

    # Parse the JSON mapping
    mapping_data = extract_json_from_response(result)
    mapping = mapping_data.get("mapping", {})

    # Apply mapping to transcript
    return apply_speaker_mapping(transcript, mapping)
```
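The helpers `extract_json_from_response` and `apply_speaker_mapping` are not shown in the source. The sketch below is one possible way to write them, under the assumptions that the model reply may wrap the JSON in prose or a code fence and that every transcript line starts with `<label>:`. The names `extract_json_object` and `relabel_speakers` are hypothetical stand-ins, not the service's actual functions.

```python
import json
import re


def extract_json_object(response_text: str) -> dict:
    """Pull a JSON object out of a model reply, tolerating surrounding text.

    Returns an empty dict when no object can be parsed.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(response_text[start : end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def relabel_speakers(transcript: str, mapping: dict[str, str]) -> str:
    """Replace speaker labels at the start of each line using the mapping."""
    if not mapping:
        return transcript
    # The lookahead requires a colon right after the label, so "Speaker 1"
    # never matches inside "Speaker 10:".
    pattern = re.compile(
        r"^(" + "|".join(re.escape(label) for label in mapping) + r")(?=:)",
        flags=re.MULTILINE,
    )
    # A single pass avoids chained replacements when a new name equals an old label.
    return pattern.sub(lambda match: mapping[match.group(1)], transcript)
```

Falling back to an empty mapping leaves the transcript unchanged, which seems a reasonable failure mode when the model reply cannot be parsed.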
The translator agent converts transcripts to English while preserving speaker labels:
images/main/translate.py
```python
TRANSLATOR_INSTRUCTIONS = """You are a professional translator. You will receive transcribed text that may be
in any language. The transcription includes speaker labels.

Your job is to:
1. Identify the source language
2. Translate the text accurately to English
3. PRESERVE all existing speaker labels exactly as they appear

Output format:
---
Source Language: [detected language]
Translation:
[speaker label]: [translated text]
---

CRITICAL RULES:
- PRESERVE existing speaker labels EXACTLY as they appear in the input
- DO NOT modify, rename, or add any speaker labels
- Maintain natural, fluent English while preserving the original meaning
"""


async def initialize(node: Node) -> bool:
    """Initialize the translator agent."""
    global _agent
    _agent = await Agent.start(
        node=node,
        name="translator",
        instructions=TRANSLATOR_INSTRUCTIONS,
        model=Model("claude-sonnet-4-5"),
    )
    return True
```
For long transcripts, the translator processes chunks at speaker boundaries:
images/main/translate.py
```python
def split_transcript_into_chunks(
    text: str, max_chars: int = 15000
) -> list[str]:
    """Split transcript into chunks at speaker boundaries."""
    if len(text) <= max_chars:
        return [text]

    chunks = []
    current_chunk = []
    current_length = 0

    # Split on double newlines (speaker boundaries)
    segments = text.split("\n\n")

    for segment in segments:
        segment_length = len(segment) + 2
        if current_length + segment_length > max_chars and current_chunk:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.append(segment)
        current_length += segment_length

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks
```
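Once each chunk has been translated, the per-chunk replies have to be stitched back together. The source does not show that step, so the following is only a sketch. It assumes every reply follows the `Source Language:` / `Translation:` layout from the translator instructions; `parse_translation_reply` and `merge_translations` are hypothetical names.

```python
import re


def parse_translation_reply(reply: str) -> tuple[str, str]:
    """Split one translator reply into (source_language, translated_text).

    Assumption: the reply uses the 'Source Language:' and 'Translation:'
    markers. If they are missing, the whole reply is treated as the translation.
    """
    language_match = re.search(r"^Source Language:\s*(.+)$", reply, flags=re.MULTILINE)
    language = language_match.group(1).strip() if language_match else ""
    marker = re.search(r"^Translation:[ \t]*\n?", reply, flags=re.MULTILINE)
    body = reply[marker.end():] if marker else reply
    # Drop the '---' delimiter lines that frame the reply.
    lines = [line for line in body.splitlines() if line.strip() != "---"]
    return language, "\n".join(lines).strip()


def merge_translations(replies: list[str]) -> tuple[str, str]:
    """Join per-chunk translations and report the first detected language."""
    parsed = [parse_translation_reply(reply) for reply in replies]
    language = next((lang for lang, _ in parsed if lang), "")
    text = "\n\n".join(body for _, body in parsed if body)
    return language, text
```

Joining with a blank line keeps the same speaker-boundary convention the splitter uses, so the merged translation can be processed the same way as the original transcript.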
The service queues videos to process one at a time, preventing resource exhaustion:
images/main/jobs.py
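```python
import asyncio
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProcessingJob:
    """Represent a video processing job in the queue."""

    id: str
    filename: str
    status: str  # "queued", "processing", "complete", "error"
    progress: int
    queued_at: datetime


# Shared queue state for the service.
processing_lock = asyncio.Lock()
current_job: Optional[ProcessingJob] = None
job_queue: deque[ProcessingJob] = deque()
```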
Clients waiting in the queue receive position updates:
images/main/upload.py
```python
while job.status == "queued":
    await asyncio.sleep(2)
    async with processing_lock:
        position = get_job_position(job_id)
    if position:
        await send_json({
            "type": "queue_update",
            "position": position,
            "message": f"You are #{position} in queue.",
        })
```
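The `get_job_position` helper is not shown in the source. A minimal sketch of such a lookup, assuming jobs sit in a `deque` in arrival order and positions are 1-based, could look like this. `QueuedJob` and `find_queue_position` are hypothetical names used only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueuedJob:
    """Stand-in for ProcessingJob with only the field this sketch needs."""

    id: str


def find_queue_position(queue: deque, job_id: str) -> Optional[int]:
    """Return the 1-based position of a job in the queue, or None if absent."""
    for index, job in enumerate(queue, start=1):
        if job.id == job_id:
            return index
    return None


# Example: the first job in the queue is at position 1.
queue = deque([QueuedJob("a"), QueuedJob("b")])
first_position = find_queue_position(queue, "a")
```

Using 1-based positions matters for a truthiness check such as `if position:`. With 0-based positions, the job at the front of the queue would be falsy and would never receive an update.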