The Autonomy Computer provides all of this infrastructure.
You get access to speech-to-text models like gpt-4o-transcribe-diarize for transcription with speaker identification, language models for translation agents,
built-in HTTP and WebSocket servers for uploads with streaming progress,
and a runtime that handles deployment and scaling automatically.
How it works
When a video is uploaded (via the web UI or Box webhook), the service:
- Extracts audio - Uses ffmpeg to extract audio chunks from the video.
- Transcribes with diarization - Uses gpt-4o-transcribe-diarize to identify different speakers.
- Matches speakers - For long videos, an agent analyzes context to unify speaker labels across chunks.
- Translates - A translator agent converts the transcript to English while preserving speaker labels.
- Returns results - Provides both the original transcript and English translation.
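The audio extraction step can be sketched as a small command builder. This is a minimal illustration, not the service's actual code: the function name `ffmpeg_extract_cmd`, the output naming scheme, and the mono 16 kHz MP3 settings are assumptions. Only standard ffmpeg options are used.

```python
from pathlib import Path


def ffmpeg_extract_cmd(video: str, out_dir: str, index: int,
                       start_s: float, duration_s: float) -> list[str]:
    """Build an ffmpeg command that extracts one audio chunk from a video.

    Hypothetical helper: the file naming and audio settings are assumptions.
    """
    out_path = Path(out_dir) / f"chunk_{index:03d}.mp3"
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-ss", str(start_s),     # seek to the chunk start (seconds)
        "-t", str(duration_s),   # read at most this many seconds
        "-i", video,
        "-vn",                   # drop the video stream, keep audio only
        "-ac", "1",              # downmix to mono
        "-ar", "16000",          # resample to 16 kHz
        str(out_path),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.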
Quick start
1. Sign up and install the autonomy command
Complete the steps to get started with Autonomy.
2. Get the example code
/dev/null/terminal.sh
File Structure:
3. Deploy
/dev/null/terminal.sh
Configure Box integration (optional)
The service can automatically process videos uploaded to a Box folder.
1. Get Box API credentials
Create a Box application at app.box.com/developers/console:
- Create a new Custom App with Server Authentication (Client Credentials Grant).
- Note your Client ID, Client Secret, and Enterprise ID.
- Authorize the application in your Box admin console.
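The three values noted above are what a Client Credentials Grant token request needs. The sketch below builds the form body for such a request. The field names reflect Box's Client Credentials Grant as commonly documented, but treat them as an assumption and check them against Box's current API reference. `ccg_token_form` is a hypothetical helper, and whether the example service requests tokens this way is not stated here.

```python
from urllib.parse import urlencode

# Box OAuth 2.0 token endpoint (verify against Box's API reference).
BOX_TOKEN_URL = "https://api.box.com/oauth2/token"


def ccg_token_form(client_id: str, client_secret: str,
                   enterprise_id: str) -> dict[str, str]:
    """Form fields for a Client Credentials Grant token request.

    Hypothetical helper; authenticates as the enterprise's service account.
    """
    return {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "box_subject_type": "enterprise",
        "box_subject_id": enterprise_id,
    }


# URL-encoded body, ready to POST to BOX_TOKEN_URL with
# Content-Type: application/x-www-form-urlencoded.
body = urlencode(ccg_token_form("my-client-id", "my-client-secret", "123456"))
```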
2. Create secrets.yaml
Copy secrets.yaml.example and fill in your credentials:
secrets.yaml
Find the folder ID in the Box web UI. It’s the ID in the URL when viewing a folder.
3. Redeploy
/dev/null/terminal.sh
Learn how it works
Transcription with speaker diarization
The service uses gpt-4o-transcribe-diarize for transcription with automatic speaker identification:
images/main/transcribe.py
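Once a diarizing model returns segments tagged with a speaker, they still have to be turned into a readable transcript. The sketch below shows one way to do that. It assumes each segment is a dict with `speaker` and `text` keys, which is an illustrative shape and not necessarily what the model or the service returns. `format_transcript` is a hypothetical name.

```python
from itertools import groupby


def format_transcript(segments: list[dict]) -> str:
    """Merge consecutive segments from the same speaker into labeled lines.

    Assumes each segment looks like {"speaker": "A", "text": "..."}.
    """
    lines = []
    for speaker, group in groupby(segments, key=lambda s: s["speaker"]):
        text = " ".join(s["text"].strip() for s in group)
        lines.append(f"{speaker}: {text}")
    return "\n".join(lines)
```

Keeping one line per speaker turn makes later steps, such as relabeling speakers or translating, easier to do without losing the labels.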
Chunked processing for long videos
For videos longer than 5 minutes, the service processes audio in chunks to stay within API limits:
images/main/transcribe.py
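Chunk boundaries can be planned before any audio is extracted. The following is a minimal sketch: `plan_chunks` is a hypothetical helper, and the 300-second default for `MAX_CHUNK_DURATION` is an assumption inferred from the 5-minute threshold, not a confirmed value.

```python
MAX_CHUNK_DURATION = 300.0  # seconds; assumed default, not confirmed


def plan_chunks(total_s: float,
                max_chunk_s: float = MAX_CHUNK_DURATION) -> list[tuple[float, float]]:
    """Return (start, length) pairs in seconds that cover the whole audio."""
    if max_chunk_s <= 0:
        raise ValueError("max_chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < total_s:
        # The final chunk is shorter when the duration is not a multiple.
        length = min(max_chunk_s, total_s - start)
        chunks.append((start, length))
        start += length
    return chunks
```

A 12-minute video (720 seconds) yields three chunks: two of 300 seconds and a final one of 120 seconds. Lowering `max_chunk_s` produces more, smaller chunks.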
Speaker matching across chunks
When a video is split into chunks, the same speaker may get different labels in each chunk. An agent analyzes the transcript to unify speaker labels:
images/main/speakers.py
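However the agent reaches its conclusions, the result has to be applied back to the segments. The sketch below assumes the agent's output has been parsed into a mapping from `(chunk_index, local_label)` to a global label. That mapping format and the name `unify_speakers` are hypothetical.

```python
def unify_speakers(chunks: list[list[dict]],
                   mapping: dict[tuple[int, str], str]) -> list[dict]:
    """Relabel per-chunk speakers with global labels and flatten the chunks.

    chunks[i] holds the segments of chunk i, each with a "speaker" key.
    Unmapped labels are kept chunk-qualified so that two different people
    who happened to get the same local label are not merged by accident.
    """
    unified = []
    for index, segments in enumerate(chunks):
        for seg in segments:
            fallback = f"chunk{index}:{seg['speaker']}"
            label = mapping.get((index, seg["speaker"]), fallback)
            unified.append({**seg, "speaker": label})  # copy, do not mutate
    return unified
```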
Translation agent
The translator agent converts transcripts to English while preserving speaker labels:
images/main/translate.py
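Long transcripts are translated in batches of roughly 15,000 characters per request. One way to batch without cutting a speaker's line in half is to split only at line boundaries, as in this sketch. `split_for_translation` is a hypothetical helper, and the service's actual splitting rule may differ.

```python
def split_for_translation(transcript: str, max_chars: int = 15_000) -> list[str]:
    """Split a transcript into batches of at most max_chars characters.

    Splits only between lines so each "Speaker: text" line stays intact.
    A single line longer than max_chars becomes its own oversized batch.
    """
    batches: list[str] = []
    current: list[str] = []
    size = 0
    for line in transcript.splitlines():
        cost = len(line) + 1  # +1 for the newline that joins lines
        if current and size + cost > max_chars:
            batches.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += cost
    if current:
        batches.append("\n".join(current))
    return batches
```

Each batch can then be sent to the translator separately and the results joined with newlines in the original order.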
Box webhook integration
The service registers a webhook with Box to automatically process uploaded videos:
images/main/main.py
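When a notification arrives, the handler has to decide whether it describes a new video. The sketch below filters a webhook payload. The `trigger` and `source` fields follow the general shape of Box webhook payloads, but verify them against Box's documentation. The extension list and the name `video_from_box_event` are assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed set of extensions to treat as video; adjust as needed.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm", ".avi"}


def video_from_box_event(event: dict) -> Optional[tuple[str, str]]:
    """Return (file_id, filename) for an uploaded video, else None."""
    if event.get("trigger") != "FILE.UPLOADED":
        return None
    source = event.get("source") or {}
    if source.get("type") != "file":
        return None
    file_id = source.get("id")
    name = source.get("name", "")
    if not file_id or PurePosixPath(name).suffix.lower() not in VIDEO_EXTENSIONS:
        return None
    return file_id, name
```

Returning `None` for anything else lets the endpoint acknowledge irrelevant events quickly without queuing work.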
WebSocket uploads with progress
The web interface uses WebSocket for chunked uploads with real-time progress:
images/main/upload.py
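From the client's side, a chunked upload is a `start` message, a series of binary frames, and an `end` message. The generator below produces that sequence using the message fields listed in the API reference. The 256 KiB chunk size and the name `upload_frames` are assumptions. A real client would also wait for the server's `ready` and `chunk_ack` messages between sends, which this sketch omits.

```python
import json
import math
from typing import Iterator, Union


def upload_frames(filename: str, data: bytes,
                  chunk_size: int = 256 * 1024) -> Iterator[Union[str, bytes]]:
    """Yield the frames a client sends: start (text), chunks (binary), end (text)."""
    total_chunks = math.ceil(len(data) / chunk_size)
    yield json.dumps({
        "type": "start",
        "filename": filename,
        "size": len(data),
        "total_chunks": total_chunks,
    })
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]  # sent as a binary frame
    yield json.dumps({"type": "end"})
```

Because `total_chunks` is sent up front, the receiver can report upload progress as acknowledged chunks divided by `total_chunks`.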
Processing queue
The service queues videos to process one at a time, preventing resource exhaustion:
images/main/jobs.py
images/main/upload.py
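One common way to get one-at-a-time processing is a single worker task draining an `asyncio.Queue`. The sketch below illustrates that pattern under that assumption. It is not the service's own queue implementation, and `worker` and `run_jobs` are hypothetical names.

```python
import asyncio


async def worker(queue: asyncio.Queue, process, results: list) -> None:
    """Process queued jobs strictly one at a time."""
    while True:
        job = await queue.get()
        try:
            results.append(await process(job))
        except Exception as exc:
            # Record the failure so one bad video does not stall the queue.
            results.append(exc)
        finally:
            queue.task_done()


async def run_jobs(jobs, process) -> list:
    """Queue all jobs, wait for the single worker to finish, return results."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(worker(queue, process, results))
    for job in jobs:
        queue.put_nowait(job)
    await queue.join()  # returns once every job has been marked done
    task.cancel()       # the worker loops forever, so stop it explicitly
    await asyncio.gather(task, return_exceptions=True)
    return results
```

With a single worker, at most one video is being processed at any moment, and jobs complete in the order they were queued.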
API reference
POST /webhook/box
Receives Box webhook notifications. Automatically registered when Box credentials are configured.
WebSocket /ws/upload
Upload videos via WebSocket with progress updates.
Messages from client:
- {"type": "start", "filename": "video.mp4", "size": 12345, "total_chunks": 10}
- Binary chunk data
- {"type": "end"}
Messages from server:
- {"type": "ready"} - Ready to receive chunks
- {"type": "chunk_ack"} - Chunk received
- {"type": "status", "message": "Processing...", "percent": 50}
- {"type": "result", "success": true, "original_transcript": "...", "english_translation": "..."}
GET /queue/status
Returns current processing queue status.
GET /health
Health check endpoint.
Learn more
Models
Available models for transcription and translation.
Agents
Build agents with custom instructions and tools.
Programming Interfaces
Create APIs for Autonomy applications.
Box Integration
Build voice agents for Box documents.
Video processing fails with memory errors
- Reduce MAX_CHUNK_DURATION to process smaller audio segments.
- Use size: big in your pod configuration for more resources.
- Ensure the video file isn’t corrupted.
Speaker labels are inconsistent
- The speaker matcher works best with clear speaker introductions.
- For videos with many speakers, results may vary.
- Check the matcher’s confidence level in the logs.
Box webhook not receiving events
- Verify the webhook URL is publicly accessible.
- Check Box admin console for webhook status.
- Ensure the Box app has proper permissions.
Translation is slow for long videos
- Translation is chunked at ~15,000 characters per request.
- Consider using a faster model for initial testing.
- Check your zone logs for timeout errors.

