Approach
Voice agents live or die on round-trip latency. I chose Cerebras for sub-100ms LLM inference, Cartesia for neural TTS that streams the first audio chunk in under 200ms, and LiveKit for WebRTC so audio hops server→client without TCP head-of-line blocking. Each layer is individually replaceable without touching the others.
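The component choices imply a rough latency budget. A back-of-envelope check, using only the upper-bound figures quoted above (these are targets from the text, not measurements):

```python
# Latency budget implied by the component targets stated above.
LLM_FIRST_TOKEN_MS = 100   # Cerebras inference target (sub-100ms)
TTS_FIRST_CHUNK_MS = 200   # Cartesia first-audio-chunk target (under 200ms)
TARGET_E2E_MS = 500        # end-to-end goal

# Whatever is left over must absorb streaming transcription plus
# WebRTC transit in both directions.
remaining_ms = TARGET_E2E_MS - (LLM_FIRST_TOKEN_MS + TTS_FIRST_CHUNK_MS)
print(remaining_ms)
```

That leaves roughly 200ms for speech-to-text and network hops combined, which is why avoiding TCP head-of-line blocking on the audio path matters.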
Problem
Consumer voice agents feel robotic because the gap between speech-end and response-start often exceeds a second. The goal was sub-500ms end-to-end latency for a natural, interruption-tolerant conversation.
How I built it
- Piped live mic audio over LiveKit WebRTC into a streaming transcription worker.
- Routed transcripts to Cerebras LLM inference with early-exit streaming.
- Streamed generated tokens into Cartesia TTS and piped audio back over the same LiveKit channel.
- Implemented barge-in detection so the user can interrupt mid-sentence.
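The relay and barge-in logic above can be sketched with plain asyncio. This is a minimal illustration, not the actual implementation: the stub coroutines stand in for the LiveKit/Cerebras/Cartesia SDK calls (all names here are hypothetical), but the control flow, streaming tokens into TTS and cancelling the in-flight response when a new transcript arrives, is the technique described.

```python
import asyncio

async def stub_llm_tokens(transcript: str):
    # Stands in for streaming Cerebras inference: yields tokens as they arrive.
    for token in transcript.split():
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield token

async def stub_tts(token: str) -> bytes:
    # Stands in for a Cartesia TTS call returning an audio chunk.
    return token.encode()

async def respond(transcript: str, audio_out: asyncio.Queue) -> None:
    """Stream LLM tokens into TTS and push audio chunks back to the client."""
    async for token in stub_llm_tokens(transcript):
        await audio_out.put(await stub_tts(token))

async def agent(transcripts: asyncio.Queue, audio_out: asyncio.Queue) -> None:
    """Barge-in: each completed transcript cancels any in-flight response."""
    current = None
    while True:
        transcript = await transcripts.get()
        if transcript is None:          # sentinel: drain and shut down
            if current:
                await current
            return
        if current and not current.done():
            current.cancel()            # user interrupted mid-sentence
            try:
                await current
            except asyncio.CancelledError:
                pass
        current = asyncio.create_task(respond(transcript, audio_out))

async def main() -> list:
    transcripts, audio_out = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(agent(transcripts, audio_out))
    await transcripts.put("first reply")      # gets barged in on
    await transcripts.put("actual question")  # interrupts the first
    await transcripts.put(None)
    await task
    chunks = []
    while not audio_out.empty():
        chunks.append(audio_out.get_nowait())
    return chunks

chunks = asyncio.run(main())
print(chunks)
```

Because the second transcript arrives before the first response starts streaming, the first task is cancelled and only the second reply's audio reaches the output queue. In the real system the cancel call would also flush any TTS audio already buffered client-side.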
Outcome
- Sub-500ms measured round-trip latency in early testing.
- Clean, swappable stack: each component is a separate module.
- Foundation for press-to-talk integration on the Sentry portfolio itself.
Stack
Cerebras, Cartesia, LiveKit, Python, FastAPI