Approach
The core architectural choice was decoupling inference from session state: each model (MuseTalk, CosyVoice, faster-whisper, Deep-Live-Cam) runs as an independent GPU-backed microservice behind a thin FastAPI gateway. This isolates latency-critical paths, lets each model scale independently on RunPod, and makes the platform resilient — one avatar service degrading can't stall the rest.
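The decoupling can be sketched as a small routing layer: the gateway holds a registry of per-model base URLs and resolves each incoming path to its upstream service. This is a minimal sketch with hypothetical service names and ports, not the production RunPod endpoints:

```python
# Sketch of the gateway's service registry and route resolution.
# Service names and URLs are illustrative, not the production values.
from urllib.parse import urljoin

# Each model runs behind its own base URL; the gateway only knows this map.
SERVICES = {
    "avatars": "http://musetalk-svc:8001/",   # MuseTalk lip sync
    "voice":   "http://cosyvoice-svc:8002/",  # CosyVoice cloning / TTS
    "stt":     "http://whisper-svc:8003/",    # faster-whisper streaming STT
    "live":    "http://livecam-svc:8004/",    # Deep-Live-Cam sessions
}

def resolve_upstream(path: str) -> str:
    """Map an incoming API path like '/stt/stream' to its upstream URL.

    Raises KeyError for unknown prefixes so the gateway can return 404
    instead of forwarding blindly.
    """
    prefix, _, rest = path.lstrip("/").partition("/")
    base = SERVICES[prefix]  # KeyError -> 404 at the gateway
    return urljoin(base, rest)
```

Because the map is the only coupling point, a degraded service is one unreachable upstream rather than a failure inside a monolith.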
Problem
The client needed a production real-time avatar and digital-twin platform that could do voice cloning, lip sync, live camera deepfakes, and streaming speech-to-text — all behind a clean API consumable by web and mobile clients. Existing open-source stacks didn't compose cleanly, were poorly documented, and offered no production-grade orchestration.
How I built it
- Designed and documented a 47-endpoint API surface covering avatars, voice, STT, live-cam sessions, and asset management.
- Deployed each model as an independent GPU microservice on RunPod, with pre-warmed instances and autoscaling rules.
- Built the Next.js 14 frontend against the API with streaming SSE for STT and WebSocket channels for live sessions.
- Integrated Cloudflare R2 for cost-efficient asset storage with CDN-backed delivery.
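The STT streaming path pushes partial transcripts to the browser as Server-Sent Events. A hedged sketch of the wire format the frontend consumes — event names and the `partials` iterator are illustrative stand-ins, not the exact faster-whisper integration:

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event frame: a named event, a JSON
    payload, and the blank line that terminates a frame per the SSE spec."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def transcript_stream(partials):
    """Yield SSE frames for each partial transcript, then a final 'done'
    frame. `partials` stands in for the faster-whisper segment iterator."""
    for i, text in enumerate(partials):
        yield sse_event("partial", {"seq": i, "text": text})
    yield sse_event("done", {})
```

On the Next.js side this maps onto a plain `EventSource` listener per event name, which is why SSE was preferred over WebSockets for the one-directional STT stream.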
Outcome
- A production-ready avatar platform with full API docs, deployable on demand.
- GPU inference orchestration tuned to minimize cold starts.
- End-to-end session flows proven on real client use cases.
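Minimizing cold starts came down to keeping a small pool of pre-warmed workers per model and scaling it with load. A simplified sketch of a pool-sizing rule under assumed thresholds — the constants are illustrative, not the production autoscaling config:

```python
import math

def target_warm_workers(active_sessions: int,
                        min_warm: int = 1,
                        headroom: float = 0.25,
                        max_workers: int = 8) -> int:
    """Keep at least `min_warm` GPU workers hot, plus ~25% headroom over
    current load, capped at `max_workers`. New sessions then land on a
    warm worker instead of paying a model cold start."""
    desired = max(min_warm, math.ceil(active_sessions * (1 + headroom)))
    return min(desired, max_workers)
```

The floor of one warm worker is what keeps the first request of the day fast; the headroom term absorbs bursts without over-provisioning idle GPUs.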
Stack