Approach
The core architectural choice was decoupling inference from session state: each model (MuseTalk, CosyVoice, faster-whisper, Deep-Live-Cam) runs as an independent GPU-backed microservice behind a thin FastAPI gateway. This isolates latency-critical paths, lets each model scale independently on RunPod, and makes the platform resilient — one avatar service degrading can't stall the rest.
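The decoupling can be sketched as a small routing layer: the gateway holds a registry of per-model base URLs and resolves each incoming path to its upstream service. This is a minimal sketch with hypothetical service names and ports, not the production RunPod endpoints:

```python
# Sketch of the gateway's service registry and route resolution.
# Service names and URLs are illustrative, not the production values.
from urllib.parse import urljoin

# Each model runs behind its own base URL; the gateway only knows this map.
SERVICES = {
    "avatars": "http://musetalk-svc:8001/",   # MuseTalk lip sync
    "voice":   "http://cosyvoice-svc:8002/",  # CosyVoice cloning / TTS
    "stt":     "http://whisper-svc:8003/",    # faster-whisper streaming STT
    "live":    "http://livecam-svc:8004/",    # Deep-Live-Cam sessions
}

def resolve_upstream(path: str) -> str:
    """Map an incoming API path like '/stt/stream' to its upstream URL.

    Raises KeyError for unknown prefixes so the gateway can return 404
    instead of forwarding blindly.
    """
    prefix, _, rest = path.lstrip("/").partition("/")
    base = SERVICES[prefix]  # KeyError -> 404 at the gateway
    return urljoin(base, rest)
```

Because the map is the only coupling point, a degraded service is one unreachable upstream rather than a failure inside a monolith.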
Problem
The client needed a production real-time avatar and digital-twin platform that could do voice cloning, lip sync, live camera deepfakes, and streaming speech-to-text — all behind a clean API consumable by web and mobile clients. Existing open-source stacks didn't compose cleanly, were poorly documented, and offered no production-grade orchestration.
How I built it
- Designed and documented a 47-endpoint API surface covering avatars, voice, STT, live-cam sessions, and asset management.
- Deployed each model as an independent GPU microservice on RunPod, with pre-warmed instances and autoscaling rules.
- Built the Next.js 14 frontend against the API with streaming SSE for STT and WebSocket channels for live sessions.
- Integrated Cloudflare R2 for cost-efficient asset storage with CDN-backed delivery.
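The STT streaming path pushes partial transcripts to the browser as Server-Sent Events. A hedged sketch of the wire format the frontend consumes — event names and the `partials` iterator are illustrative stand-ins, not the exact faster-whisper integration:

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event frame: a named event, a JSON
    payload, and the blank line that terminates a frame per the SSE spec."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def transcript_stream(partials):
    """Yield SSE frames for each partial transcript, then a final 'done'
    frame. `partials` stands in for the faster-whisper segment iterator."""
    for i, text in enumerate(partials):
        yield sse_event("partial", {"seq": i, "text": text})
    yield sse_event("done", {})
```

On the Next.js side this maps onto a plain `EventSource` listener per event name, which is why SSE was preferred over WebSockets for the one-directional STT stream.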
Outcome
- A production-ready avatar platform with full API docs, deployable on demand.
- GPU inference orchestration tuned to minimize cold starts.
- End-to-end session flows proven on real client use cases.
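Minimizing cold starts came down to keeping a small pool of pre-warmed workers per model and scaling it with load. A simplified sketch of a pool-sizing rule under assumed thresholds — the constants are illustrative, not the production autoscaling config:

```python
import math

def target_warm_workers(active_sessions: int,
                        min_warm: int = 1,
                        headroom: float = 0.25,
                        max_workers: int = 8) -> int:
    """Keep at least `min_warm` GPU workers hot, plus ~25% headroom over
    current load, capped at `max_workers`. New sessions then land on a
    warm worker instead of paying a model cold start."""
    desired = max(min_warm, math.ceil(active_sessions * (1 + headroom)))
    return min(desired, max_workers)
```

The floor of one warm worker is what keeps the first request of the day fast; the headroom term absorbs bursts without over-provisioning idle GPUs.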
Stack