A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.
Voice AI has moved from research demos into shipping product in under three years. The modern stack is converging around a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.
Learning resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced (blogs, podcasts, and communities in sections 17-19 are intentionally left untagged). The list prefers free official docs and vendor-neutral guides, and flags where authors have commercial interests.
Read top-to-bottom if you're brand new. The recommended path:
- Foundations → understand the pipeline and latency budget
- Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
- Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
- Transport & telephony → connect to a real phone number
- Evaluation, production, ethics → make it safe enough to ship
If you want this material in a tighter, opinionated, production-grade form, I wrote the Voice Agents Handbook: building production voice AI with LiveKit, plus appendices on choosing your stack and the LiveKit ecosystem beyond agents. Available now on Kindle (and in paperback).
The README you're reading collects the field's best free resources. The book is the curated path through them, with the patterns I've used shipping voice agents for tradespeople, lawyers, and immigration consultants.
Disclosure: I maintain this repo and authored the handbook. Free sample (Introduction + Chapter 1) at handbook.mahimai.ca.
The 21 sections:
1. Foundational concepts and learning paths
2. Frameworks and orchestration platforms
3. Speech-to-text (STT / ASR)
4. Text-to-speech (TTS)
5. LLMs for voice and real-time AI
6. Voice activity detection and turn-taking
7. Audio enhancement and noise suppression
8. WebRTC fundamentals
9. Telephony and SIP
10. Tutorials and hands-on projects
11. GitHub starter repos and awesome lists
12. Datasets and benchmarks
13. Beginner-accessible research papers
14. Evaluation and testing
15. Production, deployment, and scaling
16. Ethics, safety, and regulation
17. Blogs and newsletters
18. Podcasts
19. Communities
20. Conferences and events
21. Hackathons and competitions
Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.
- Voice AI & Voice Agents: An Illustrated Primer: Kwindla Hultman Kramer's free, regularly-updated long-form primer. The de facto textbook for the field. 🟢 Beginner
- Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained (LiveKit): Visual walkthrough of streaming patterns, turn detection, and where latency accumulates. 🟢 Beginner
- Everything You Need to Know About Voice AI Agents (Deepgram): End-to-end primer covering feature extraction, ASR, LLM reasoning, and synthesis. 🟢 Beginner
- AI Voice Agents (LiveKit Docs): The canonical "what is a voice agent" reference, covering the Agents framework, sessions, and the STT-LLM-TTS pipeline vs realtime model split. 🟢 Beginner
- Core Latency in AI Voice Agents (Twilio): Visual explanation of end-of-turn detection, silence thresholds, and smart endpointing. 🟢 Beginner
- Advice on Building Voice AI in June 2025 (Daily.co): Practical P50/P95 latency-budget guidance from Pipecat's creators. 🟡 Intermediate
- How Intelligent Turn Detection Solves the Biggest Challenge in Voice Agents (AssemblyAI): Endpointing is the most underestimated problem; this is the clearest deep-dive. 🟡 Intermediate
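The latency budget these resources keep returning to can be sketched as a sum of per-stage delays, summarised by percentile across turns. The stage names and millisecond figures below are illustrative assumptions, not measurements taken from any of the linked guides, and the percentile uses the simple nearest-rank method.

```python
import math

# Hypothetical per-turn stage latencies in milliseconds (illustrative only).
TURNS = [
    {"endpointing": 300, "stt": 120, "llm_ttft": 250, "tts_ttfb": 90, "network": 60},
    {"endpointing": 280, "stt": 110, "llm_ttft": 400, "tts_ttfb": 85, "network": 70},
    {"endpointing": 350, "stt": 130, "llm_ttft": 220, "tts_ttfb": 95, "network": 55},
    {"endpointing": 300, "stt": 140, "llm_ttft": 900, "tts_ttfb": 100, "network": 65},
    {"endpointing": 260, "stt": 100, "llm_ttft": 310, "tts_ttfb": 80, "network": 60},
]


def voice_to_voice_ms(turn):
    """Total delay between the user finishing and the agent's first audio."""
    return sum(turn.values())


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


totals = [voice_to_voice_ms(turn) for turn in TURNS]
p50 = percentile(totals, 50)
p95 = percentile(totals, 95)

if __name__ == "__main__":
    print(f"P50 voice-to-voice: {p50} ms, P95: {p95} ms")
```

Tracking P95 alongside P50 matters because a single slow stage (here the fourth turn's LLM first token) dominates the tail even when the median looks healthy.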
The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.
- LiveKit Agents: Voice AI Quickstart: Working assistant in <10 min via Python or TypeScript, runs on top of WebRTC. 🟢 Beginner
- Pipecat: Quickstart: Scaffolds a Deepgram + OpenAI + Cartesia pipeline via the Pipecat CLI (`uv tool install pipecat-ai-cli`, then `pipecat init quickstart`); talk to it in the browser in ~5 minutes. 🟢 Beginner
- Ultravox (fixie-ai/ultravox): Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced
- Vapi: Quickstart: Dashboard-first; ship an agent on a free US phone number in under 5 minutes. 🟢 Beginner
- Retell AI: Introduction & Quickstart: Phone-agent platform with $10 free credit on signup. 🟢 Beginner
- Bland AI: Send Your First Phone Call: Minimal API tutorial for placing your first AI phone call. 🟢 Beginner
- ElevenLabs Agents: Quickstart: Build and embed a voice agent widget on any website in 5 minutes (formerly branded "Conversational AI," now ElevenAgents). 🟢 Beginner
- OpenAI Realtime API: Guide: Official guide to `gpt-realtime` (now GA) over WebRTC, WebSockets, or SIP. 🟡 Intermediate
- Google Gemini Live API: Overview: Low-latency, bidirectional voice + vision agents with barge-in and tool use, on Gemini 3 native audio. 🟡 Intermediate
- Twilio ConversationRelay: WebSocket bridge that handles STT/TTS so you focus on LLM logic; works with any LLM. 🟡 Intermediate
- Vapi vs Pipecat vs LiveKit (AssemblyAI): Architecture-focused comparison of pipeline control and transport choices. 🟡 Intermediate
- 11 Voice Agent Platforms Compared (Softcery): Broad market map with use-case recommendations. 🟢 Beginner
- Best Voice Agent Stack (Hamming AI): Buy-vs-build framework with concrete cost, latency, and time-to-launch numbers. 🟡 Intermediate
Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper-derivatives cover most use cases. (All-in-one ASR + end-of-turn models like Deepgram Flux are covered under turn-taking.)
- Deepgram Nova-3: STT benchmarks: Primer on WER, latency, and cost alongside Deepgram's product reference; Nova-3 now spans 36+ languages with multilingual code-switching. 🟢 Beginner
- AssemblyAI Universal-3 Pro: Streaming STT walkthrough that doubles as a function-calling tutorial; Universal-3 Pro is the current flagship, adding natural-language keyterm prompting. 🟡 Intermediate
- OpenAI Whisper / gpt-4o-transcribe API docs: Easiest cloud STT if you already use OpenAI. 🟢 Beginner
- Soniox multilingual benchmark: Public WER comparison across 60 languages. 🟢 Beginner
- Cartesia Ink 2: Streaming STT paired with Sonic TTS for a single-vendor low-latency stack. 🟢 Beginner
- openai/whisper: The original repo and the de facto starting point for any DIY ASR project. 🟢 Beginner
- SYSTRAN/faster-whisper: CTranslate2 reimplementation up to 4× faster with INT8; recommended for self-hosted Whisper. 🟡 Intermediate
- NVIDIA NeMo (Parakeet / Canary): Top-of-leaderboard open ASR models with streaming inference recipes. 🔴 Advanced
- Moonshine: Tiny on-device ASR (tiny 27M / base 61M params); v2 adds an ergodic streaming encoder built for latency-critical live transcription on edge devices. 🟡 Intermediate
- Open ASR Leaderboard (HuggingFace): Community leaderboard across 11 datasets: your reference for open-source picks. 🟢 Beginner
- Artificial Analysis: Speech-to-Text: Independent leaderboard ranking 48+ STT providers by WER, speed, and cost. 🟢 Beginner
- Best Speech-to-Text Providers in 2026 (Coval): Independent benchmark across 14 providers (WER, latency, end-of-turn, cost), with guidance on testing against your own traffic. 🟡 Intermediate
- Best Speech-to-Text APIs in 2026 (Deepgram): Provider comparison guide; note the commercial author. 🟢 Beginner
- Streaming vs Batch ASR (Arun Baby): Engineer-friendly explainer of RNN-T and Conformer streaming architectures. 🟡 Intermediate
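Most of the benchmarks above rank providers by word error rate (WER), so it helps to see the metric computed once by hand. The following is a minimal sketch: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It only lowercases and splits on whitespace, which is a simplifying assumption; published leaderboards typically apply heavier text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("book a table for two", "book the table for you"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.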
Latency, not raw quality, is what kills voice agents: prioritize providers offering true streaming with first-byte under 200 ms.
- ElevenLabs Docs: Industry-leading quality, voice cloning, and Agents platform in one SDK. 🟢 Beginner
- Cartesia Sonic Quickstart: Sonic 3.5, sub-90 ms first-byte latency, designed specifically for voice agents. 🟢 Beginner
- Deepgram Aura-2: Low-latency streaming TTS (Aura-2) that pairs cleanly with Deepgram STT. 🟢 Beginner
- OpenAI TTS (gpt-4o-mini-tts): Easiest plug-in TTS for the OpenAI stack. 🟢 Beginner
- Artificial Analysis: TTS leaderboard: ELO, price, and speed comparison covering Rime, PlayHT, Hume, Inworld, and others. 🟢 Beginner
- Chatterbox (resemble-ai/chatterbox): Resemble AI's MIT-licensed TTS, which the vendor reports beating ElevenLabs in blind preference tests; ~5 s zero-shot voice cloning, emotion-exaggeration control, and a built-in PerTh watermark. Turbo variant (350M) hits sub-150 ms first audio; Multilingual (V3, 0.5B) covers 23+ languages. 🟡 Intermediate
- Kokoro 82M: Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
- Piper (OHF-Voice/piper1-gpl): Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
- Coqui TTS (idiap fork): Maintained fork of Coqui-TTS / XTTS v2; still battle-tested, though Chatterbox now leads on zero-shot cloning quality. 🟡 Intermediate
- Orpheus-TTS: Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
- Sesame CSM: Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced
- Streaming TTS for Low-Latency Agents (Picovoice): Clear taxonomy of single, output-streaming, and dual-streaming TTS. 🟡 Intermediate
- Ethics of Voice Cloning & Deepfakes (Deepgram): Vendor-neutral discussion of misuse, regulation, and developer responsibility. 🟢 Beginner
A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms TTFT changes the conversation feel entirely.
- Groq: LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. 🟢 Beginner
- Cerebras Inference: Wafer-scale chip inference with very high throughput on Llama models. 🟢 Beginner
- SambaNova Cloud: Reconfigurable Dataflow inference; stable throughput at low latency. 🟢 Beginner
- OpenAI Realtime API guide: Flagship S2S product with WebRTC/WebSocket transport (`gpt-realtime`, now GA). 🟡 Intermediate
- Google Gemini Live: Real-time multimodal voice/video with barge-in and broad language support, on Gemini 3 native audio. 🟡 Intermediate
- Moshi (kyutai-labs): Open full-duplex speech-text foundation model (~200 ms, Mimi codec). Kyutai's broader stack now includes Unmute (cascaded STT+LLM+TTS with tool use), Kyutai STT/TTS, and Hibiki (streaming translation). 🔴 Advanced
- Speech-to-Speech Models in 2026: Three Architectural Bets (Krzysztof Sopyla): Vendor-neutral comparison of full-duplex (Moshi), near-duplex multimodal (Qwen-Omni), and cascade approaches, with FullDuplexBench numbers and tradeoffs. 🟡 Intermediate
- OpenAI Voice Agents Guide: Compares chained vs S2S architectures with prompt and tool best practices. 🟢 Beginner
- ElevenLabs Voice Agent Prompting Guide: Production-grade prompt structure tuned for voice; vendor-neutral lessons. 🟡 Intermediate
- Voice AI Prompt Engineering Guide (VoiceInfra): Explains why voice prompts must be 60–70% shorter than chat prompts, with templates. 🟢 Beginner
- Tool Definition and Use for Voice Agents (LiveKit Docs): Defining `@function_tool` tools and raw-schema tools inside a voice agent. 🟡 Intermediate
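Time to first token (TTFT) is easy to measure against any streaming client, because it is just the wall-clock gap between sending the request and receiving the first non-empty chunk. The sketch below is provider-agnostic: `measure_ttft` accepts any iterable of text chunks, and `fake_llm_stream` is a hypothetical stand-in for a real streaming response, not any vendor's API.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], str]:
    """Return (seconds until the first non-empty chunk, full concatenated text)."""
    start = time.perf_counter()
    ttft: Optional[float] = None
    parts: List[str] = []
    for chunk in stream:
        if ttft is None and chunk:
            ttft = time.perf_counter() - start
        parts.append(chunk)
    return ttft, "".join(parts)


def fake_llm_stream(first_token_delay_s: float, chunks: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response (assumption for the demo)."""
    time.sleep(first_token_delay_s)  # simulates queueing plus prefill time
    for chunk in chunks:
        yield chunk


if __name__ == "__main__":
    ttft, text = measure_ttft(fake_llm_stream(0.05, ["Sure,", " one", " moment."]))
    print(f"TTFT: {ttft * 1000:.0f} ms, reply: {text!r}")
```

Start the timer before the first iteration rather than inside the loop: generators do no work until they are consumed, so the delay only shows up once the loop asks for the first chunk.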
Pure VAD is no longer enough: modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.
- Silero VAD: MIT-licensed pre-trained VAD; <1 ms per chunk on CPU. The de facto VAD inside LiveKit and Pipecat. 🟢 Beginner
- py-webrtcvad: Python bindings for Google's classic WebRTC VAD; lightweight baseline. 🟢 Beginner
- LiveKit Turn Detector: blog post: How a small transformer-based EOU model complements VAD with semantic context. 🟡 Intermediate
- LiveKit turn-detector model on HuggingFace: Open-weights multilingual EOU model running ONNX on CPU in under 500 MB. 🟡 Intermediate
- Deepgram Flux: All-in-one conversational STT with built-in end-of-turn detection (median EOT <300 ms), integrated with Deepgram's Voice Agent API; collapses STT and turn detection into a single model. 🟡 Intermediate
- Pipecat Smart Turn v3: Whisper-Tiny-based audio semantic VAD with fast CPU inference (~12 ms on a standard instance per the v3 repo), BSD-2 licensed. 🟡 Intermediate
- pipecat-ai/smart-turn: Repo with model code, training scripts, and integration examples (~8M params, Whisper-Tiny base). 🟡 Intermediate
- Krisp Turn-Taking: Commercial turn-taking model used alongside any STT/LLM/TTS stack. 🟡 Intermediate
- The Complete Guide to AI Turn-Taking (Tavus): Reader-friendly overview of why pure VAD fails in real conversations. 🟢 Beginner
- Tackling Turn Detection in Voice AI (Notch): Engineer-first walkthrough combining VAD probability, volume, and TTS markers. 🟡 Intermediate
- Evaluating End-of-Turn Detection Models (Deepgram): Methodology plus a head-to-head of Flux, Pipecat Smart Turn, and LiveKit EOU; note the commercial author. 🟡 Intermediate
- ai-coustics VAD: VAD bundled with real-time speech enhancement, noise suppression, and voice isolation in a single audio preprocessing SDK; useful when you want cleanup and turn-taking signals from the same component. 🟢 Beginner
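The idea of pairing acoustic VAD with a semantic end-of-utterance (EOU) model can be sketched as a small decision rule. This is a toy illustration, not how LiveKit, Pipecat, or Deepgram Flux implement it: the `TurnSignals` fields, the thresholds, and the timeout fallback are all assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    silence_ms: float       # trailing silence reported by an acoustic VAD
    eou_probability: float  # end-of-utterance probability from a semantic model


def should_respond(
    signals: TurnSignals,
    min_silence_ms: float = 200.0,   # assumed: shorter gaps are treated as mid-sentence pauses
    max_silence_ms: float = 1500.0,  # assumed: after this long, respond regardless of the model
    eou_threshold: float = 0.6,      # assumed: semantic confidence needed in between
) -> bool:
    """Decide whether the agent should take the turn."""
    if signals.silence_ms < min_silence_ms:
        return False  # the user is still speaking or only paused briefly
    if signals.silence_ms >= max_silence_ms:
        return True   # timeout fallback so the agent never hangs on a low score
    return signals.eou_probability >= eou_threshold


if __name__ == "__main__":
    # "I'd like to book a table for..." followed by a 400 ms pause: low EOU score, keep waiting.
    print(should_respond(TurnSignals(silence_ms=400, eou_probability=0.2)))
```

A silence-only VAD corresponds to dropping the middle branch, which is why it either interrupts slow speakers (short timeout) or feels sluggish (long timeout).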
The audio reaching your VAD and STT is often noisy, reverberant, or mixed with background voices. Cleaning the signal before the rest of the pipeline is frequently the difference between an agent that ships and one that frustrates users in real-world conditions (cars, cafés, call centres). In 2026 every major voice-AI vendor ships a deep-learning suppressor on top of WebRTC's classic noise-suppression chain.
- ai-coustics: Real-time speech enhancement SDK covering noise cancellation, voice isolation, and VAD; on-device and cloud deployment. See the docs and developer platform. 🟢 Beginner
- Krisp SDK: Commercial-grade real-time noise and background-voice cancellation; the de facto standard for voice comms (Python, Node.js, Go, C++ SDKs). LiveKit's background voice cancellation and Pipecat Cloud both build on Krisp. Enterprise access via contact form. 🟢 Beginner
- DeepFilterNet (Rikorose/DeepFilterNet): Open-source, low-complexity real-time speech enhancement for full-band audio; designed to run on embedded devices. The strongest actively-developed OSS noise suppressor. 🟡 Intermediate
- RNNoise (xiph/rnnoise): Classic hybrid DSP + deep-learning noise suppression; a tiny, well-understood baseline, but no longer actively maintained. 🟡 Intermediate
- Koala Noise Suppression (Picovoice): On-device, cross-platform voice isolation with self-serve access (browser, mobile, desktop, Raspberry Pi). 🟢 Beginner
- Noise Suppression Guide 2026 (Picovoice): Algorithms, intelligibility metrics (SII / STI / STOI), and implementation tradeoffs; note the commercial author. 🟡 Intermediate
WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.
- MDN WebRTC API: Authoritative free reference for `RTCPeerConnection`, `getUserMedia`, and signaling. 🟢 Beginner
- MDN: Introduction to WebRTC Protocols: Beginner-friendly explanation of ICE, STUN, TURN, and SDP. 🟢 Beginner
- WebRTC.org Getting Started: Official Google-maintained intro, splitting WebRTC into media-capture and connectivity. 🟢 Beginner
- GetStream: WebRTC for the Brave: Free multi-module tutorial covering networking basics through advanced topics. 🟢 Beginner
- Why WebRTC Beats WebSockets for Voice AI (LiveKit): 2025 explainer aimed at AI builders, comparing transports in plain English. 🟡 Intermediate
- Daily Docs: Intro to Video Architecture (P2P vs SFU): One of the clearest beginner write-ups of P2P vs SFU. 🟢 Beginner
- P2P, SFU, MCU, Hybrid: WebRTC Architecture Guide (Forasoft): Vendor-neutral 2026 breakdown of the four architectures with current OSS tooling (mediasoup, Janus, Jitsi). 🟡 Intermediate
- Agora: How WebRTC Works: Side-by-side WebRTC vs WebSockets walkthrough with signaling diagrams. 🟢 Beginner
The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.
- Twilio Programmable Voice: TwiML, Voice API, and PSTN connectivity in one hub; the default starting point. 🟢 Beginner
- Twilio: Voice AI Assistant with OpenAI Realtime + Python: Step-by-step junior-friendly tutorial wiring Twilio Media Streams to an LLM. 🟢 Beginner
- Twilio SIP Quickstart: Clearest beginner explainer of SIP basics, SIP Domains, and softphone setup. 🟢 Beginner
- Telnyx Voice API: Strong Twilio alternative with WebSocket media streaming and AI Assistant tooling. 🟢 Beginner
- Telnyx: How to Set Up a SIP Trunk: Friendly walkthrough of SIP trunking architecture, codecs, and authentication. 🟢 Beginner
- Plivo Voice API Documentation: XML call control and audio-streaming integrations for AI agents. 🟢 Beginner
- SignalWire Voice Docs: Built on FreeSWITCH; SWML, TwiML-compatible API, and an AI Agents SDK. 🟡 Intermediate
- LiveKit SIP Primer: Best diagram of how a call flows from PSTN → trunk → SIP service → agent. 🟢 Beginner
- LiveKit SIP Trunk Setup: Practical guide for wiring Twilio/Telnyx/Plivo/Wavix/Sinch trunks into LiveKit. 🟡 Intermediate
- Pipecat Telephony Overview: Differences between WebSocket-based telephony and SIP-based call control. 🟡 Intermediate
Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.
- LiveKit Voice AI Quickstart: Official 10-minute walkthrough in Python or Node with starter templates. 🟢 Beginner
- Build Your First AI Voice Agent in Python (LiveKit): End-to-end Python tutorial covering streaming, latency, and deployment. 🟢 Beginner
- Pipecat Quickstart: Build and deploy a Deepgram + OpenAI + Cartesia bot via the Pipecat CLI in roughly 10 minutes. 🟢 Beginner
- How to Build a Real-Time Voice Agent with Pipecat (AssemblyAI): Production-oriented walkthrough including local testing and Pipecat Cloud deployment. 🟡 Intermediate
- Build a Voice Agent with LiveKit (AssemblyAI): End-to-end walkthrough wiring LiveKit Agents + AssemblyAI Universal-3 Pro + Cartesia, run locally then on the Agents Playground. 🟡 Intermediate
- Deepgram: Build a Voice AI Agent: Step-by-step guide wiring Deepgram STT, GPT, and Aura TTS. 🟢 Beginner
- Build a Voice Assistant with Twilio ConversationRelay + LiteLLM: Provider-agnostic tutorial supporting OpenAI, Anthropic, or DeepSeek. 🟡 Intermediate
- freeCodeCamp: Build Advanced AI Agents (LiveKit, Exa, LangChain): Free 3-part video course covering interactive voice agents end-to-end. 🟢 Beginner
- freeCodeCamp: Build a Voice AI Agent with Open-Source Tools: Hands-on local stack covering open-source STT, a local LLM, and system TTS, plus the cascaded vs end-to-end tradeoff. 🟡 Intermediate
Clone these instead of writing boilerplate from scratch.
- livekit/agents: The flagship open-source Python/Node framework for production voice agents (tip: pair it with the LiveKit Docs MCP server and Agent Skill for AI-assisted builds). 🟢 → 🔴
- pipecat-ai/pipecat: Vendor-neutral framework with 40+ STT/LLM/TTS service plugins. 🟢 → 🔴
- livekit-examples/agent-starter-python: Production-ready starter with Dockerfile, eval suite, turn detector, and core plugins. 🟢 Beginner
- livekit-examples (org): Official collection of LiveKit Python/React/Swift/Android starters. 🟢 Beginner
- pipecat-ai/pipecat-examples: Sample apps for push-to-talk, websocket, telephony, and multimodal use cases. 🟢 → 🟡
- elevenlabs/elevenlabs-examples: Runnable Next.js and Python examples for TTS, STT, and real-time agents. 🟢 Beginner
- kwindla/macos-local-voice-agents: Pipecat example hitting sub-800 ms voice-to-voice latency entirely on M-series Macs. 🟡 Intermediate
- zzw922cn/awesome-speech-recognition-speech-synthesis-papers: Comprehensive curated index of ASR, TTS, voice conversion, and speech-LLM papers. 🟡 Intermediate
- wildminder/awesome-ai-voice: Actively maintained 2026 list of open-source TTS, voice-cloning, and audio/music-generation models. 🟢 Beginner
You'll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.
- LibriSpeech ASR Corpus: ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
- Mozilla Common Voice: Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
- Common Voice on HuggingFace: One-line `load_dataset()` access for hands-on experiments. The official `mozilla-foundation` releases top out around v17; newer corpus versions (up to v22) are hosted on community mirrors. 🟢 Beginner
- Open ASR Leaderboard: Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
- Artificial Analysis: Speech: Independent benchmarks of commercial STT and TTS providers. 🟢 Beginner
- LJSpeech Dataset: ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
- VCTK Corpus: ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
- VoxCeleb (Oxford VGG): Million-utterance "in the wild" dataset for speaker identification and verification. 🟡 Intermediate
These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first: they're unusually approachable.
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (2022): Behind the most popular open ASR model; unusually clear prose for an ML paper. 🟡 Intermediate
- HuggingFace Whisper fine-tuning blog (companion): Hands-on walkthrough that lets you "feel" the Whisper paper in code. 🟢 Beginner
- VITS: Conditional VAE with Adversarial Learning for End-to-End TTS (2021): The single-stage TTS model behind many open-source voice cloners. 🟡 Intermediate
- Tacotron 2: Natural TTS Synthesis (2017): Landmark seq2seq + WaveNet-vocoder paper that made neural TTS sound natural. 🟡 Intermediate
- Conformer: Convolution-augmented Transformer for ASR (2020): The architecture inside NVIDIA Parakeet, Canary, and many leaderboard models. 🟡 Intermediate
- wav2vec 2.0: Self-Supervised Learning of Speech Representations (2020): Showed that pretraining on unlabeled audio drastically reduces labeled-data needs. 🟡 Intermediate
- Common Voice: A Massively-Multilingual Speech Corpus (2020): Short, accessible paper describing how Common Voice is built and validated. 🟢 Beginner
- Moshi: A Speech-Text Foundation Model for Real-Time Dialogue (2024): The first real-time full-duplex spoken LLM; introduces the Mimi codec and the "Inner Monologue" method (time-aligned text before audio tokens). 🔴 Advanced
- Open ASR Leaderboard preprint (2025): Reproducible benchmark of 60+ ASR models across 11 datasets; the modern landscape map. 🟡 Intermediate
- Full-Duplex-Bench: Evaluating Full-Duplex Spoken Dialogue Models on Turn-Taking (2025): A reproducible benchmark for interruption handling and turn-taking in speech-to-speech models. 🟡 Intermediate
You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: a single transcript can pass and fail across runs, so simulation and statistics matter more than fixed test cases.
- Coval: Voice AI Testing Platform: Defines the core voice-agent metrics: TTFB, WER, resolution rate, simulated accents, and interruptions. 🟢 Beginner
- Coval: How to Evaluate Voice Agents (Practical Guide): One of the most cited 2025 guides on probabilistic vs deterministic evaluation. 🟢 Beginner
- Cekura: Metrics Overview: Predefined metrics, instruction-following checks, and simulation framework. 🟢 Beginner
- Cekura: Performance Testing for Voice Agents: Practical 2025 guide on multi-turn simulation and edge-case generation. 🟡 Intermediate
- Hamming AI: Production-focused QA platform with simulation, load testing, and 50+ metrics. 🟡 Intermediate
- Hamming: Voice Agent Evaluation Metrics Guide: Reference of latency percentiles, WER, MOS-style quality, and task completion with formulas. 🟡 Intermediate
- LiveKit: Understand and Improve Agent Latency: Per-turn latency metrics (e2e, LLM TTFT, TTS TTFB) and where to optimize. 🟡 Intermediate
- Twilio: How Do You Know if Your Voice AI Agents Are Working?: Vendor-neutral 2025 guide arguing for business-outcome metrics over raw WER/latency. 🟢 Beginner
- Future AGI simulate-sdk: Open-source voice AI simulation SDK for testing AI agents; generates synthetic conversations for evaluation. 🟡 Intermediate
- Future AGI: Open-source platform to simulate, evaluate, trace, guardrail, and optimize voice and AI agent apps in one feedback loop, with persona-driven simulation and 50+ eval metrics. 🟡 Intermediate
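Because the same scenario can pass on one run and fail on the next, a pass rate is more honest when it is reported with an interval. The sketch below uses the Wilson score interval, a standard choice for binomial proportions with small sample sizes; the run counts are made up for illustration and the approach is a general statistical technique, not a method taken from any platform listed above.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical: the same simulated booking call passed 18 of 20 runs.
    low, high = wilson_interval(18, 20)
    print(f"pass rate 90%, 95% interval {low:.2f} to {high:.2f}")
```

With only 20 runs the interval is wide, which is the practical argument for simulating a scenario many times before treating a change in pass rate as a regression.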
Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.
- LiveKit: Deploy and scale agents on LiveKit Cloud: Real-world write-up on stateful load balancing, autoscaling, and warm pools. 🟡 Intermediate
- LiveKit: Why You Shouldn't Build Voice Agents Directly on Model APIs: Honest breakdown of what raw model APIs don't give you. 🟡 Intermediate
- Latent Space: OpenAI Realtime API: The Missing Manual: Field-tested guide from Pipecat's creator on Realtime API production realities. 🟡 Intermediate
- TWIML: Building Voice AI Agents That Don't Suck (Kwindla Kramer): One-hour discussion on real production architecture and turn-taking. 🟡 Intermediate
- AWS: Voice Agents with Pipecat and Amazon Bedrock: Full architecture walkthrough including latency optimization and Nova Sonic. 🟡 Intermediate
- Deepgram: STT API Pricing Breakdown: Vendor-by-vendor per-minute economics: required reading before signing any contract. 🟢 Beginner
- Sierra: Shipping and Scaling AI Agents: Case-study on Sonos, SiriusXM, and OluKai voice deployments. 🟡 Intermediate
- Sierra: Constellation of Models: How a leading CX company composes 15+ models per agent. 🟡 Intermediate
- LiveKit Agent Observability: Built-in tracing, transcripts, and per-stage latency for LiveKit Cloud. 🟢 Beginner
If you're shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.
- FCC: AI-Generated Voices in Robocalls Illegal (Feb 2024): The landmark TCPA ruling every U.S. voice-agent dev must read. 🟢 Beginner
- EU AI Act: Article 50 (Transparency for Deepfakes & AI Interactions): Authoritative text of EU disclosure rules; transparency obligations apply from 2 August 2026 (systems already on the market before that date have until 2 December 2026 to comply). 🟡 Intermediate
- European Commission: Code of Practice on AI-Generated Content: Official EU implementation guidance on watermarking and labelling; the finalized Code was published on 10 June 2026. 🟡 Intermediate
- FTC: Approaches to Address AI-Enabled Voice Cloning: Plain-English summary of the Voice Cloning Challenge winners and Impersonation Rule. 🟢 Beginner
- FTC: Proposed Rule on AI Impersonation of Individuals (Feb 2024): Direct source on U.S. impersonation-fraud rules covering AI deepfakes. 🟢 Beginner
- Pindrop: Voice Intelligence & Security Report: Industry report documenting the sharp rise in deepfake fraud attempts. 🟢 Beginner
- Voice Cloning Ethics (CAMB.AI): Practical overview of consent frameworks, ELVIS Act, and EU AI Act. 🟢 Beginner
- NCLC: Top Six TCPA/Robocall Developments 2024/2025: Consumer-protection lens on what's actually being enforced. 🟡 Intermediate
Subscribe to two or three to stay current: the field moves quickly.
- LiveKit Blog: Engineering deep-dives on WebRTC, agents framework releases, and production patterns.
- Deepgram Learn: Tutorials on STT/TTS, voice agent design, evals, and pipeline architecture.
- Cartesia Blog: State-space TTS models, Sonic releases, and yearly "State of Voice AI" reports.
- ElevenLabs Blog: Product and research announcements with implementation notes.
- Daily.co Blog (Pipecat): Posts from Pipecat's maintainers covering scaling and feature releases.
- Voice AI & Voice Agents: An Illustrated Primer: Free, regularly-updated long-form primer.
- Voice AI Space: Vendor-neutral hub for the voice AI ecosystem: a curated product and tool directory, the Voice AI Newsroom, tutorials and repos, a jobs board, and community meetups.
- Voice AI Newsletter (Krisp): "Future of Voice AI" interview series with founders.
- Voice AI Weekly (Vapi): Weekly Substack rounding up news, products, and tools.
- Deepgram AI Minds: Founder and builder interviews across the voice AI ecosystem.
- The Future of Voice AI (Krisp): Weekly founder interviews focused on enterprise voice AI architecture.
- TWIML AI Podcast: voice episodes: Strong technical interviews; the Kwin Kramer episode is a great starting point.
- This Week In Voice (Project Voice): News-roundtable format covering conversational AI.
- LiveKit Community Slack: Direct access to maintainers and other agent builders.
- Pipecat Discord: Active community with weekly office hours; invite link from the homepage.
- HuggingFace Discord: #ml-for-audio-and-speech: 200k-member server with strong audio/speech channels.
- Vapi Discord: Builder community for Vapi voice agents; invite from the homepage.
- Retell AI Community: Forum for Retell developers building phone-call voice agents.
- ElevenLabs Discord: Large TTS, voice cloning, and Conversational AI community with daily help threads.
- Deepgram Discord: STT/TTS/Voice Agent API support and build-with-us threads.
- Reddit: r/LocalLLaMA: Active threads on local Whisper/Parakeet, on-device TTS, and end-to-end voice stacks.
- Reddit: r/AI_Agents: General AI-agent community where voice topics surface frequently.
- AI Engineer World's Fair: Biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. The 2026 edition runs 29 June - 2 July 2026 at Moscone West, San Francisco.
- AI Engineer YouTube channel: All World's Fair and Summit talks are posted free; the best library of recent voice-AI talks.
- AI Engineer Summit Online: Voice playlist: Curated playlist including voice-track sessions from leading labs.
- AIEWF 2025 Recap (Latent Space): Written deep-dive into 2025's voice-track talks and major launches.
- VOICE & AI (Modev): Long-running voice technology conference with a broader CX and voicebot focus. 5–7 October 2026.
- Interspeech 2026: Top academic speech-science conference; intimidating but worth knowing, since most landmark papers debut here. Sydney, Australia, 27 September - 1 October 2026.
- ElevenHacks (weekly sprints): Weekly themed challenges with credits and prizes; low-pressure way to ship one project per week. 🟢 Beginner
- AI Engineer World's Fair Hackathon: Co-located with the conference; $10K prizes judged by 3,000+ AI engineers, with a strong voice track. Runs from 27 June, 9:00 AM to 28 June, 5:00 PM (PDT). 🟡 Intermediate
- lablab.ai AI Hackathons: Continuous calendar of short online hackathons frequently sponsored by voice-AI vendors. 🟢 Beginner
- Devpost: Voice AI Hackathons: Centralized search for active voice-AI hackathons; the best way to find what's open right now. 🟢 Beginner
- Week 1: Foundations: Read the LiveKit pipeline post and Voice AI Illustrated Primer (sections 1, 8).
- Week 2: First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 10).
- Week 3: Components: Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
- Week 4: Turn-taking, audio cleanup & telephony: Add Silero VAD, a turn detector, and a speech-enhancement pass; connect a SIP trunk (sections 6, 7, 9).
- Week 5: Production: Add evaluation, observability, and read the FCC/EU AI Act material (sections 14, 15, 16).
- Ongoing: Subscribe to two newsletters, follow Voice AI Space, and join the Voice AI community group on LinkedIn (sections 17, 18, 19).
Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals. See CONTRIBUTING.md for the full contribution guide.
MIT. Fork it, ship it.