mahimairaja/voiceai

Voice AI: a curated learning path for building real-time voice agents

A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.

English · 中文版本

Voice AI has moved from research demos into shipping product in under three years. The modern stack is converging around a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.

Learning resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced (blogs, podcasts, and communities in sections 17-19 are intentionally left untagged). The list prefers free official docs and vendor-neutral guides, and flags resources whose authors have commercial interests.


How to use this list

Read top-to-bottom if you're brand new. The recommended path:

  1. Foundations → understand the pipeline and latency budget
  2. Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
  3. Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
  4. Transport & telephony → connect to a real phone number
  5. Evaluation, production, ethics → make it safe enough to ship
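
The pipeline in steps 1 and 3 above can be pictured as three swappable stages. The sketch below is a toy, non-streaming illustration only: the `STT`, `LLM`, and `TTS` protocol names, the stub classes, and `VoicePipeline` are all hypothetical and do not correspond to the API of LiveKit Agents, Pipecat, or any other framework.

```python
from dataclasses import dataclass
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Hypothetical stand-ins; a real agent would wrap provider SDKs here.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio bytes are already text


class UppercaseLLM:
    def reply(self, text: str) -> str:
        return text.upper()  # pretend this is a model response


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # pretend encoded text is audio


@dataclass
class VoicePipeline:
    stt: STT
    llm: LLM
    tts: TTS

    def handle_turn(self, audio_in: bytes) -> bytes:
        """Run one user turn through STT, then the LLM, then TTS."""
        transcript = self.stt.transcribe(audio_in)
        response = self.llm.reply(transcript)
        return self.tts.synthesize(response)


pipeline = VoicePipeline(stt=EchoSTT(), llm=UppercaseLLM(), tts=BytesTTS())
audio_out = pipeline.handle_turn(b"hello agent")
```

Because each stage only has to satisfy a small interface, replacing one stub with a different implementation leaves the rest untouched, which is the point of the "swap pieces" step. Production frameworks stream between stages rather than waiting for each to finish.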

📘 Companion book: Voice Agents Handbook

If you want this material in a tighter, opinionated, production-grade form, I wrote the Voice Agents Handbook: building production voice AI with LiveKit, plus appendices on choosing your stack and the LiveKit ecosystem beyond agents. Available now on Kindle (and in paperback).

The README you're reading collects the field's best free resources. The book is the curated path through them, with the patterns I've used shipping voice agents for tradespeople, lawyers, and immigration consultants.

Disclosure: I maintain this repo and authored the handbook. Free sample (Introduction + Chapter 1) at handbook.mahimai.ca.


Table of contents

📖 The 21 sections
  1. Foundational concepts and learning paths
  2. Frameworks and orchestration platforms
  3. Speech-to-text (STT / ASR)
  4. Text-to-speech (TTS)
  5. LLMs for voice and real-time AI
  6. Voice activity detection and turn-taking
  7. Audio enhancement and noise suppression
  8. WebRTC fundamentals
  9. Telephony and SIP
  10. Tutorials and hands-on projects
  11. GitHub starter repos and awesome lists
  12. Datasets and benchmarks
  13. Beginner-accessible research papers
  14. Evaluation and testing
  15. Production, deployment, and scaling
  16. Ethics, safety, and regulation
  17. Blogs and newsletters
  18. Podcasts
  19. Communities
  20. Conferences and events
  21. Hackathons and competitions

1. Foundational concepts and learning paths

Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.
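
A latency budget is just arithmetic over the stages that run between the user finishing a sentence and the agent's first audio. The sketch below is a minimal illustration; the stage names and millisecond values are placeholder assumptions for this example, not measurements of any provider.

```python
# Illustrative per-stage latencies in milliseconds. These are placeholder
# assumptions for the sketch, not measurements of any provider.
stages_ms = {
    "end_of_turn_detection": 300,
    "llm_first_token": 300,
    "tts_first_byte": 200,
    "network_round_trip": 100,
}


def voice_to_voice_ms(stages: dict[str, int]) -> int:
    """Sum sequential stage latencies into one response-time estimate."""
    return sum(stages.values())


total = voice_to_voice_ms(stages_ms)

# Percentage of the total spent in each stage, to show where to optimise first.
share = {name: round(100 * ms / total) for name, ms in stages_ms.items()}
```

Replacing the placeholders with your own measurements shows which stage dominates the total.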

2. Frameworks and orchestration platforms

The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.

Open-source frameworks

  • LiveKit Agents: Voice AI Quickstart: Working assistant in <10 min via Python or TypeScript, runs on top of WebRTC. 🟢 Beginner
  • Pipecat: Quickstart: Scaffolds a Deepgram + OpenAI + Cartesia pipeline via the Pipecat CLI (uv tool install pipecat-ai-cli, then pipecat init quickstart); talk to it in the browser in ~5 minutes. 🟢 Beginner
  • Ultravox (fixie-ai/ultravox): Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced

Managed platforms

Realtime / speech-to-speech APIs

  • OpenAI Realtime API: Guide: Official guide to gpt-realtime (now GA) over WebRTC, WebSockets, or SIP. 🟡 Intermediate
  • Google Gemini Live API: Overview: Low-latency, bidirectional voice + vision agents with barge-in and tool use, on Gemini 3 native audio. 🟡 Intermediate
  • Twilio ConversationRelay: WebSocket bridge that handles STT/TTS so you focus on LLM logic; works with any LLM. 🟡 Intermediate

Vendor-neutral comparisons

3. Speech-to-text (STT / ASR)

Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper-derivatives cover most use cases. (All-in-one ASR + end-of-turn models like Deepgram Flux are covered under turn-taking.)

Commercial APIs

Open source

  • openai/whisper: The original repo and the de facto starting point for any DIY ASR project. 🟢 Beginner
  • SYSTRAN/faster-whisper: CTranslate2 reimplementation up to 4× faster with INT8; recommended for self-hosted Whisper. 🟡 Intermediate
  • NVIDIA NeMo (Parakeet / Canary): Top-of-leaderboard open ASR models with streaming inference recipes. 🔴 Advanced
  • Moonshine: Tiny on-device ASR (tiny 27M / base 61M params); v2 adds an ergodic streaming encoder built for latency-critical live transcription on edge devices. 🟡 Intermediate

Benchmarks and explainers

4. Text-to-speech (TTS)

Latency, not raw quality, is what kills voice agents: prioritize providers offering true streaming with first-byte under 200 ms.

Commercial APIs

Open source

  • Chatterbox (resemble-ai/chatterbox): Resemble AI's MIT-licensed TTS that beats ElevenLabs in blind preference tests; ~5 s zero-shot voice cloning, emotion-exaggeration control, and a built-in PerTh watermark. Turbo variant (350M) hits sub-150 ms first audio; Multilingual (V3, 0.5B) covers 23+ languages. 🟡 Intermediate
  • Kokoro 82M: Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
  • Piper (OHF-Voice/piper1-gpl): Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
  • Coqui TTS (idiap fork): Maintained fork of Coqui-TTS / XTTS v2; still battle-tested, though Chatterbox now leads on zero-shot cloning quality. 🟡 Intermediate
  • Orpheus-TTS: Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
  • Sesame CSM: Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced

Streaming and ethics

5. LLMs for voice and real-time AI

A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms TTFT changes the conversation feel entirely.
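
Time to first token (TTFT) can be measured on any streaming response by timing how long the first chunk takes to arrive. The sketch below uses a simulated stream; `fake_token_stream` and `measure_ttft` are hypothetical helpers written for this example, and a real measurement would iterate over a provider's streaming response instead.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(first_delay_s: float, gap_s: float,
                      tokens: list[str]) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(first_delay_s)  # simulated queueing and prefill time
    for i, tok in enumerate(tokens):
        if i:
            time.sleep(gap_s)  # simulated inter-token gap
        yield tok


def measure_ttft(stream: Iterable[str]) -> tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, full_text) for a token stream."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for tok in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token just arrived
        parts.append(tok)
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, "".join(parts)


ttft, total, text = measure_ttft(fake_token_stream(0.05, 0.01, ["Hi", ",", " there"]))
print(f"TTFT {ttft * 1000:.0f} ms, full response {total * 1000:.0f} ms")
```

In a voice pipeline TTFT matters more than total generation time, because TTS can start speaking as soon as the first tokens arrive.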

Low-latency inference

  • Groq: LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. 🟢 Beginner
  • Cerebras Inference: Wafer-scale chip inference with very high throughput on Llama models. 🟢 Beginner
  • SambaNova Cloud: Reconfigurable Dataflow inference; stable throughput at low latency. 🟢 Beginner

Speech-to-speech models

  • OpenAI Realtime API guide: Flagship S2S product with WebRTC/WebSocket transport (gpt-realtime, now GA). 🟡 Intermediate
  • Google Gemini Live: Real-time multimodal voice/video with barge-in and broad language support, on Gemini 3 native audio. 🟡 Intermediate
  • Moshi (kyutai-labs): Open full-duplex speech-text foundation model (~200 ms, Mimi codec). Kyutai's broader stack now includes Unmute (cascaded STT+LLM+TTS with tool use), Kyutai STT/TTS, and Hibiki (streaming translation). 🔴 Advanced
  • Speech-to-Speech Models in 2026: Three Architectural Bets (Krzysztof Sopyla): Vendor-neutral comparison of full-duplex (Moshi), near-duplex multimodal (Qwen-Omni), and cascade approaches, with FullDuplexBench numbers and tradeoffs. 🟡 Intermediate

Voice-specific prompting and tools

6. Voice activity detection and turn-taking

Pure VAD is no longer enough: modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.

  • Silero VAD: MIT-licensed pre-trained VAD; <1 ms per chunk on CPU. The de facto VAD inside LiveKit and Pipecat. 🟢 Beginner
  • py-webrtcvad: Python bindings for Google's classic WebRTC VAD; lightweight baseline. 🟢 Beginner
  • LiveKit Turn Detector: blog post: How a small transformer-based EOU model complements VAD with semantic context. 🟡 Intermediate
  • LiveKit turn-detector model on HuggingFace: Open-weights multilingual EOU model running ONNX on CPU in under 500 MB. 🟡 Intermediate
  • Deepgram Flux: All-in-one conversational STT with built-in end-of-turn detection (median EOT <300 ms), integrated with Deepgram's Voice Agent API; collapses STT and turn detection into a single model. 🟡 Intermediate
  • Pipecat Smart Turn v3: Whisper-Tiny-based audio semantic VAD with fast CPU inference (~12 ms on a standard instance per the v3 repo), BSD-2 licensed. 🟡 Intermediate
  • pipecat-ai/smart-turn: Repo with model code, training scripts, and integration examples (~8M params, Whisper-Tiny base). 🟡 Intermediate
  • Krisp Turn-Taking: Commercial turn-taking model used alongside any STT/LLM/TTS stack. 🟡 Intermediate
  • The Complete Guide to AI Turn-Taking (Tavus): Reader-friendly overview of why pure VAD fails in real conversations. 🟢 Beginner
  • Tackling Turn Detection in Voice AI (Notch): Engineer-first walkthrough combining VAD probability, volume, and TTS markers. 🟡 Intermediate
  • Evaluating End-of-Turn Detection Models (Deepgram): Methodology plus a head-to-head of Flux, Pipecat Smart Turn, and LiveKit EOU; note the commercial author. 🟡 Intermediate
  • ai-coustics VAD: VAD bundled with real-time speech enhancement, noise suppression, and voice isolation in a single audio preprocessing SDK; useful when you want cleanup and turn-taking signals from the same component. 🟢 Beginner
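
One common way to combine the two signals is to let the semantic estimate set how much trailing silence the acoustic VAD must see before the turn is declared over. The sketch below is a simplified illustration, not how any tool listed above implements it: the energy-threshold VAD, the function names, and every threshold are assumptions, and `eou_probability` stands in for the output of an end-of-utterance model.

```python
import math
from typing import Sequence


def frame_is_speech(frame: Sequence[float], rms_threshold: float = 0.02) -> bool:
    """Crude energy VAD: a frame counts as speech if its RMS exceeds a threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > rms_threshold


def required_silence_ms(eou_probability: float,
                        min_ms: int = 200, max_ms: int = 1200) -> int:
    """Wait less when the semantic model thinks the utterance is complete."""
    p = min(max(eou_probability, 0.0), 1.0)  # clamp to [0, 1]
    return round(max_ms - p * (max_ms - min_ms))


def turn_ended(frames: Sequence[Sequence[float]], eou_probability: float,
               frame_ms: int = 20) -> bool:
    """True if trailing silence is long enough given the semantic estimate."""
    trailing_silence = 0
    for frame in reversed(frames):
        if frame_is_speech(frame):
            break
        trailing_silence += frame_ms
    return trailing_silence >= required_silence_ms(eou_probability)


# 200 ms of synthetic "speech" followed by 400 ms of silence (20 ms frames).
speech = [0.1] * 320
silence = [0.0] * 320
frames = [speech] * 10 + [silence] * 20

confident = turn_ended(frames, eou_probability=0.9)   # sentence sounds finished
uncertain = turn_ended(frames, eou_probability=0.1)   # user is probably mid-thought
```

With the same 400 ms pause, the agent replies when the utterance looks complete and keeps waiting when it does not, which is the behaviour pure VAD cannot provide.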

7. Audio enhancement and noise suppression

The audio reaching your VAD and STT is often noisy, reverberant, or mixed with background voices. Cleaning the signal before the rest of the pipeline is frequently the difference between an agent that ships and one that frustrates users in real-world conditions (cars, cafés, call centres). In 2026 every major voice-AI vendor ships a deep-learning suppressor on top of WebRTC's classic noise-suppression chain.

  • ai-coustics: Real-time speech enhancement SDK covering noise cancellation, voice isolation, and VAD; on-device and cloud deployment. See the docs and developer platform. 🟢 Beginner
  • Krisp SDK: Commercial-grade real-time noise and background-voice cancellation; the de facto standard for voice comms (Python, Node.js, Go, C++ SDKs). LiveKit's background voice cancellation and Pipecat Cloud both build on Krisp. Enterprise access via contact form. 🟢 Beginner
  • DeepFilterNet (Rikorose/DeepFilterNet): Open-source, low-complexity real-time speech enhancement for full-band audio; designed to run on embedded devices. The strongest actively-developed OSS noise suppressor. 🟡 Intermediate
  • RNNoise (xiph/rnnoise): Classic hybrid DSP + deep-learning noise suppression; a tiny, well-understood baseline, but no longer actively maintained. 🟡 Intermediate
  • Koala Noise Suppression (Picovoice): On-device, cross-platform voice isolation with self-serve access (browser, mobile, desktop, Raspberry Pi). 🟢 Beginner
  • Noise Suppression Guide 2026 (Picovoice): Algorithms, intelligibility metrics (SII / STI / STOI), and implementation tradeoffs; note the commercial author. 🟡 Intermediate

8. WebRTC fundamentals

WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.

9. Telephony and SIP

The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.

10. Tutorials and hands-on projects

Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.

11. GitHub starter repos and awesome lists

Clone these instead of writing boilerplate from scratch.

12. Datasets and benchmarks

You'll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.

  • LibriSpeech ASR Corpus: ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
  • Mozilla Common Voice: Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
  • Common Voice on HuggingFace: One-line load_dataset() access for hands-on experiments. The official mozilla-foundation releases top out around v17; newer corpus versions (up to v22) are hosted on community mirrors. 🟢 Beginner
  • Open ASR Leaderboard: Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
  • Artificial Analysis: Speech: Independent benchmarks of commercial STT and TTS providers. 🟢 Beginner
  • LJSpeech Dataset: ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
  • VCTK Corpus: ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
  • VoxCeleb (Oxford VGG): Million-utterance "in the wild" dataset for speaker identification and verification. 🟡 Intermediate
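
The two metrics that ASR leaderboards report are simple to compute by hand. The sketch below shows word error rate (WER) as word-level edit distance divided by reference length, plus real-time factor as processing time over audio duration. The function names are chosen for this example, and the normalisation is deliberately minimal (lowercase and whitespace split); published benchmarks apply more careful text normalisation, and some report the inverse of the real-time factor.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic programme).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean the model transcribes faster than real time."""
    return processing_seconds / audio_seconds


# One deleted word ("the") and one substitution ("off" -> "of") over 5 words.
wer = word_error_rate("turn the lights off please", "turn lights of please")
rtf = real_time_factor(processing_seconds=3.0, audio_seconds=60.0)
```

Here `wer` comes out at 0.4 (two errors over five reference words) and `rtf` at 0.05.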

13. Beginner-accessible research papers

These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first: they're unusually approachable.

14. Evaluation and testing

You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: a single transcript can pass and fail across runs, so simulation and statistics matter more than fixed test cases.
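
In practice this means running the same scenario many times and reporting a pass rate with an uncertainty range rather than a single pass/fail. The sketch below uses a Wilson score interval. `simulated_call` is a hypothetical stand-in for one scripted conversation against your agent, and the 0.9 underlying pass rate and 200 runs are arbitrary assumptions.

```python
import math
import random


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


def simulated_call(rng: random.Random, true_pass_rate: float) -> bool:
    """Hypothetical stand-in for running one scripted conversation."""
    return rng.random() < true_pass_rate


rng = random.Random(7)  # fixed seed so the example is repeatable
runs = 200
passes = sum(simulated_call(rng, 0.9) for _ in range(runs))
low, high = wilson_interval(passes, runs)
print(f"pass rate {passes / runs:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

Comparing intervals between two agent versions is more informative than comparing single runs: if the intervals overlap heavily, the difference may just be noise.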

15. Production, deployment, and scaling

Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.

16. Ethics, safety, and regulation

If you're shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.

17. Blogs and newsletters

Subscribe to two or three to stay current: the field moves quickly.

18. Podcasts

19. Communities

20. Conferences and events

  • AI Engineer World's Fair: Biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. The 2026 edition runs 29 June - 2 July 2026 at Moscone West, San Francisco.
  • AI Engineer YouTube channel: All World's Fair and Summit talks are posted free; the best library of recent voice-AI talks.
  • AI Engineer Summit Online: Voice playlist: Curated playlist including voice-track sessions from leading labs.
  • AIEWF 2025 Recap (Latent Space): Written deep-dive into 2025's voice-track talks and major launches.
  • VOICE & AI (Modev): Long-running voice technology conference with a broader CX and voicebot focus. 5-7 October 2026.
  • Interspeech 2026: Top academic speech-science conference; intimidating but worth knowing, since most landmark papers debut here. Sydney, Australia, 27 September - 1 October 2026.

21. Hackathons and competitions

  • ElevenHacks (weekly sprints): Weekly themed challenges with credits and prizes; low-pressure way to ship one project per week. 🟢 Beginner
  • AI Engineer World's Fair Hackathon: Co-located with the conference; $10K prizes judged by 3,000+ AI engineers, with a strong voice track. Runs 27 June, 9:00 AM to 28 June, 5:00 PM (PDT). 🟡 Intermediate
  • lablab.ai AI Hackathons: Continuous calendar of short online hackathons frequently sponsored by voice-AI vendors. 🟢 Beginner
  • Devpost: Voice AI Hackathons: Centralized search for active voice-AI hackathons; the best way to find what's open right now. 🟢 Beginner

Suggested learning path

  1. Week 1: Foundations: Read the LiveKit pipeline post and Voice AI Illustrated Primer (sections 1, 8).
  2. Week 2: First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 10).
  3. Week 3: Components: Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
  4. Week 4: Turn-taking, audio cleanup & telephony: Add Silero VAD, a turn detector, and a speech-enhancement pass; connect a SIP trunk (sections 6, 7, 9).
  5. Week 5: Production: Add evaluation, observability, and read the FCC/EU AI Act material (sections 14, 15, 16).
  6. Ongoing: Subscribe to two newsletters, follow Voice AI Space, and join the Voice AI community group on LinkedIn (sections 17, 18, 19).

Contributing

Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals. See CONTRIBUTING.md for the full contribution guide.

📜 License

MIT. Fork it, ship it.