Context
Mobile app ai-maestro-app shipped v0.4.0 with F007 Phase 1 — hands-free voice conversations with agents. The phone captures the user's speech on-device via iOS Speech Framework (expo-speech-recognition) and sends the final transcript over the existing /companion-ws socket.
The mobile side is wired and shipped. Server-side handling is missing: until this lands, the backend receives voice:transcript messages and ignores them, so calls flow in only one direction (agent → user).
Requested change
companion-ws should accept upstream messages of the form:
{ "type": "voice:transcript", "text": "rebase that branch onto main, fix conflicts, push", "isFinal": true }
…and route the text field into the agent's normal text-input pipeline — the same way a typed user message in the chat panel does today. The agent should respond with its usual speech downstream messages, which the phone speaks aloud.
For now we only send isFinal: true messages from the phone (partials stay on-device). Future phases may send isFinal: false for live partial display on web, but that's not required for this feature to ship.
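A minimal sketch of the server-side handling described above, assuming the socket delivers each upstream frame as a JSON string. The names `VoiceTranscriptMessage`, `isVoiceTranscript`, `handleUpstreamFrame`, and `SubmitUserText` are hypothetical and do not come from the companion-ws codebase. `SubmitUserText` stands in for whatever function the chat panel already uses to submit typed input, and the `source: "voice"` tag is an assumption about how the transcript might be labelled.

```typescript
// Hypothetical shape of the upstream message, mirroring the example payload.
interface VoiceTranscriptMessage {
  type: "voice:transcript";
  text: string;
  isFinal: boolean;
}

// Narrow an arbitrary parsed JSON value to a VoiceTranscriptMessage.
function isVoiceTranscript(value: unknown): value is VoiceTranscriptMessage {
  if (typeof value !== "object" || value === null) return false;
  const msg = value as Record<string, unknown>;
  return (
    msg.type === "voice:transcript" &&
    typeof msg.text === "string" &&
    typeof msg.isFinal === "boolean"
  );
}

// Assumed stand-in for the existing typed-message entry point.
type SubmitUserText = (text: string, meta: { source: "voice" }) => void;

// Returns true only when the frame was a final, non-empty transcript
// and was forwarded to the text-input pipeline.
function handleUpstreamFrame(raw: string, submitUserText: SubmitUserText): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // Not JSON: leave the frame to other handlers.
  }
  if (!isVoiceTranscript(parsed)) return false;

  // The phone only sends finals in Phase 1; skip partials if they ever arrive.
  if (!parsed.isFinal) return false;

  const text = parsed.text.trim();
  if (text.length === 0) return false;

  submitUserText(text, { source: "voice" });
  return true;
}
```

Returning a boolean lets the caller fall through to its existing handlers for any frame that is not a final transcript, so other message types are unaffected.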
Acceptance criteria
- companion-ws parses incoming JSON with type === 'voice:transcript' and extracts text
- … /chat message, including any prompt-engineering wrappers the agent normally gets
- … speech messages keep working end-to-end (no regression)
- … source: 'voice' tag for analytics later
Related
- Mobile-side spec: backlog/F007-talk-to-agent-via-voice.md in 23blocks/ai-maestro-app
- Companion side: see the existing useCompanionWS.ts onSpeech handler and the speech downstream message
- Pairs with: "companion-ws: accept voice:interrupt upstream messages from mobile (F007)", the barge-in counterpart
Out of scope
- Cloud STT (we're using on-device for Phase 1; Phase 3 may add Deepgram backend STT)
- Premium TTS (Phase 2 will swap expo-speech for Cartesia/ElevenLabs)