companion-ws: accept voice:transcript upstream messages from mobile (F007) #357

@jpelaez-23blocks

Context

The mobile app ai-maestro-app shipped v0.4.0 with F007 Phase 1: hands-free voice conversations with agents. The phone captures the user's speech on-device via the iOS Speech framework (expo-speech-recognition) and sends the final transcript over the existing /companion-ws socket.

The mobile side is wired and shipped. Server-side handling is missing: until this lands, the backend receives voice:transcript messages and ignores them, so conversations flow in only one direction (agent → user).

Requested change

companion-ws should accept upstream messages of the form:

{ "type": "voice:transcript", "text": "rebase that branch onto main, fix conflicts, push", "isFinal": true }

…and route the text field into the agent's normal text-input pipeline, the same way a typed user message in the chat panel is handled today. The agent should respond with its usual speech downstream messages, which the phone speaks aloud.

For now the phone only sends isFinal: true messages (partials stay on-device). Future phases may send isFinal: false for live partial display on web, but that's not required for this feature to ship.
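To make the message shape concrete, here is a rough sketch of how the server could type and validate an incoming frame. The names VoiceTranscriptMessage and parseVoiceTranscript are hypothetical and not taken from the codebase, and treating a missing isFinal as true is an assumption based on the phone only sending final transcripts in Phase 1.

```typescript
// Hypothetical shape of the upstream message, modeled on the example payload above.
interface VoiceTranscriptMessage {
  type: "voice:transcript";
  text: string;
  isFinal: boolean;
}

// Returns the parsed message, or null when the raw frame is not a
// well-formed voice:transcript message.
function parseVoiceTranscript(raw: string): VoiceTranscriptMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // frame is not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const msg = data as Record<string, unknown>;
  if (msg.type !== "voice:transcript") return null;
  if (typeof msg.text !== "string") return null;

  // Assumption: a missing isFinal is treated as final, since Phase 1
  // clients only send final transcripts.
  const isFinal = typeof msg.isFinal === "boolean" ? msg.isFinal : true;
  return { type: "voice:transcript", text: msg.text, isFinal };
}
```

Returning null for anything unrecognized lets the caller fall through to whatever handling other message types already get.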

Acceptance criteria

  • companion-ws parses incoming JSON with type === 'voice:transcript' and extracts text
  • Empty / whitespace-only transcripts are dropped silently
  • The text is delivered to the agent through the same code path as a typed /chat message, including any prompt-engineering wrappers the agent normally gets
  • Existing downstream speech messages keep working end-to-end (no regression)
  • Optional: log the transcript with a source: 'voice' tag for analytics later
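The criteria above could be sketched roughly as follows. This is not the actual companion-ws implementation: handleUpstreamFrame, TranscriptDeps, sendUserText, and log are hypothetical names, with sendUserText standing in for whatever function already delivers a typed chat message to the agent. Trimming the text before delivery and ignoring isFinal: false frames are assumptions, not stated requirements.

```typescript
// Hypothetical dependencies; the real server would pass in its own functions.
interface TranscriptDeps {
  // Stand-in for the existing typed-message entry point into the agent.
  sendUserText: (text: string) => void;
  // Optional analytics hook for the source: 'voice' tag.
  log?: (entry: { source: "voice"; text: string }) => void;
}

// Returns true when the frame was a voice:transcript message (even if it
// was dropped), false when other handlers should look at it.
function handleUpstreamFrame(raw: string, deps: TranscriptDeps): boolean {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return false; // not JSON, so not ours
  }
  if (typeof data !== "object" || data === null) return false;

  const msg = data as Record<string, unknown>;
  if (msg.type !== "voice:transcript") return false;

  // Assumption: partial transcripts are not routed to the agent.
  if (msg.isFinal === false) return true;

  // Drop missing, empty, or whitespace-only transcripts silently.
  if (typeof msg.text !== "string") return true;
  const text = msg.text.trim();
  if (text.length === 0) return true;

  if (deps.log) deps.log({ source: "voice", text });
  deps.sendUserText(text); // same path as a typed chat message
  return true;
}
```

Injecting sendUserText keeps the sketch independent of how the chat pipeline is wired and makes the drop rules easy to unit test.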

Related

  • Mobile-side spec: backlog/F007-talk-to-agent-via-voice.md in 23blocks/ai-maestro-app
  • Companion side: see the existing useCompanionWS.ts onSpeech handler and the speech downstream message
  • Pairs with: companion-ws: accept voice:interrupt upstream messages from mobile (F007), the barge-in counterpart

Out of scope

  • Cloud STT (we're using on-device for Phase 1; Phase 3 may add Deepgram backend STT)
  • Premium TTS (Phase 2 will swap expo-speech for Cartesia/ElevenLabs)

Metadata

  • Labels: enhancement