
feat(stt): low-end Whisper tier, streaming & reliability fixes #5

Merged: ijbo merged 2 commits into main from fix/stt-audio-improvements on Jun 22, 2026

ijbo (Contributor) commented Jun 22, 2026

What

Three improvements to the speech-to-text fallback path, found via a code audit of the audio architecture. The WASM STT fallback previously hard-loaded the ~800 MB Whisper-Large-V3-Turbo model, returned one-shot (non-streaming) results, and had two reliability bugs.

Changes

🎙️ Low-end device tier + streaming (speech-worker.js, speechToText.js)

  • Devices with deviceMemory ≤ 4 GB or hardwareConcurrency ≤ 4 now load multilingual whisper-tiny (~75 MB q4) instead of whisper-large-v3-turbo (~800 MB). This gives Chromebooks and older phones working dictation where they previously couldn't load any STT model.
  • Used the multilingual tiny model, not tiny.en, so the 14-language support is preserved on exactly those devices.
  • Unknown devices (non-Chromium, no deviceMemory) safely default to the full model — capable machines are never downgraded.
  • Wired WhisperTextStreamer so partial text streams as tokens decode (live interim feedback, closing the parity gap with Voxtral, which already streamed).
  • Consent popup now shows the right size (~75 MB on low-end vs ~800 MB).
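For reviewers, the tier heuristic described above can be sketched roughly as follows. This is an illustration, not the code in speech-worker.js: the function name `pickWhisperTier`, the constant names, and the returned object shape are hypothetical, and only the thresholds and approximate sizes come from this PR. The PR text does not say how a device that reports cores but no deviceMemory is handled; this sketch assumes each missing signal simply counts as "not low-end".

```javascript
// Hypothetical sketch of the low-end tier check; names are illustrative.
const LOW_END_MEMORY_GB = 4; // threshold from the PR description
const LOW_END_CORES = 4;     // threshold from the PR description

function pickWhisperTier({ deviceMemory, hardwareConcurrency } = {}) {
  // A missing or non-numeric signal never triggers a downgrade (assumption).
  const lowMemory =
    typeof deviceMemory === 'number' && deviceMemory <= LOW_END_MEMORY_GB;
  const fewCores =
    typeof hardwareConcurrency === 'number' &&
    hardwareConcurrency <= LOW_END_CORES;

  return lowMemory || fewCores
    ? { tier: 'tiny', approxDownloadMB: 75 }
    : { tier: 'turbo', approxDownloadMB: 800 };
}
```

In a browser this could be called as `pickWhisperTier(navigator)`, and the same result could feed the size shown in the consent popup so the two never disagree.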

🐛 WebGPU-detection race (speechToText.js)

Detection was async + fire-and-forget, so the engine name/indicator could be stale before it resolved. Added a webGPUResolved flag, a post-detection indicator refresh, and a public M.speechToText.ready() that resolves once the engine choice is final. (Worker selection itself was already correctly gated.)
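A minimal sketch of the pattern, assuming a single detection promise that is kept and handed back from ready(). Only `webGPUResolved` and `ready()` are names from this PR; `createEngineState`, `onChange`, and the `'webgpu'`/`'wasm'` labels are hypothetical stand-ins for the real engine names and indicator refresh.

```javascript
// Hypothetical sketch: keep the detection promise instead of discarding it.
function createEngineState(detectWebGPU, onChange = () => {}) {
  let webGPUResolved = false;
  let engine = 'wasm'; // provisional value until detection settles

  const detection = Promise.resolve()
    .then(detectWebGPU)
    .catch(() => false) // a failed probe is treated as "no WebGPU"
    .then((hasWebGPU) => {
      engine = hasWebGPU ? 'webgpu' : 'wasm';
      webGPUResolved = true;
      onChange(engine); // post-detection indicator refresh
      return engine;
    });

  return {
    get engine() { return engine; },
    get resolved() { return webGPUResolved; },
    ready: () => detection, // resolves once the engine choice is final
  };
}

// Browser probe; navigator.gpu is undefined where WebGPU is unavailable.
async function detectWebGPU() {
  if (typeof navigator === 'undefined' || !navigator.gpu) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}
```

Callers that need the final engine name would `await state.ready()` rather than reading `state.engine` immediately after startup.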

📱 Silent microphone failure on mobile (speechToText.js)

The neural engine's getUserMedia is opened separately from the Web Speech API's internal stream; on mobile the second request can be denied while Web Speech keeps working — previously failing silently. Now cleans up partial state and surfaces a toast + interim message ("using Web Speech only"), distinguishing permission-denied from other failures.
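The handling can be sketched as below, under stated assumptions: `openNeuralMic`, `describeMicFailure`, `notify`, and `cleanup` are hypothetical names, and the message wording is illustrative apart from "using Web Speech only". The one fact relied on is that getUserMedia rejects with an error whose `name` is `NotAllowedError` when permission is denied.

```javascript
// Hypothetical sketch: surface a failed second microphone request instead of swallowing it.
function describeMicFailure(err) {
  const denied = Boolean(err) && err.name === 'NotAllowedError';
  return denied
    ? 'Microphone permission denied; using Web Speech only'
    : 'Microphone unavailable; using Web Speech only';
}

async function openNeuralMic(getUserMedia, { notify, cleanup }) {
  try {
    return await getUserMedia({ audio: true });
  } catch (err) {
    cleanup();                       // tear down any partially initialised state
    notify(describeMicFailure(err)); // e.g. toast plus interim message
    return null;                     // caller continues with Web Speech only
  }
}
```

In the browser the first argument could be `(constraints) => navigator.mediaDevices.getUserMedia(constraints)`; passing it in keeps the failure path testable without a real device.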

Testing

  • Vite build compiles clean (validates the WhisperTextStreamer import + module syntax).
  • stt-tag.spec.js + speech-commands.spec.js: 22/22 passing, no regressions.
  • Verified live: ready() resolves with the final engine; tier heuristic correct across boundary cases (≤4 GB → tiny, ≤4 cores → tiny, unknown → turbo).

Context

This came out of evaluating whether jax-js could improve TextAgent's audio stack. Conclusion: jax-js wasn't the right fit (it'd add a second early-stage runtime alongside the unremovable transformers.js/ONNX stack while duplicating existing capabilities). The one real gap it pointed at — a low-end STT fallback — is fixed here natively, with no new dependency.

🤖 Generated with Claude Code

ijbo and others added 2 commits June 22, 2026 18:14
Add a device-capability tier to the WASM speech-to-text fallback and fix two
audit-found reliability issues.

- low-end tier: devices with <=4GB RAM or <=4 cores load multilingual
  whisper-tiny (~75MB q4) instead of whisper-large-v3-turbo (~800MB); uses the
  MULTILINGUAL tiny model (not tiny.en) so 14-language support is preserved
- streaming: WhisperTextStreamer emits partial text as tokens decode, posted as
  `partial` messages for live interim feedback (parity with Voxtral)
- fix race: WebGPU detection was fire-and-forget, so the engine name could be
  stale; add webGPUResolved flag, indicator refresh, and M.speechToText.ready()
- fix silent failure: the neural engine's getUserMedia is separate from Web
  Speech's stream; a denied second request on mobile failed silently. Now cleans
  up and surfaces a toast ("using Web Speech only")

Vite build clean; 22/22 stt-tag + speech-command tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…branch}})

docker/metadata-action used type=sha,prefix={{branch}}- but {{branch}} is
empty on pull_request events, producing an invalid tag :-<sha>. Use a static
sha- prefix; branch/PR identity is still captured by the ref-based tag rules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ijbo ijbo merged commit 79ffe88 into main Jun 22, 2026
4 checks passed