feat(stt): low-end Whisper tier, streaming & reliability fixes #5
Merged
Add a device-capability tier to the WASM speech-to-text fallback and fix two
audit-found reliability issues.
- low-end tier: devices with <=4GB RAM or <=4 cores load multilingual
whisper-tiny (~75MB q4) instead of whisper-large-v3-turbo (~800MB); uses the
MULTILINGUAL tiny model (not tiny.en) so 14-language support is preserved
- streaming: WhisperTextStreamer emits partial text as tokens decode, posted as
`partial` messages for live interim feedback (parity with Voxtral)
- fix race: WebGPU detection was fire-and-forget, so the engine name could be
stale; add webGPUResolved flag, indicator refresh, and M.speechToText.ready()
- fix silent failure: the neural engine's getUserMedia is separate from Web
Speech's stream; a denied second request on mobile failed silently. Now cleans
up and surfaces a toast ("using Web Speech only")
Vite build clean; 22/22 stt-tag + speech-command tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…branch}})
docker/metadata-action used type=sha,prefix={{branch}}- but {{branch}} is
empty on pull_request events, producing an invalid tag :-<sha>. Use a static
sha- prefix; branch/PR identity is still captured by the ref-based tag rules.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Three improvements to the speech-to-text fallback path, found via a code audit of the audio architecture. The WASM STT fallback previously hard-loaded the ~800 MB Whisper-Large-V3-Turbo model, returned one-shot (non-streaming) results, and had two reliability bugs.
Changes
🎙️ Low-end device tier + streaming (speech-worker.js, speechToText.js)
- Devices with `deviceMemory ≤ 4 GB` or `hardwareConcurrency ≤ 4` now load multilingual `whisper-tiny` (~75 MB q4) instead of `whisper-large-v3-turbo` (~800 MB): working dictation on Chromebooks/older phones that previously couldn't load any STT model.
- Uses the multilingual tiny model, not `tiny.en`, so the 14-language support is preserved on exactly those devices.
- Unknown devices (no `deviceMemory`) safely default to the full model; capable machines are never downgraded.
- Uses `WhisperTextStreamer` so partial text streams as tokens decode (live interim feedback, closing the parity gap with Voxtral, which already streamed).

🐛 WebGPU-detection race (speechToText.js)
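In outline, the fix in this section amounts to keeping the detection promise around instead of discarding it. The following is a minimal sketch of that shape, not the PR's actual code: `createEngineState`, `detectWebGPU`, `onResolved`, and the engine labels are hypothetical names chosen for illustration; only `webGPUResolved` and `ready()` come from this description.

```javascript
// Hypothetical helper: `createEngineState`, `detectWebGPU` and `onResolved`
// are illustrative names, not taken from the PR.
function createEngineState(detectWebGPU, onResolved = () => {}) {
  let webGPUResolved = false;
  let engineName = "detecting"; // placeholder label until detection settles

  // Keep the detection promise instead of discarding it, so callers can await it.
  const detection = Promise.resolve()
    .then(() => detectWebGPU())
    .catch(() => false) // treat a failed probe as "no WebGPU"
    .then((hasWebGPU) => {
      engineName = hasWebGPU ? "webgpu" : "wasm";
      webGPUResolved = true;
      onResolved(engineName); // e.g. refresh the engine indicator
      return engineName;
    });

  return {
    ready: () => detection, // resolves once the engine choice is final
    isResolved: () => webGPUResolved,
    engine: () => engineName,
  };
}
```

In a browser, the probe passed in might be something like `async () => Boolean(await navigator.gpu?.requestAdapter())`, though the detection logic this PR actually uses is not shown here.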
Detection was async + fire-and-forget, so the engine name/indicator could be stale before it resolved. Added a `webGPUResolved` flag, a post-detection indicator refresh, and a public `M.speechToText.ready()` that resolves once the engine choice is final. (Worker selection itself was already correctly gated.)

📱 Silent microphone failure on mobile (speechToText.js)
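The failure handling in this section can be sketched roughly as follows. This is an assumption-laden illustration rather than the PR's implementation: `classifyMicError`, `openNeuralMic`, `notify`, and `cleanup` are hypothetical names, and the message wording beyond "using Web Speech only" is invented for the example. It relies on the standard behaviour that `getUserMedia` rejects with a `NotAllowedError` when permission is denied.

```javascript
// Hypothetical helpers; names and message wording are illustrative only.
// getUserMedia rejects with a DOMException named "NotAllowedError"
// when the user or platform denies the request.
function classifyMicError(err) {
  return err && err.name === "NotAllowedError" ? "permission-denied" : "other";
}

// `getUserMedia` is injected so the function can be exercised without a browser.
async function openNeuralMic(getUserMedia, { notify, cleanup }) {
  try {
    const stream = await getUserMedia({ audio: true });
    return { ok: true, stream };
  } catch (err) {
    const reason = classifyMicError(err);
    cleanup(); // release any partially initialised capture state
    notify(
      reason === "permission-denied"
        ? "Microphone permission denied for the neural engine; using Web Speech only"
        : "Neural engine microphone unavailable; using Web Speech only"
    );
    return { ok: false, reason };
  }
}
```

In a page this might be called as `openNeuralMic((c) => navigator.mediaDevices.getUserMedia(c), { notify, cleanup })`, with `notify` wired to whatever toast mechanism the app provides.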
The neural engine's `getUserMedia` is opened separately from the Web Speech API's internal stream; on mobile the second request can be denied while Web Speech keeps working, which previously failed silently. It now cleans up partial state and surfaces a toast + interim message ("using Web Speech only"), distinguishing permission-denied from other failures.

Testing
- Vite build clean (`WhisperTextStreamer` import + module syntax).
- `stt-tag.spec.js` + `speech-commands.spec.js`: 22/22 passing, no regressions.
- `ready()` resolves with the final engine; tier heuristic correct across boundary cases (≤4 GB → tiny, ≤4 cores → tiny, unknown → turbo).

Context
This came out of evaluating whether jax-js could improve TextAgent's audio stack. Conclusion: jax-js wasn't the right fit (it'd add a second early-stage runtime alongside the unremovable transformers.js/ONNX stack while duplicating existing capabilities). The one real gap it pointed at — a low-end STT fallback — is fixed here natively, with no new dependency.
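For reference, the low-end fallback selection described under Changes could be sketched as below. This is a minimal illustration under stated assumptions, not the code in `speech-worker.js`: `pickWhisperModel` and the two constants are hypothetical names, and the model strings are the short names used in this description rather than the exact IDs the worker loads.

```javascript
// Hypothetical identifiers; the PR names the models but not the exact IDs it loads.
const TINY_MODEL = "whisper-tiny";           // multilingual, ~75 MB q4
const FULL_MODEL = "whisper-large-v3-turbo"; // ~800 MB

// Pick a model tier from navigator-like capability hints.
// A hint only counts when it is a finite number, so a browser that does not
// expose deviceMemory (or hardwareConcurrency) is never downgraded by it.
function pickWhisperModel(nav) {
  const memoryGB = Number.isFinite(nav?.deviceMemory) ? nav.deviceMemory : null;
  const cores = Number.isFinite(nav?.hardwareConcurrency) ? nav.hardwareConcurrency : null;
  const lowMemory = memoryGB !== null && memoryGB <= 4;
  const fewCores = cores !== null && cores <= 4;
  return lowMemory || fewCores ? TINY_MODEL : FULL_MODEL;
}
```

Passing the navigator-like object in as a parameter keeps the boundary cases listed under Testing easy to check without a browser; in a worker the argument would presumably be `self.navigator`.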
🤖 Generated with Claude Code