feat(stt): low-end Whisper tier, streaming & reliability fixes by ijbo · Pull Request #5 · Textagent/textagent.github.io

ijbo · 2026-06-22T09:15:13Z

What

Three improvements to the speech-to-text fallback path, found via a code audit of the audio architecture. The WASM STT fallback previously hard-loaded the ~800 MB Whisper-Large-V3-Turbo model, returned one-shot (non-streaming) results, and had two reliability bugs.

Changes

🎙️ Low-end device tier + streaming (speech-worker.js, speechToText.js)

Devices with deviceMemory ≤ 4 GB or hardwareConcurrency ≤ 4 now load multilingual whisper-tiny (~75 MB q4) instead of whisper-large-v3-turbo (~800 MB). This gives working dictation on Chromebooks and older phones that previously couldn't load any STT model.
Used the multilingual tiny model, not tiny.en, so the 14-language support is preserved on exactly those devices.
Unknown devices (non-Chromium, no deviceMemory) safely default to the full model — capable machines are never downgraded.
Wired WhisperTextStreamer so partial text streams as tokens decode (live interim feedback, closing the parity gap with Voxtral, which already streamed).
Consent popup now shows the right size (~75 MB on low-end vs ~800 MB).
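
The tier rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in speech-worker.js: `pickWhisperTier`, `consentSizeLabel`, and the `TIERS` table are hypothetical names. The PR does not say how a missing deviceMemory combines with a low core count, so the sketch assumes a missing signal never triggers a downgrade on its own.

```javascript
// Hypothetical sketch of the tier rule; function and field names are illustrative.
const TIERS = {
  tiny: { model: 'whisper-tiny', approxDownloadMB: 75 },
  turbo: { model: 'whisper-large-v3-turbo', approxDownloadMB: 800 },
};

// `nav` is a navigator-like object; pass the real `navigator` in a browser.
function pickWhisperTier(nav) {
  const mem = nav.deviceMemory;          // approximate GB; undefined outside Chromium
  const cores = nav.hardwareConcurrency; // logical core count; may be undefined
  // Assumption: a missing signal never causes a downgrade by itself.
  const lowMemory = typeof mem === 'number' && mem <= 4;
  const fewCores = typeof cores === 'number' && cores <= 4;
  return lowMemory || fewCores ? TIERS.tiny : TIERS.turbo;
}

// Size string for a consent prompt, derived from the same tier object
// so the prompt and the loader cannot disagree.
function consentSizeLabel(tier) {
  return `~${tier.approxDownloadMB} MB`;
}
```

In a browser this would be called as `pickWhisperTier(navigator)`. Passing the navigator in as an argument keeps the boundary cases (exactly 4 GB, exactly 4 cores, no signals at all) easy to check with plain objects.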

🐛 WebGPU-detection race (speechToText.js)

Detection was async + fire-and-forget, so the engine name/indicator could be stale before it resolved. Added a webGPUResolved flag, a post-detection indicator refresh, and a public M.speechToText.ready() that resolves once the engine choice is final. (Worker selection itself was already correctly gated.)
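
One way to picture the fix is a selector that holds a provisional engine until detection settles and exposes a promise for the final choice. The sketch below is an assumption-laden illustration, not the code in speechToText.js: `createEngineSelector`, `onEngineFinal`, and the `'webgpu'`/`'wasm'` labels are hypothetical, and only the `webGPUResolved` flag and the `ready()` name come from the description above.

```javascript
// Hypothetical sketch: detection is injected so it can be stubbed.
function createEngineSelector(detectWebGPU) {
  let webGPUResolved = false;
  let engine = 'wasm'; // provisional value until detection settles
  const listeners = [];

  // Start detection once; a thrown error or rejection counts as "no WebGPU".
  const detection = Promise.resolve()
    .then(detectWebGPU)
    .catch(() => false)
    .then((hasWebGPU) => {
      engine = hasWebGPU ? 'webgpu' : 'wasm';
      webGPUResolved = true;
      // Post-detection refresh: tell the UI the final engine.
      listeners.forEach((fn) => fn(engine));
      return engine;
    });

  return {
    ready: () => detection,             // resolves once the choice is final
    isResolved: () => webGPUResolved,
    currentEngine: () => engine,
    onEngineFinal: (fn) => {
      if (webGPUResolved) fn(engine);   // late subscribers still get the result
      else listeners.push(fn);
    },
  };
}
```

In a browser, the detector could be `async () => Boolean(await navigator.gpu?.requestAdapter())`. Callers that need the final engine name await `ready()` instead of reading the provisional value.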

📱 Silent microphone failure on mobile (speechToText.js)

The neural engine's getUserMedia is opened separately from the Web Speech API's internal stream; on mobile the second request can be denied while Web Speech keeps working — previously failing silently. Now cleans up partial state and surfaces a toast + interim message ("using Web Speech only"), distinguishing permission-denied from other failures.
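
The failure handling described above might look roughly like this. It is a sketch under stated assumptions: `openNeuralMic`, `startCapture`, and `notify` are hypothetical names, and the message strings are illustrative apart from the "using Web Speech only" wording quoted above. The one platform fact relied on is that `getUserMedia` rejects with a `NotAllowedError` when permission is denied.

```javascript
// Hypothetical sketch; every dependency is injected so it can be stubbed.
async function openNeuralMic({ getUserMedia, startCapture, notify }) {
  let stream = null;
  try {
    stream = await getUserMedia({ audio: true });
    await startCapture(stream);
    return { ok: true, stream };
  } catch (err) {
    // Clean up partial state: release any tracks that were already opened.
    if (stream) stream.getTracks().forEach((track) => track.stop());
    // Distinguish a permission denial from every other failure.
    const denied = Boolean(err) && err.name === 'NotAllowedError';
    const reason = denied ? 'permission-denied' : 'other';
    notify(denied
      ? 'Microphone permission denied for the neural engine; using Web Speech only'
      : 'Neural engine microphone unavailable; using Web Speech only');
    return { ok: false, reason };
  }
}
```

In a browser, `getUserMedia` would be `(constraints) => navigator.mediaDevices.getUserMedia(constraints)` and `notify` would drive the toast and the interim message.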

Testing

Vite build compiles clean (validates the WhisperTextStreamer import + module syntax).
stt-tag.spec.js + speech-commands.spec.js: 22/22 passing, no regressions.
Verified live: ready() resolves with the final engine; tier heuristic correct across boundary cases (≤4 GB → tiny, ≤4 cores → tiny, unknown → turbo).

Context

This came out of evaluating whether jax-js could improve TextAgent's audio stack. Conclusion: jax-js wasn't the right fit (it'd add a second early-stage runtime alongside the unremovable transformers.js/ONNX stack while duplicating existing capabilities). The one real gap it pointed at — a low-end STT fallback — is fixed here natively, with no new dependency.

🤖 Generated with Claude Code

Add a device-capability tier to the WASM speech-to-text fallback and fix two audit-found reliability issues.

- low-end tier: devices with <=4GB RAM or <=4 cores load multilingual whisper-tiny (~75MB q4) instead of whisper-large-v3-turbo (~800MB); uses the MULTILINGUAL tiny model (not tiny.en) so 14-language support is preserved
- streaming: WhisperTextStreamer emits partial text as tokens decode, posted as `partial` messages for live interim feedback (parity with Voxtral)
- fix race: WebGPU detection was fire-and-forget, so the engine name could be stale; add webGPUResolved flag, indicator refresh, and M.speechToText.ready()
- fix silent failure: the neural engine's getUserMedia is separate from Web Speech's stream; a denied second request on mobile failed silently. Now cleans up and surfaces a toast ("using Web Speech only")

Vite build clean; 22/22 stt-tag + speech-command tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
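
For the `partial` messages mentioned in this commit, the main thread needs to turn a stream of text chunks into interim display text. The sketch below shows one way to do that. The message shapes (`{ type: 'partial', text }` and `{ type: 'final', text }`) and the name `createTranscriptView` are assumptions for illustration; the PR only states that partial text is posted as `partial` messages. It also assumes each partial carries only the newly decoded chunk rather than the full text so far.

```javascript
// Hypothetical sketch of a main-thread handler for worker transcript messages.
function createTranscriptView(render) {
  let interim = ''; // text accumulated from partial chunks for the current utterance

  return function handleWorkerMessage(msg) {
    if (msg.type === 'partial') {
      // Append the new chunk and show it as interim (non-final) text.
      interim += msg.text;
      render({ text: interim, isFinal: false });
    } else if (msg.type === 'final') {
      // The final result replaces the interim text and resets the buffer.
      interim = '';
      render({ text: msg.text, isFinal: true });
    }
    // Other message types are ignored by this handler.
  };
}
```

It would be wired up as `worker.onmessage = (event) => handle(event.data)`, where `handle` is the function returned by `createTranscriptView`.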

…branch}})

docker/metadata-action used type=sha,prefix={{branch}}- but {{branch}} is empty on pull_request events, producing an invalid tag :-<sha>. Use a static sha- prefix; branch/PR identity is still captured by the ref-based tag rules.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ijbo and others added 2 commits June 22, 2026 18:14

ijbo mentioned this pull request Jun 22, 2026

feat: onnx-community upgrades — Kokoro WebGPU, Moonshine STT, SmolVLM, transformers.js v4 #6

Merged

ijbo merged commit 79ffe88 into main Jun 22, 2026
4 checks passed

ijbo merged 2 commits into main from fix/stt-audio-improvements
