Textagent · ijbo · Jun 22, 2026
diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml
@@ -39,7 +39,7 @@ jobs:
         tags: |
           type=ref,event=branch
           type=ref,event=pr
-          type=sha,prefix={{branch}}-
+          type=sha,prefix=sha-
           type=raw,value=latest,enable={{is_default_branch}}
 
     - name: Build and push Docker image

diff --git a/changelogs/CHANGELOG-docker-tag-fix.md b/changelogs/CHANGELOG-docker-tag-fix.md
@@ -0,0 +1,11 @@
+# Fix invalid Docker image tag on PR builds
+
+- Fixed the `build-and-push` GitHub Actions job failing on pull requests with `invalid tag "...:-<sha>": invalid reference format`
+- Root cause: `docker/metadata-action` used `type=sha,prefix={{branch}}-`, but `{{branch}}` resolves to an empty string on `pull_request` events, producing a tag with a leading hyphen (`:-b0503a8`) — which Docker rejects
+- Changed the SHA tag prefix to a static `sha-`, valid across branch pushes, PRs, and the default branch; branch/PR identity is still captured by the separate `type=ref,event=branch` and `type=ref,event=pr` tag rules
+
+---
+
+## Summary
+
+The Docker publish workflow couldn't build on any pull request because the SHA-based image tag was constructed with `{{branch}}-`, and the `branch` template is empty on PR events — yielding an invalid `:-<sha>` tag. Using a constant `sha-` prefix fixes PR builds without losing branch/PR information (the ref-based tags already encode that).
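
The failure described in this changelog can be reproduced without Docker. The sketch below is illustrative only: `renderShaTag` is a hypothetical stand-in for how the action joins the prefix and the short SHA (it is not the code of `docker/metadata-action`), and the regular expression is the tag grammar from Docker's image reference format (one word character, then up to 127 word characters, dots, or hyphens).

```javascript
// Docker tag grammar: one word character, then up to 127 of [A-Za-z0-9_.-].
const DOCKER_TAG = /^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$/;

// Hypothetical stand-in for the action's template rendering: substitute
// {{branch}} in the prefix, then append the short SHA.
function renderShaTag(prefixTemplate, branch, shortSha) {
    return prefixTemplate.replace('{{branch}}', branch) + shortSha;
}

function isValidDockerTag(tag) {
    return DOCKER_TAG.test(tag);
}

// On pull_request events {{branch}} is empty, so the old prefix leaves a leading hyphen.
const oldPrTag = renderShaTag('{{branch}}-', '', 'b0503a8');
const oldPushTag = renderShaTag('{{branch}}-', 'main', 'b0503a8');
const newPrTag = renderShaTag('sha-', '', 'b0503a8');

for (const tag of [oldPrTag, oldPushTag, newPrTag]) {
    console.log(JSON.stringify(tag), isValidDockerTag(tag) ? 'valid' : 'invalid');
}
```

Only the first tag is rejected, which matches the observation that branch pushes kept working while every PR build failed.
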
diff --git a/changelogs/CHANGELOG-stt-audio-improvements.md b/changelogs/CHANGELOG-stt-audio-improvements.md
@@ -0,0 +1,44 @@
+# Speech-to-Text Improvements — Low-End Tier, Streaming & Reliability Fixes
+
+- Added a **low-end device tier** to the Whisper (WASM) fallback: on low-RAM / few-core devices (`navigator.deviceMemory ≤ 4 GB` or `hardwareConcurrency ≤ 4`), `speech-worker.js` now loads **multilingual `whisper-tiny`** (~75 MB, q4) instead of `whisper-large-v3-turbo` (~800 MB) — giving working dictation on Chromebooks / older phones that previously couldn't load any STT model
+- Used **multilingual `whisper-tiny`, not `tiny.en`** — the 14-language support is preserved on exactly the devices that fall back to it
+- Added **streaming partial results** to the Whisper path via `WhisperTextStreamer` — interim text now appears as tokens decode instead of a blank field until the final result (closes the UX-parity gap with Voxtral, which already streamed)
+- **Fixed (race):** WebGPU detection was async + fire-and-forget, so `sttModelName` and the engine indicator could show a stale value before detection resolved; added a `webGPUResolved` flag, a post-detection indicator refresh, and a public `M.speechToText.ready()` that resolves once the engine choice is final
+- **Fixed (silent failure):** the neural engine's microphone (`getUserMedia` in `startAudioCapture`) is opened separately from the Web Speech API's internal stream; on mobile the second request can be denied while Web Speech keeps working, which previously failed silently. It now cleans up partial state and surfaces a toast + interim message ("using Web Speech only"), with a distinct "couldn't access the mic" wording for `NotAllowedError` / `NotFoundError` / `SecurityError`
+- Consent popup now reflects the chosen tier (shows "~75 MB / Whisper Tiny" on low-end devices instead of always "~800 MB")
+
+---
+
+## Summary
+
+The WASM speech-to-text fallback previously hard-loaded the ~800 MB Whisper-Large-V3-Turbo model with one-shot (non-streaming) results and no path for low-end devices. This adds a device-capability-aware tier that loads a ~75 MB multilingual Whisper-Tiny on constrained hardware, streams partial transcriptions as they decode, and fixes two reliability issues found in an audit: a WebGPU-detection race that could surface the wrong engine name, and a silent microphone failure on mobile that left users believing the neural engine was running when only Web Speech was.
+
+---
+
+## 1. Low-End STT Tier
+**Files:** `js/speech-worker.js`, `js/speechToText.js`
+**What:** Added a `TIERS` map (`turbo` → whisper-large-v3-turbo q8; `tiny` → whisper-tiny q4) and a `pickTier()` device probe in the worker, mirrored by `pickWhisperTier()` on the main thread. The caller passes a `tier` hint in the `init` message (only for the WASM/Whisper path; Voxtral/WebGPU ignores it). Where `deviceMemory` is unavailable (non-Chromium browsers), the core count alone decides; a device that reports neither signal defaults to `turbo`, so a machine is downgraded only on positive evidence of constrained hardware.
+**Impact:** Devices with ≤4 GB RAM or ≤4 cores get a ~75 MB model that actually loads and runs, instead of failing on the ~800 MB download.
+
+## 2. Whisper Streaming
+**Files:** `js/speech-worker.js`
+**What:** Wrapped transcription in a `WhisperTextStreamer` (transformers.js 3.8.1) whose `callback_function` posts `partial` messages as text decodes. The main thread already had a `partial` handler that renders interim text. Falls back to one-shot if the streamer can't be constructed.
+**Impact:** Live interim feedback during transcription instead of a blank field for the duration of the clip.
+
+## 3. WebGPU-Detection Race Fix
+**Files:** `js/speechToText.js`
+**What:** Added `webGPUResolved`; the detection IIFE now refreshes the engine indicator on completion, and a new `M.speechToText.ready()` resolves once `webGPUPromise` settles. `getEngines()` exposes `webGPUResolved`.
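+**Impact:** The engine indicator is refreshed once detection completes, and callers that await `ready()` (or check `webGPUResolved`) no longer read a stale STT engine while detection is mid-flight. (Worker selection was already correctly gated on `webGPUPromise`.)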
+
+## 4. Silent Microphone Failure Fix
+**Files:** `js/speechToText.js`
+**What:** `startAudioCapture`'s catch block now disconnects partial audio nodes, stops the stream, and shows a toast + interim message instead of only `console.warn`. Detects `NotAllowedError` / `NotFoundError` / `SecurityError` for a clearer "couldn't access the mic" message.
+**Impact:** Users are told when the higher-quality engine isn't running, instead of silently getting only Web Speech.
+
+---
+
+## Testing
+
+- Vite build compiles clean (validates the `WhisperTextStreamer` import and all module syntax).
+- `stt-tag.spec.js` + `speech-commands.spec.js`: 22/22 passing, no regressions.
+- Verified live: `ready()` resolves with the final engine; tier heuristic correct across boundary cases (≤4 GB → tiny, unknown → turbo).
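
The tier heuristic's boundary cases can be checked outside a browser. This is a minimal sketch, assuming the same rule as `pickTier()` in the diff; `pickTierFrom` is a hypothetical pure helper (not part of the PR) that takes the two `navigator` fields as plain values.

```javascript
// Hypothetical pure variant of pickTier(): same rule, but the two navigator
// fields are passed in so the function can run under Node.
function pickTierFrom({ deviceMemory, hardwareConcurrency } = {}) {
    const lowMem = typeof deviceMemory === 'number' && deviceMemory <= 4;
    const fewCores = typeof hardwareConcurrency === 'number' && hardwareConcurrency <= 4;
    return (lowMem || fewCores) ? 'tiny' : 'turbo';
}

const cases = [
    { deviceMemory: 4, hardwareConcurrency: 8 },  // low RAM only
    { deviceMemory: 8, hardwareConcurrency: 8 },  // capable on both signals
    { hardwareConcurrency: 4 },                   // no deviceMemory, few cores
    { hardwareConcurrency: 16 },                  // no deviceMemory, many cores
    {},                                           // neither signal reported
];

const tiers = cases.map((c) => pickTierFrom(c));
cases.forEach((c, i) => console.log(JSON.stringify(c), '->', tiers[i]));
```

The third case is the one to watch: a browser that does not expose `deviceMemory` still selects `tiny` when it reports four or fewer cores.
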
diff --git a/js/speech-worker.js b/js/speech-worker.js
@@ -1,28 +1,62 @@
 // ============================================
-// speech-worker.js — Whisper Large V3 Turbo ASR WebWorker (WASM fallback)
+// speech-worker.js — Whisper ASR WebWorker (WASM fallback)
 // Used when WebGPU is NOT available. WebGPU devices use voxtral-worker.js.
-// Runs textagent/whisper-large-v3-turbo via @huggingface/transformers
-// off the main thread for jank-free transcription.
-// WER ~7.7% (batched)
+// Runs Whisper via @huggingface/transformers off the main thread.
+//
+// Two tiers (chosen by device capability, or forced by the caller via `tier`):
+//   • 'turbo'  → whisper-large-v3-turbo (~800 MB q8, WER ~7.7%) — default
+//   • 'tiny'   → whisper-tiny           (~75 MB q4)            — low-end devices
+// IMPORTANT: the low-end model is MULTILINGUAL whisper-tiny, NOT tiny.en, so the
+// 14-language support is preserved on exactly the devices that fall back to it.
+//
+// Streaming: a WhisperTextStreamer emits partial text as tokens decode, posted as
+// `partial` messages (the main thread renders these as live interim text).
 // ============================================
-import { pipeline, env } from '@huggingface/transformers';
+import { pipeline, env, WhisperTextStreamer } from '@huggingface/transformers';
 
 // Model host — downloads ONNX models from textagent HuggingFace org
 const MODEL_HOST = 'https://huggingface.co';
 const MODEL_ORG_FALLBACK = 'onnx-community';
 env.remoteHost = MODEL_HOST;
 
 let transcriber = null;
+let activeTier = 'turbo';
+
+// Tier definitions. dtype/model id chosen per tier; both resolve under the
+// textagent org first, then fall back to onnx-community.
+const TIERS = {
+    turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' },
+    tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…' },
+};
+
+// Decide a tier from device capability when the caller doesn't force one.
+// Heuristic: low RAM or few cores → the lightweight model. deviceMemory is in GB
+// (Chromium-only; undefined elsewhere, in which case we keep the default turbo).
+function pickTier() {
+    const mem = typeof navigator !== 'undefined' ? navigator.deviceMemory : undefined;
+    const cores = typeof navigator !== 'undefined' ? navigator.hardwareConcurrency : undefined;
+    if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) {
+        return 'tiny';
+    }
+    return 'turbo';
+}
 
 self.addEventListener('message', async (e) => {
     const { type, audio } = e.data;
 
     if (type === 'init') {
         try {
-            self.postMessage({ type: 'status', status: 'loading', message: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' });
+            // Caller may force a tier ('tiny' | 'turbo'); otherwise probe the device.
+            activeTier = (e.data.tier === 'tiny' || e.data.tier === 'turbo') ? e.data.tier : pickTier();
+            const tier = TIERS[activeTier];
+
+            self.postMessage({ type: 'status', status: 'loading', message: tier.dlMsg });
+
+            // whisperModelId is mutated on org fallback; referenced by the progress callback.
+            let whisperModelId = tier.id;
 
             const pipelineOpts = {
-                dtype: 'q8',
+                dtype: tier.dtype,
                 device: 'wasm',
                 progress_callback: (progress) => {
                     if (progress.status === 'progress') {
@@ -49,7 +83,6 @@ self.addEventListener('message', async (e) => {
             };
 
             // Try primary org (textagent), fall back to onnx-community
-            let whisperModelId = 'textagent/whisper-large-v3-turbo';
             try {
                 transcriber = await pipeline(
                     'automatic-speech-recognition',
@@ -70,9 +103,10 @@ self.addEventListener('message', async (e) => {
             self.postMessage({
                 type: 'status',
                 status: 'ready',
-                message: 'Whisper ready',
+                message: tier.label + ' ready',
                 device: 'CPU (WASM)',
-                model: 'Whisper V3 Turbo',
+                model: tier.label,
+                tier: activeTier,
             });
         } catch (err) {
             self.postMessage({ type: 'error', message: err.message || String(err) });
@@ -103,9 +137,31 @@ self.addEventListener('message', async (e) => {
 
             // Use language from caller, default to 'en'
             const lang = e.data.lang || 'en';
+
+            // Stream partial text as tokens decode so the user sees live interim
+            // results instead of staring at a blank field until the final result.
+            // WhisperTextStreamer skips special tokens and only emits readable text.
+            let streamed = '';
+            let streamer = null;
+            try {
+                streamer = new WhisperTextStreamer(transcriber.tokenizer, {
+                    skip_prompt: true,
+                    callback_function: (partial) => {
+                        streamed += partial;
+                        const t = streamed.trim();
+                        if (t) self.postMessage({ type: 'partial', text: t });
+                    },
+                });
+            } catch (_) {
+                // If the streamer can't be constructed for any reason, fall back to
+                // a plain one-shot transcription below (streamer stays null).
+                streamer = null;
+            }
+
             const result = await transcriber(normalizedAudio, {
                 language: lang,
                 return_timestamps: false,
+                streamer: streamer || undefined,
             });
             self.postMessage({ type: 'result', text: result.text });
         } catch (err) {

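The accumulation logic inside the streamer callback can be exercised without loading a model. The sketch below is a library-free illustration under stated assumptions: `createPartialEmitter` is a hypothetical helper that mirrors the `callback_function` body in the hunk above, and the chunk strings are made-up sample input rather than real decoder output.

```javascript
// Hypothetical helper mirroring the callback_function in the diff: append each
// decoded chunk, then emit the trimmed running text (skipping empty results).
function createPartialEmitter(post) {
    let streamed = '';
    return (chunk) => {
        streamed += chunk;
        const text = streamed.trim();
        if (text) post({ type: 'partial', text });
    };
}

// Stand-in for self.postMessage: collect messages in an array.
const messages = [];
const onChunk = createPartialEmitter((msg) => messages.push(msg));

// Made-up chunks; the first is whitespace only and produces no message.
for (const chunk of [' ', ' Hello', ' world', '.']) {
    onChunk(chunk);
}

console.log(messages.map((m) => m.text));
```

Each `partial` message carries the full running text, not a delta, so a receiver should replace its interim text with every message rather than append to it.
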
diff --git a/js/speechToText.js b/js/speechToText.js
@@ -18,6 +18,7 @@
     // ── WebGPU Detection ──────────────────────────
     // Determines whether to use Voxtral (WebGPU) or Whisper (WASM)
     let hasWebGPU = false;
+    let webGPUResolved = false;             // true once detection has completed
     let sttModelName = 'Whisper V3 Turbo';  // default, updated after detection
     const webGPUPromise = (async () => {
         if (typeof navigator !== 'undefined' && navigator.gpu) {
@@ -26,8 +27,25 @@
                 if (adapter) { hasWebGPU = true; sttModelName = 'Voxtral Mini 3B'; }
             } catch (_) { /* no WebGPU */ }
         }
+        webGPUResolved = true;
+        // Detection finished after the UI may have rendered with the default model
+        // name — refresh any visible engine label so it reflects the real engine.
+        try { updateEngineIndicator(); } catch (_) { /* indicator not built yet */ }
     })();
 
+    // ── Low-end device probe (for the WASM/Whisper fallback only) ──
+    // On low-RAM / few-core devices the ~800 MB Whisper-Large model is impractical,
+    // so we ask speech-worker.js to load multilingual whisper-tiny (~75 MB) instead.
+    // Returns 'tiny' | 'turbo'. WebGPU devices use Voxtral and ignore this.
+    function pickWhisperTier() {
+        const mem = navigator.deviceMemory;        // GB, Chromium-only
+        const cores = navigator.hardwareConcurrency;
+        if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) {
+            return 'tiny';
+        }
+        return 'turbo';
+    }
+
     // ── State ──────────────────────────────────────
     let isListening = false;
     let lastInsertTime = Date.now();
@@ -627,7 +645,22 @@
             processorNode.connect(audioContext.destination);
             console.log('🧠 Audio capture started at', audioContext.sampleRate, 'Hz');
         } catch (err) {
+            // The mic for the neural engine (Whisper/Voxtral) is opened separately from the
+            // Web Speech API's internal stream. On mobile, this second getUserMedia() can be
+            // denied or fail while Web Speech keeps working — previously this failed silently,
+            // leaving the user to believe the higher-quality engine was running. Surface it.
             console.warn('🧠 Audio capture failed:', err);
+            // Clean up any partially-created nodes/stream so a retry starts fresh.
+            if (processorNode) { try { processorNode.disconnect(); } catch (_) {} processorNode = null; }
+            if (sourceNode) { try { sourceNode.disconnect(); } catch (_) {} sourceNode = null; }
+            if (audioContext) { audioContext.close().catch(() => {}); audioContext = null; }
+            if (mediaStream) { mediaStream.getTracks().forEach(t => t.stop()); mediaStream = null; }
+            const micUnavailable = err && (err.name === 'NotAllowedError' || err.name === 'NotFoundError' || err.name === 'SecurityError');
+            const msg = micUnavailable
+                ? `🎙️ ${sttModelName} couldn't access the mic — using Web Speech only`
+                : `🎙️ ${sttModelName} audio capture failed — using Web Speech only`;
+            if (interimText) interimText.textContent = msg;
+            if (M.showToast) M.showToast(msg, 'warning');
         }
     }
 
@@ -921,8 +954,12 @@
     // ── STT Model Download Consent Popup ────────────
     function showSttConsentPopup(modelType) {
         const isVoxtral = modelType === 'voxtral';
-        const modelName = isVoxtral ? 'Voxtral Mini 3B' : 'Whisper Large V3 Turbo';
-        const downloadSize = isVoxtral ? '~2.7 GB' : '~800 MB';
+        // On low-end devices the Whisper path loads the lightweight tiny model —
+        // reflect that in the consent popup so the size estimate is honest.
+        const whisperTier = isVoxtral ? null : pickWhisperTier();
+        const isTiny = whisperTier === 'tiny';
+        const modelName = isVoxtral ? 'Voxtral Mini 3B' : (isTiny ? 'Whisper Tiny' : 'Whisper Large V3 Turbo');
+        const downloadSize = isVoxtral ? '~2.7 GB' : (isTiny ? '~75 MB' : '~800 MB');
         const deviceLabel = isVoxtral ? 'GPU (WebGPU)' : 'CPU (WASM)';
 
         // Create overlay
@@ -964,7 +1001,7 @@
             // Proceed with model download
             initWorker();
             if (!modelReady && !modelLoading) {
-                worker.postMessage({ type: 'init' });
+                worker.postMessage({ type: 'init', tier: hasWebGPU ? undefined : pickWhisperTier() });
             }
         });
 
@@ -1008,7 +1045,8 @@
                 // Already consented or model cached — proceed directly
                 initWorker();
                 if (!modelReady && !modelLoading) {
-                    worker.postMessage({ type: 'init' });
+                    // tier only matters for the Whisper (WASM) worker; Voxtral ignores it
+                    worker.postMessage({ type: 'init', tier: hasWebGPU ? undefined : pickWhisperTier() });
                 } else if (modelReady) {
                     startAudioCapture();
                 }
@@ -1114,8 +1152,11 @@
             sttModel: sttModelName,
             sttReady: modelReady,
             webGPU: hasWebGPU,
+            webGPUResolved: webGPUResolved,
             aiRefine: aiRefineEnabled,
         }),
+        /** Resolves once WebGPU detection has completed (engine choice is final). */
+        ready: () => webGPUPromise.then(() => M.speechToText.getEngines()),
         /** Start recording in card mode — text routes to callbacks instead of editor */
         startForCard: (onText, onInterim) => {
             // Force-stop any active session first (allows re-recording after Clear)
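
The difference between reading `getEngines()` immediately and awaiting `ready()` can be shown with a simplified model of the detection state. This is a sketch under stated assumptions, not the module's code: `createEngineState` and its injected `detectWebGPU` function are hypothetical, introduced so the example runs without a browser or `navigator.gpu`.

```javascript
// Simplified stand-in for the detection state in speechToText.js. The detector
// is injected (hypothetical) so the sketch runs under Node.
function createEngineState(detectWebGPU) {
    let hasWebGPU = false;
    let webGPUResolved = false;
    let sttModelName = 'Whisper V3 Turbo'; // default until detection settles

    const detection = (async () => {
        try {
            if (await detectWebGPU()) {
                hasWebGPU = true;
                sttModelName = 'Voxtral Mini 3B';
            }
        } catch (_) { /* treat errors as "no WebGPU" */ }
        webGPUResolved = true;
    })();

    const getEngines = () => ({ sttModel: sttModelName, webGPU: hasWebGPU, webGPUResolved });
    // ready() resolves with the engine snapshot once detection has finished.
    return { getEngines, ready: () => detection.then(getEngines) };
}

const state = createEngineState(async () => true);

// Synchronous read: detection has started but not settled yet.
const before = state.getEngines();

state.ready().then((engines) => {
    console.log('before:', before.sttModel, before.webGPUResolved);
    console.log('after: ', engines.sttModel, engines.webGPUResolved);
});
```

A caller that needs the final engine name should await `ready()`; a caller that must stay synchronous can check `webGPUResolved` and treat the name as provisional while it is `false`.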