diff --git a/.github/workflows/docker-publish.yml b/.github/workflows/docker-publish.yml index b489d2f..6a5177a 100644 --- a/.github/workflows/docker-publish.yml +++ b/.github/workflows/docker-publish.yml @@ -39,7 +39,7 @@ jobs: tags: | type=ref,event=branch type=ref,event=pr - type=sha,prefix={{branch}}- + type=sha,prefix=sha- type=raw,value=latest,enable={{is_default_branch}} - name: Build and push Docker image diff --git a/changelogs/CHANGELOG-docker-tag-fix.md b/changelogs/CHANGELOG-docker-tag-fix.md new file mode 100644 index 0000000..b1fcecf --- /dev/null +++ b/changelogs/CHANGELOG-docker-tag-fix.md @@ -0,0 +1,11 @@ +# Fix invalid Docker image tag on PR builds + +- Fixed the `build-and-push` GitHub Actions job failing on pull requests with `invalid tag "...:-": invalid reference format` +- Root cause: `docker/metadata-action` used `type=sha,prefix={{branch}}-`, but `{{branch}}` resolves to an empty string on `pull_request` events, producing a tag with a leading hyphen (`:-b0503a8`) — which Docker rejects +- Changed the SHA tag prefix to a static `sha-`, valid across branch pushes, PRs, and the default branch; branch/PR identity is still captured by the separate `type=ref,event=branch` and `type=ref,event=pr` tag rules + +--- + +## Summary + +The Docker publish workflow couldn't build on any pull request because the SHA-based image tag was constructed with `{{branch}}-`, and the `branch` template is empty on PR events — yielding an invalid `:-` tag. Using a constant `sha-` prefix fixes PR builds without losing branch/PR information (the ref-based tags already encode that). 
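Review note on the root cause above: a Docker tag may contain letters, digits, underscores, periods, and hyphens, must not start with a period or hyphen, and is at most 128 characters. The sketch below is a minimal illustration of why an empty `{{branch}}` value yields a rejected tag. `isValidDockerTag` and `buildShaTag` are hypothetical helpers written for this note; they are not part of `docker/metadata-action`, which does its own templating.

```javascript
// Hypothetical helpers for illustration only.
// The regex mirrors Docker's documented tag rules: first character is a
// letter, digit, or underscore; up to 127 further characters may also
// include periods and hyphens.
const TAG_RE = /^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$/;

function isValidDockerTag(tag) {
  return TAG_RE.test(tag);
}

// Concatenates a resolved prefix with the short SHA, as the tag rule does.
function buildShaTag(prefix, shortSha) {
  return `${prefix}${shortSha}`;
}

// On pull_request events {{branch}} resolves to '', so the prefix is just '-'.
const branchOnPr = '';
const oldTagOnPr = buildShaTag(`${branchOnPr}-`, 'b0503a8');
const newTag = buildShaTag('sha-', 'b0503a8');

console.log(oldTagOnPr, isValidDockerTag(oldTagOnPr)); // -b0503a8 false
console.log(newTag, isValidDockerTag(newTag));         // sha-b0503a8 true
```

A static prefix avoids the problem because the first character of the tag no longer depends on event-specific template values.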
diff --git a/changelogs/CHANGELOG-stt-audio-improvements.md b/changelogs/CHANGELOG-stt-audio-improvements.md
new file mode 100644
index 0000000..ecdcbe0
--- /dev/null
+++ b/changelogs/CHANGELOG-stt-audio-improvements.md
@@ -0,0 +1,44 @@
+# Speech-to-Text Improvements: Low-End Tier, Streaming & Reliability Fixes
+
+- Added a **low-end device tier** to the Whisper (WASM) fallback: on low-RAM / few-core devices (`navigator.deviceMemory ≤ 4 GB` or `hardwareConcurrency ≤ 4`), `speech-worker.js` now loads **multilingual `whisper-tiny`** (~75 MB, q4) instead of `whisper-large-v3-turbo` (~800 MB), giving working dictation on Chromebooks / older phones that previously couldn't load any STT model
+- Used **multilingual `whisper-tiny`, not `tiny.en`**, so the 14-language support is preserved on exactly the devices that fall back to it
+- Added **streaming partial results** to the Whisper path via `WhisperTextStreamer`: interim text now appears as tokens decode instead of a blank field until the final result (closes the UX-parity gap with Voxtral, which already streamed)
+- **Fixed (race):** WebGPU detection was async + fire-and-forget, so `sttModelName` and the engine indicator could show a stale value before detection resolved; added a `webGPUResolved` flag, a post-detection indicator refresh, and a public `M.speechToText.ready()` that resolves once the engine choice is final
+- **Fixed (silent failure):** the neural engine's microphone (`getUserMedia` in `startAudioCapture`) is opened separately from the Web Speech API's internal stream; on mobile the second request can be denied while Web Speech keeps working, previously failing silently; it now cleans up partial state and surfaces a toast + interim message ("using Web Speech only"), distinguishing permission-denied from other failures
+- Consent popup now reflects the chosen tier (shows "~75 MB / Whisper Tiny" on low-end devices instead of always "~800 MB")
+
+---
+
+## Summary
+
+The WASM speech-to-text fallback previously hard-loaded the ~800 MB Whisper-Large-V3-Turbo model with one-shot (non-streaming) results and no path for low-end devices. This adds a device-capability-aware tier that loads a ~75 MB multilingual Whisper-Tiny on constrained hardware, streams partial transcriptions as they decode, and fixes two reliability issues found in an audit: a WebGPU-detection race that could surface the wrong engine name, and a silent microphone failure on mobile that left users believing the neural engine was running when only Web Speech was.
+
+---
+
+## 1. Low-End STT Tier
+**Files:** `js/speech-worker.js`, `js/speechToText.js`
+**What:** Added a `TIERS` map (`turbo` → whisper-large-v3-turbo q8; `tiny` → whisper-tiny q4) and a `pickTier()` device probe in the worker, mirrored by `pickWhisperTier()` on the main thread. The caller passes a `tier` hint in the `init` message (only for the WASM/Whisper path; Voxtral/WebGPU ignores it). When `deviceMemory` is unavailable (non-Chromium browsers), only the core-count check applies; a device that reports neither value defaults to `turbo`.
+**Impact:** Devices with ≤4 GB RAM or ≤4 cores get a ~75 MB model that actually loads and runs, instead of failing on the ~800 MB download.
+
+## 2. Whisper Streaming
+**Files:** `js/speech-worker.js`
+**What:** Wrapped transcription in a `WhisperTextStreamer` (transformers.js 3.8.1) whose `callback_function` posts `partial` messages as text decodes. The main thread already had a `partial` handler that renders interim text. Falls back to one-shot if the streamer can't be constructed.
+**Impact:** Live interim feedback during transcription instead of a blank field for the duration of the clip.
+
+## 3. WebGPU-Detection Race Fix
+**Files:** `js/speechToText.js`
+**What:** Added `webGPUResolved`; the detection IIFE now refreshes the engine indicator on completion, and a new `M.speechToText.ready()` resolves once `webGPUPromise` settles. `getEngines()` exposes `webGPUResolved`.
+**Impact:** UI/API never report a stale STT engine while detection is mid-flight. (Worker selection was already correctly gated on `webGPUPromise`.) + +## 4. Silent Microphone Failure Fix +**Files:** `js/speechToText.js` +**What:** `startAudioCapture`'s catch block now disconnects partial audio nodes, stops the stream, and shows a toast + interim message instead of only `console.warn`. Detects `NotAllowedError` / `NotFoundError` / `SecurityError` for a clearer "couldn't access the mic" message. +**Impact:** Users are told when the higher-quality engine isn't running, instead of silently getting only Web Speech. + +--- + +## Testing + +- Vite build compiles clean (validates the `WhisperTextStreamer` import and all module syntax). +- `stt-tag.spec.js` + `speech-commands.spec.js`: 22/22 passing, no regressions. +- Verified live: `ready()` resolves with the final engine; tier heuristic correct across boundary cases (≤4 GB → tiny, unknown → turbo). diff --git a/js/speech-worker.js b/js/speech-worker.js index cfe20d3..95c6ac9 100644 --- a/js/speech-worker.js +++ b/js/speech-worker.js @@ -1,11 +1,18 @@ // ============================================ -// speech-worker.js — Whisper Large V3 Turbo ASR WebWorker (WASM fallback) +// speech-worker.js — Whisper ASR WebWorker (WASM fallback) // Used when WebGPU is NOT available. WebGPU devices use voxtral-worker.js. -// Runs textagent/whisper-large-v3-turbo via @huggingface/transformers -// off the main thread for jank-free transcription. -// WER ~7.7% (batched) +// Runs Whisper via @huggingface/transformers off the main thread. +// +// Two tiers (chosen by device capability, or forced by the caller via `tier`): +// • 'turbo' → whisper-large-v3-turbo (~800 MB q8, WER ~7.7%) — default +// • 'tiny' → whisper-tiny (~75 MB q4) — low-end devices +// IMPORTANT: the low-end model is MULTILINGUAL whisper-tiny, NOT tiny.en, so the +// 14-language support is preserved on exactly the devices that fall back to it. 
+// +// Streaming: a WhisperTextStreamer emits partial text as tokens decode, posted as +// `partial` messages (the main thread renders these as live interim text). // ============================================ -import { pipeline, env } from '@huggingface/transformers'; +import { pipeline, env, WhisperTextStreamer } from '@huggingface/transformers'; // Model host — downloads ONNX models from textagent HuggingFace org const MODEL_HOST = 'https://huggingface.co'; @@ -13,16 +20,43 @@ const MODEL_ORG_FALLBACK = 'onnx-community'; env.remoteHost = MODEL_HOST; let transcriber = null; +let activeTier = 'turbo'; + +// Tier definitions. dtype/model id chosen per tier; both resolve under the +// textagent org first, then fall back to onnx-community. +const TIERS = { + turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' }, + tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…' }, +}; + +// Decide a tier from device capability when the caller doesn't force one. +// Heuristic: low RAM or few cores → the lightweight model. deviceMemory is in GB +// (Chromium-only; undefined elsewhere, in which case we keep the default turbo). +function pickTier() { + const mem = typeof navigator !== 'undefined' ? navigator.deviceMemory : undefined; + const cores = typeof navigator !== 'undefined' ? navigator.hardwareConcurrency : undefined; + if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) { + return 'tiny'; + } + return 'turbo'; +} self.addEventListener('message', async (e) => { const { type, audio } = e.data; if (type === 'init') { try { - self.postMessage({ type: 'status', status: 'loading', message: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' }); + // Caller may force a tier ('tiny' | 'turbo'); otherwise probe the device. 
+ activeTier = (e.data.tier === 'tiny' || e.data.tier === 'turbo') ? e.data.tier : pickTier(); + const tier = TIERS[activeTier]; + + self.postMessage({ type: 'status', status: 'loading', message: tier.dlMsg }); + + // whisperModelId is mutated on org fallback; referenced by the progress callback. + let whisperModelId = tier.id; const pipelineOpts = { - dtype: 'q8', + dtype: tier.dtype, device: 'wasm', progress_callback: (progress) => { if (progress.status === 'progress') { @@ -49,7 +83,6 @@ self.addEventListener('message', async (e) => { }; // Try primary org (textagent), fall back to onnx-community - let whisperModelId = 'textagent/whisper-large-v3-turbo'; try { transcriber = await pipeline( 'automatic-speech-recognition', @@ -70,9 +103,10 @@ self.addEventListener('message', async (e) => { self.postMessage({ type: 'status', status: 'ready', - message: 'Whisper ready', + message: tier.label + ' ready', device: 'CPU (WASM)', - model: 'Whisper V3 Turbo', + model: tier.label, + tier: activeTier, }); } catch (err) { self.postMessage({ type: 'error', message: err.message || String(err) }); @@ -103,9 +137,31 @@ self.addEventListener('message', async (e) => { // Use language from caller, default to 'en' const lang = e.data.lang || 'en'; + + // Stream partial text as tokens decode so the user sees live interim + // results instead of staring at a blank field until the final result. + // WhisperTextStreamer skips special tokens and only emits readable text. + let streamed = ''; + let streamer = null; + try { + streamer = new WhisperTextStreamer(transcriber.tokenizer, { + skip_prompt: true, + callback_function: (partial) => { + streamed += partial; + const t = streamed.trim(); + if (t) self.postMessage({ type: 'partial', text: t }); + }, + }); + } catch (_) { + // If the streamer can't be constructed for any reason, fall back to + // a plain one-shot transcription below (streamer stays null). 
+ streamer = null; + } + const result = await transcriber(normalizedAudio, { language: lang, return_timestamps: false, + streamer: streamer || undefined, }); self.postMessage({ type: 'result', text: result.text }); } catch (err) { diff --git a/js/speechToText.js b/js/speechToText.js index 76147a5..73c3d63 100644 --- a/js/speechToText.js +++ b/js/speechToText.js @@ -18,6 +18,7 @@ // ── WebGPU Detection ────────────────────────── // Determines whether to use Voxtral (WebGPU) or Whisper (WASM) let hasWebGPU = false; + let webGPUResolved = false; // true once detection has completed let sttModelName = 'Whisper V3 Turbo'; // default, updated after detection const webGPUPromise = (async () => { if (typeof navigator !== 'undefined' && navigator.gpu) { @@ -26,8 +27,25 @@ if (adapter) { hasWebGPU = true; sttModelName = 'Voxtral Mini 3B'; } } catch (_) { /* no WebGPU */ } } + webGPUResolved = true; + // Detection finished after the UI may have rendered with the default model + // name — refresh any visible engine label so it reflects the real engine. + try { updateEngineIndicator(); } catch (_) { /* indicator not built yet */ } })(); + // ── Low-end device probe (for the WASM/Whisper fallback only) ── + // On low-RAM / few-core devices the ~800 MB Whisper-Large model is impractical, + // so we ask speech-worker.js to load multilingual whisper-tiny (~75 MB) instead. + // Returns 'tiny' | 'turbo'. WebGPU devices use Voxtral and ignore this. 
+ function pickWhisperTier() { + const mem = navigator.deviceMemory; // GB, Chromium-only + const cores = navigator.hardwareConcurrency; + if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) { + return 'tiny'; + } + return 'turbo'; + } + // ── State ────────────────────────────────────── let isListening = false; let lastInsertTime = Date.now(); @@ -627,7 +645,22 @@ processorNode.connect(audioContext.destination); console.log('🧠 Audio capture started at', audioContext.sampleRate, 'Hz'); } catch (err) { + // The mic for the neural engine (Whisper/Voxtral) is opened separately from the + // Web Speech API's internal stream. On mobile, this second getUserMedia() can be + // denied or fail while Web Speech keeps working — previously this failed silently, + // leaving the user to believe the higher-quality engine was running. Surface it. console.warn('🧠 Audio capture failed:', err); + // Clean up any partially-created nodes/stream so a retry starts fresh. + if (processorNode) { try { processorNode.disconnect(); } catch (_) {} processorNode = null; } + if (sourceNode) { try { sourceNode.disconnect(); } catch (_) {} sourceNode = null; } + if (audioContext) { audioContext.close().catch(() => {}); audioContext = null; } + if (mediaStream) { mediaStream.getTracks().forEach(t => t.stop()); mediaStream = null; } + const denied = err && (err.name === 'NotAllowedError' || err.name === 'NotFoundError' || err.name === 'SecurityError'); + const msg = denied + ? `🎙️ ${sttModelName} couldn't access the mic — using Web Speech only` + : `🎙️ ${sttModelName} audio capture failed — using Web Speech only`; + if (interimText) interimText.textContent = msg; + if (M.showToast) M.showToast(msg, 'warning'); } } @@ -921,8 +954,12 @@ // ── STT Model Download Consent Popup ──────────── function showSttConsentPopup(modelType) { const isVoxtral = modelType === 'voxtral'; - const modelName = isVoxtral ? 
'Voxtral Mini 3B' : 'Whisper Large V3 Turbo'; - const downloadSize = isVoxtral ? '~2.7 GB' : '~800 MB'; + // On low-end devices the Whisper path loads the lightweight tiny model — + // reflect that in the consent popup so the size estimate is honest. + const whisperTier = isVoxtral ? null : pickWhisperTier(); + const isTiny = whisperTier === 'tiny'; + const modelName = isVoxtral ? 'Voxtral Mini 3B' : (isTiny ? 'Whisper Tiny' : 'Whisper Large V3 Turbo'); + const downloadSize = isVoxtral ? '~2.7 GB' : (isTiny ? '~75 MB' : '~800 MB'); const deviceLabel = isVoxtral ? 'GPU (WebGPU)' : 'CPU (WASM)'; // Create overlay @@ -964,7 +1001,7 @@ // Proceed with model download initWorker(); if (!modelReady && !modelLoading) { - worker.postMessage({ type: 'init' }); + worker.postMessage({ type: 'init', tier: hasWebGPU ? undefined : pickWhisperTier() }); } }); @@ -1008,7 +1045,8 @@ // Already consented or model cached — proceed directly initWorker(); if (!modelReady && !modelLoading) { - worker.postMessage({ type: 'init' }); + // tier only matters for the Whisper (WASM) worker; Voxtral ignores it + worker.postMessage({ type: 'init', tier: hasWebGPU ? undefined : pickWhisperTier() }); } else if (modelReady) { startAudioCapture(); } @@ -1114,8 +1152,11 @@ sttModel: sttModelName, sttReady: modelReady, webGPU: hasWebGPU, + webGPUResolved: webGPUResolved, aiRefine: aiRefineEnabled, }), + /** Resolves once WebGPU detection has completed (engine choice is final). */ + ready: () => webGPUPromise.then(() => M.speechToText.getEngines()), /** Start recording in card mode — text routes to callbacks instead of editor */ startForCard: (onText, onInterim) => { // Force-stop any active session first (allows re-recording after Clear)
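Review note on the tier heuristic: the same check appears in `pickTier()` (worker) and `pickWhisperTier()` (main thread), and its boundary behavior is easier to see as a pure function. The sketch below is a minimal restatement under that assumption. `pickTierFrom` is a hypothetical name, and its `nav` argument stands in for `navigator` so the cases can run outside a browser.

```javascript
// Hypothetical pure variant of the heuristic in this diff, for illustration.
// `nav` stands in for `navigator`; missing fields are treated as "unknown".
function pickTierFrom(nav) {
  const mem = nav ? nav.deviceMemory : undefined;          // GB, Chromium-only
  const cores = nav ? nav.hardwareConcurrency : undefined; // logical cores
  const lowMem = typeof mem === 'number' && mem <= 4;
  const fewCores = typeof cores === 'number' && cores <= 4;
  return (lowMem || fewCores) ? 'tiny' : 'turbo';
}

console.log(pickTierFrom({ deviceMemory: 4, hardwareConcurrency: 8 })); // tiny
console.log(pickTierFrom({ deviceMemory: 8, hardwareConcurrency: 8 })); // turbo
console.log(pickTierFrom({ hardwareConcurrency: 4 }));                  // tiny
console.log(pickTierFrom({}));                                          // turbo
```

The third case shows that a browser without `deviceMemory` still selects `tiny` when it reports four or fewer cores; only a device that reports neither value falls through to `turbo`.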