Merged
2 changes: 1 addition & 1 deletion .github/workflows/docker-publish.yml
@@ -39,7 +39,7 @@ jobs:
tags: |
type=ref,event=branch
type=ref,event=pr
type=sha,prefix={{branch}}-
type=sha,prefix=sha-
type=raw,value=latest,enable={{is_default_branch}}

- name: Build and push Docker image
11 changes: 11 additions & 0 deletions changelogs/CHANGELOG-docker-tag-fix.md
@@ -0,0 +1,11 @@
# Fix invalid Docker image tag on PR builds

- Fixed the `build-and-push` GitHub Actions job failing on pull requests with `invalid tag "...:-<sha>": invalid reference format`
- Root cause: `docker/metadata-action` used `type=sha,prefix={{branch}}-`, but `{{branch}}` resolves to an empty string on `pull_request` events, producing a tag with a leading hyphen (`:-b0503a8`), which Docker rejects
- Changed the SHA tag prefix to a static `sha-`, valid across branch pushes, PRs, and the default branch; branch/PR identity is still captured by the separate `type=ref,event=branch` and `type=ref,event=pr` tag rules

---

## Summary

The Docker publish workflow couldn't build on any pull request because the SHA-based image tag was constructed with `{{branch}}-`, and the `branch` template is empty on PR events, yielding an invalid `:-<sha>` tag. Using a constant `sha-` prefix fixes PR builds without losing branch/PR information (the ref-based tags already encode that).
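
The failure mode can be reproduced outside the workflow with a small sketch. It assumes Docker's documented tag rule (first character a letter, digit, or underscore, then up to 127 letters, digits, underscores, periods, or hyphens); `shaTag` is a hypothetical helper that only mimics how the prefix and short SHA are joined, not the action's actual implementation.

```javascript
// Docker tag rule: must not start with '.' or '-', at most 128 characters.
const TAG_RE = /^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$/;

// Hypothetical helper: joins a rendered prefix and a short commit SHA.
function shaTag(prefix, shortSha) {
  return `${prefix}${shortSha}`;
}

const branch = ''; // what {{branch}} renders to on pull_request events
const before = shaTag(`${branch}-`, 'b0503a8');
const after = shaTag('sha-', 'b0503a8');

console.log(before, TAG_RE.test(before)); // -b0503a8 false
console.log(after, TAG_RE.test(after));   // sha-b0503a8 true
```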
44 changes: 44 additions & 0 deletions changelogs/CHANGELOG-stt-audio-improvements.md
@@ -0,0 +1,44 @@
# Speech-to-Text Improvements: Low-End Tier, Streaming & Reliability Fixes

- Added a **low-end device tier** to the Whisper (WASM) fallback: on low-RAM / few-core devices (`navigator.deviceMemory ≤ 4 GB` or `hardwareConcurrency ≤ 4`), `speech-worker.js` now loads **multilingual `whisper-tiny`** (~75 MB, q4) instead of `whisper-large-v3-turbo` (~800 MB), giving working dictation on Chromebooks / older phones that previously couldn't load any STT model
- Used **multilingual `whisper-tiny`, not `tiny.en`**, so the 14-language support is preserved on exactly the devices that fall back to it
- Added **streaming partial results** to the Whisper path via `WhisperTextStreamer`: interim text now appears as tokens decode instead of a blank field until the final result (closes the UX-parity gap with Voxtral, which already streamed)
- **Fixed (race):** WebGPU detection was async + fire-and-forget, so `sttModelName` and the engine indicator could show a stale value before detection resolved; added a `webGPUResolved` flag, a post-detection indicator refresh, and a public `M.speechToText.ready()` that resolves once the engine choice is final
- **Fixed (silent failure):** the neural engine's microphone (`getUserMedia` in `startAudioCapture`) is opened separately from the Web Speech API's internal stream; on mobile the second request can be denied while Web Speech keeps working, which previously failed silently. It now cleans up partial state and surfaces a toast + interim message ("using Web Speech only"), distinguishing permission-denied from other failures
- Consent popup now reflects the chosen tier (shows "~75 MB / Whisper Tiny" on low-end devices instead of always "~800 MB")

---

## Summary

The WASM speech-to-text fallback previously hard-loaded the ~800 MB Whisper-Large-V3-Turbo model with one-shot (non-streaming) results and no path for low-end devices. This adds a device-capability-aware tier that loads a ~75 MB multilingual Whisper-Tiny on constrained hardware, streams partial transcriptions as they decode, and fixes two reliability issues found in an audit: a WebGPU-detection race that could surface the wrong engine name, and a silent microphone failure on mobile that left users believing the neural engine was running when only Web Speech was.

---

## 1. Low-End STT Tier
**Files:** `js/speech-worker.js`, `js/speechToText.js`
**What:** Added a `TIERS` map (`turbo` → whisper-large-v3-turbo q8; `tiny` → whisper-tiny q4) and a `pickTier()` device probe in the worker, mirrored by `pickWhisperTier()` on the main thread. The caller passes a `tier` hint in the `init` message (only for the WASM/Whisper path; Voxtral/WebGPU ignores it). Devices that expose neither signal default to `turbo`; non-Chromium browsers, which lack `deviceMemory`, are classified by core count alone.
**Impact:** Devices with ≤4 GB RAM or ≤4 cores get a ~75 MB model that actually loads and runs, instead of failing on the ~800 MB download.
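
The heuristic is easy to check in isolation. The sketch below follows the same rule as `pickTier()`, but takes a navigator-like object as a parameter so it runs outside a browser; that parameter is an assumption made for illustration and is not how the worker is written.

```javascript
// Same rule as the worker's probe: low RAM or few cores selects the small model.
// `nav` is a stand-in for `navigator` (an assumption, to allow testing in Node).
function pickTier(nav = {}) {
  const mem = nav.deviceMemory;          // GB, Chromium-only
  const cores = nav.hardwareConcurrency; // logical cores
  const lowMem = typeof mem === 'number' && mem <= 4;
  const fewCores = typeof cores === 'number' && cores <= 4;
  return lowMem || fewCores ? 'tiny' : 'turbo';
}

console.log(pickTier({ deviceMemory: 4, hardwareConcurrency: 8 })); // tiny
console.log(pickTier({ deviceMemory: 8, hardwareConcurrency: 8 })); // turbo
console.log(pickTier({ hardwareConcurrency: 4 }));                  // tiny (no deviceMemory)
console.log(pickTier({}));                                          // turbo (nothing reported)
```

Note the third case: a browser that does not report `deviceMemory` can still be routed to `tiny` by its core count.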

## 2. Whisper Streaming
**Files:** `js/speech-worker.js`
**What:** Wrapped transcription in a `WhisperTextStreamer` (transformers.js 3.8.1) whose `callback_function` posts `partial` messages as text decodes. The main thread already had a `partial` handler that renders interim text. Falls back to one-shot if the streamer can't be constructed.
**Impact:** Live interim feedback during transcription instead of a blank field for the duration of the clip.
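
The accumulate-and-post behaviour of the streamer callback can be sketched without loading a model. `makePartialEmitter` is a hypothetical name, and the `post` argument stands in for `self.postMessage`; only the accumulation and trimming mirror the worker code.

```javascript
// Builds a callback that accumulates decoded chunks and posts the running
// transcript, skipping updates that are still empty after trimming.
function makePartialEmitter(post) {
  let streamed = '';
  return (chunk) => {
    streamed += chunk;
    const text = streamed.trim();
    if (text) post({ type: 'partial', text });
  };
}

const messages = [];
const onChunk = makePartialEmitter((msg) => messages.push(msg));
[' ', ' Hello', ' world'].forEach((chunk) => onChunk(chunk));

console.log(messages.map((m) => m.text)); // [ 'Hello', 'Hello world' ]
```

Each message carries the full text so far rather than a delta, so the receiver can overwrite the interim field instead of appending to it.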

## 3. WebGPU-Detection Race Fix
**Files:** `js/speechToText.js`
**What:** Added `webGPUResolved`; the detection IIFE now refreshes the engine indicator on completion, and a new `M.speechToText.ready()` resolves once `webGPUPromise` settles. `getEngines()` exposes `webGPUResolved`.
**Impact:** UI/API never report a stale STT engine while detection is mid-flight. (Worker selection was already correctly gated on `webGPUPromise`.)
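
A reduced model of the fix, under stated assumptions: `createEngineState` is a hypothetical factory, and `detect` stands in for the `navigator.gpu` adapter probe. It shows why a synchronous read can see the default model while `ready()` always sees the final one.

```javascript
// Hypothetical factory; `detect` replaces the real WebGPU adapter probe.
function createEngineState(detect) {
  const state = { webGPU: false, webGPUResolved: false, sttModel: 'Whisper V3 Turbo' };
  const getEngines = () => ({ ...state });

  // Detection starts immediately; the promise is kept so callers can await it.
  const detection = (async () => {
    try {
      if (await detect()) {
        state.webGPU = true;
        state.sttModel = 'Voxtral Mini 3B';
      }
    } catch (_) {
      // A failed probe is treated the same as "no WebGPU".
    }
    state.webGPUResolved = true;
  })();

  return { getEngines, ready: () => detection.then(getEngines) };
}

const engines = createEngineState(async () => true);
console.log(engines.getEngines().webGPUResolved); // false: detection still in flight
engines.ready().then((final) => {
  console.log(final.sttModel, final.webGPUResolved); // Voxtral Mini 3B true
});
```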

## 4. Silent Microphone Failure Fix
**Files:** `js/speechToText.js`
**What:** `startAudioCapture`'s catch block now disconnects partial audio nodes, stops the stream, and shows a toast + interim message instead of only `console.warn`. Detects `NotAllowedError` / `NotFoundError` / `SecurityError` for a clearer "couldn't access the mic" message.
**Impact:** Users are told when the higher-quality engine isn't running, instead of silently getting only Web Speech.
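
The error classification can be sketched as a pure function. `micFailureMessage` is a hypothetical name and the wording is simplified; the three error names are the ones the catch block checks.

```javascript
// Error names that mean the microphone could not be opened at all.
const ACCESS_ERRORS = new Set(['NotAllowedError', 'NotFoundError', 'SecurityError']);

// Hypothetical helper: picks the user-facing message for a capture failure.
function micFailureMessage(err, modelName) {
  const denied = Boolean(err) && ACCESS_ERRORS.has(err.name);
  return denied
    ? `${modelName} couldn't access the mic, using Web Speech only`
    : `${modelName} audio capture failed, using Web Speech only`;
}

console.log(micFailureMessage({ name: 'NotAllowedError' }, 'Whisper Tiny'));
console.log(micFailureMessage(new Error('AudioContext failed'), 'Whisper Tiny'));
```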

---

## Testing

- Vite build compiles clean (validates the `WhisperTextStreamer` import and all module syntax).
- `stt-tag.spec.js` + `speech-commands.spec.js`: 22/22 passing, no regressions.
- Verified live: `ready()` resolves with the final engine; tier heuristic correct across boundary cases (≤4 GB → tiny, unknown → turbo).
76 changes: 66 additions & 10 deletions js/speech-worker.js
@@ -1,28 +1,62 @@
// ============================================
// speech-worker.js - Whisper Large V3 Turbo ASR WebWorker (WASM fallback)
// speech-worker.js - Whisper ASR WebWorker (WASM fallback)
// Used when WebGPU is NOT available. WebGPU devices use voxtral-worker.js.
// Runs textagent/whisper-large-v3-turbo via @huggingface/transformers
// off the main thread for jank-free transcription.
// WER ~7.7% (batched)
// Runs Whisper via @huggingface/transformers off the main thread.
//
// Two tiers (chosen by device capability, or forced by the caller via `tier`):
// • 'turbo' → whisper-large-v3-turbo (~800 MB q8, WER ~7.7%) - default
// • 'tiny' → whisper-tiny (~75 MB q4) - low-end devices
// IMPORTANT: the low-end model is MULTILINGUAL whisper-tiny, NOT tiny.en, so the
// 14-language support is preserved on exactly the devices that fall back to it.
//
// Streaming: a WhisperTextStreamer emits partial text as tokens decode, posted as
// `partial` messages (the main thread renders these as live interim text).
// ============================================
import { pipeline, env } from '@huggingface/transformers';
import { pipeline, env, WhisperTextStreamer } from '@huggingface/transformers';

// Model host - downloads ONNX models from textagent HuggingFace org
const MODEL_HOST = 'https://huggingface.co';
const MODEL_ORG_FALLBACK = 'onnx-community';
env.remoteHost = MODEL_HOST;

let transcriber = null;
let activeTier = 'turbo';

// Tier definitions. dtype/model id chosen per tier; both resolve under the
// textagent org first, then fall back to onnx-community.
const TIERS = {
turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' },
tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…' },
};

// Decide a tier from device capability when the caller doesn't force one.
// Heuristic: low RAM or few cores → the lightweight model. deviceMemory is in GB
// (Chromium-only; undefined elsewhere, in which case we keep the default turbo).
function pickTier() {
const mem = typeof navigator !== 'undefined' ? navigator.deviceMemory : undefined;
const cores = typeof navigator !== 'undefined' ? navigator.hardwareConcurrency : undefined;
if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) {
return 'tiny';
}
return 'turbo';
}

self.addEventListener('message', async (e) => {
const { type, audio } = e.data;

if (type === 'init') {
try {
self.postMessage({ type: 'status', status: 'loading', message: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' });
// Caller may force a tier ('tiny' | 'turbo'); otherwise probe the device.
activeTier = (e.data.tier === 'tiny' || e.data.tier === 'turbo') ? e.data.tier : pickTier();
const tier = TIERS[activeTier];

self.postMessage({ type: 'status', status: 'loading', message: tier.dlMsg });

// whisperModelId is mutated on org fallback; referenced by the progress callback.
let whisperModelId = tier.id;

const pipelineOpts = {
dtype: 'q8',
dtype: tier.dtype,
device: 'wasm',
progress_callback: (progress) => {
if (progress.status === 'progress') {
@@ -49,7 +83,6 @@ self.addEventListener('message', async (e) => {
};

// Try primary org (textagent), fall back to onnx-community
let whisperModelId = 'textagent/whisper-large-v3-turbo';
try {
transcriber = await pipeline(
'automatic-speech-recognition',
@@ -70,9 +103,10 @@ self.addEventListener('message', async (e) => {
self.postMessage({
type: 'status',
status: 'ready',
message: 'Whisper ready',
message: tier.label + ' ready',
device: 'CPU (WASM)',
model: 'Whisper V3 Turbo',
model: tier.label,
tier: activeTier,
});
} catch (err) {
self.postMessage({ type: 'error', message: err.message || String(err) });
@@ -103,9 +137,31 @@ self.addEventListener('message', async (e) => {

// Use language from caller, default to 'en'
const lang = e.data.lang || 'en';

// Stream partial text as tokens decode so the user sees live interim
// results instead of staring at a blank field until the final result.
// WhisperTextStreamer skips special tokens and only emits readable text.
let streamed = '';
let streamer = null;
try {
streamer = new WhisperTextStreamer(transcriber.tokenizer, {
skip_prompt: true,
callback_function: (partial) => {
streamed += partial;
const t = streamed.trim();
if (t) self.postMessage({ type: 'partial', text: t });
},
});
} catch (_) {
// If the streamer can't be constructed for any reason, fall back to
// a plain one-shot transcription below (streamer stays null).
streamer = null;
}

const result = await transcriber(normalizedAudio, {
language: lang,
return_timestamps: false,
streamer: streamer || undefined,
});
self.postMessage({ type: 'result', text: result.text });
} catch (err) {
49 changes: 45 additions & 4 deletions js/speechToText.js
@@ -18,6 +18,7 @@
// ── WebGPU Detection ──────────────────────────
// Determines whether to use Voxtral (WebGPU) or Whisper (WASM)
let hasWebGPU = false;
let webGPUResolved = false; // true once detection has completed
let sttModelName = 'Whisper V3 Turbo'; // default, updated after detection
const webGPUPromise = (async () => {
if (typeof navigator !== 'undefined' && navigator.gpu) {
@@ -26,8 +27,25 @@
if (adapter) { hasWebGPU = true; sttModelName = 'Voxtral Mini 3B'; }
} catch (_) { /* no WebGPU */ }
}
webGPUResolved = true;
// Detection finished after the UI may have rendered with the default model
// name - refresh any visible engine label so it reflects the real engine.
try { updateEngineIndicator(); } catch (_) { /* indicator not built yet */ }
})();

// ── Low-end device probe (for the WASM/Whisper fallback only) ──
// On low-RAM / few-core devices the ~800 MB Whisper-Large model is impractical,
// so we ask speech-worker.js to load multilingual whisper-tiny (~75 MB) instead.
// Returns 'tiny' | 'turbo'. WebGPU devices use Voxtral and ignore this.
function pickWhisperTier() {
const mem = navigator.deviceMemory; // GB, Chromium-only
const cores = navigator.hardwareConcurrency;
if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) {
return 'tiny';
}
return 'turbo';
}

// ── State ──────────────────────────────────────
let isListening = false;
let lastInsertTime = Date.now();
@@ -627,7 +645,22 @@
processorNode.connect(audioContext.destination);
console.log('🧠 Audio capture started at', audioContext.sampleRate, 'Hz');
} catch (err) {
// The mic for the neural engine (Whisper/Voxtral) is opened separately from the
// Web Speech API's internal stream. On mobile, this second getUserMedia() can be
// denied or fail while Web Speech keeps working - previously this failed silently,
// leaving the user to believe the higher-quality engine was running. Surface it.
console.warn('🧠 Audio capture failed:', err);
// Clean up any partially-created nodes/stream so a retry starts fresh.
if (processorNode) { try { processorNode.disconnect(); } catch (_) {} processorNode = null; }
if (sourceNode) { try { sourceNode.disconnect(); } catch (_) {} sourceNode = null; }
if (audioContext) { audioContext.close().catch(() => {}); audioContext = null; }
if (mediaStream) { mediaStream.getTracks().forEach(t => t.stop()); mediaStream = null; }
const denied = err && (err.name === 'NotAllowedError' || err.name === 'NotFoundError' || err.name === 'SecurityError');
const msg = denied
? `🎙️ ${sttModelName} couldn't access the mic \u2014 using Web Speech only`
: `🎙️ ${sttModelName} audio capture failed \u2014 using Web Speech only`;
if (interimText) interimText.textContent = msg;
if (M.showToast) M.showToast(msg, 'warning');
}
}

@@ -921,8 +954,12 @@
// ── STT Model Download Consent Popup ────────────
function showSttConsentPopup(modelType) {
const isVoxtral = modelType === 'voxtral';
const modelName = isVoxtral ? 'Voxtral Mini 3B' : 'Whisper Large V3 Turbo';
const downloadSize = isVoxtral ? '~2.7 GB' : '~800 MB';
// On low-end devices the Whisper path loads the lightweight tiny model -
// reflect that in the consent popup so the size estimate is honest.
const whisperTier = isVoxtral ? null : pickWhisperTier();
const isTiny = whisperTier === 'tiny';
const modelName = isVoxtral ? 'Voxtral Mini 3B' : (isTiny ? 'Whisper Tiny' : 'Whisper Large V3 Turbo');
const downloadSize = isVoxtral ? '~2.7 GB' : (isTiny ? '~75 MB' : '~800 MB');
const deviceLabel = isVoxtral ? 'GPU (WebGPU)' : 'CPU (WASM)';

// Create overlay
@@ -964,7 +1001,7 @@
// Proceed with model download
initWorker();
if (!modelReady && !modelLoading) {
worker.postMessage({ type: 'init' });
worker.postMessage({ type: 'init', tier: hasWebGPU ? undefined : pickWhisperTier() });
}
});

@@ -1008,7 +1045,8 @@
// Already consented or model cached - proceed directly
initWorker();
if (!modelReady && !modelLoading) {
worker.postMessage({ type: 'init' });
// tier only matters for the Whisper (WASM) worker; Voxtral ignores it
worker.postMessage({ type: 'init', tier: hasWebGPU ? undefined : pickWhisperTier() });
} else if (modelReady) {
startAudioCapture();
}
@@ -1114,8 +1152,11 @@
sttModel: sttModelName,
sttReady: modelReady,
webGPU: hasWebGPU,
webGPUResolved: webGPUResolved,
aiRefine: aiRefineEnabled,
}),
/** Resolves once WebGPU detection has completed (engine choice is final). */
ready: () => webGPUPromise.then(() => M.speechToText.getEngines()),
/** Start recording in card mode - text routes to callbacks instead of editor */
startForCard: (onText, onInterim) => {
// Force-stop any active session first (allows re-recording after Clear)