Textagent · ijbo · Jun 22, 2026
diff --git a/changelogs/CHANGELOG-onnx-community-upgrades.md b/changelogs/CHANGELOG-onnx-community-upgrades.md
@@ -0,0 +1,45 @@
+# ONNX-Community Model Upgrades — Kokoro WebGPU, Moonshine STT, SmolVLM, transformers.js v4
+
+- **Kokoro TTS now runs on WebGPU** when available (with WASM fallback) — fixes the CPU-only 5–15 s/chunk bottleneck; Kokoro v1.0 does ~10 s of speech in ~1 s on GPU. The worker probes WebGPU, loads fp32 on GPU / q8 on WASM, and falls back to WASM if the GPU load fails
+- **Moonshine added as a fast English STT engine** (`onnx-community/moonshine-base-ONNX`, MIT) — a new `moonshine` worker tier in `speech-worker.js` plus a "🌙 Moonshine (EN)" option in the `{{@STT:}}` card engine selector; English-only, non-Whisper, so it skips the Whisper streamer/language path and the textagent→onnx-community org fallback
+- **SmolVLM lightweight vision models registered** (`HuggingFaceTB/SmolVLM-256M-Instruct` / `-500M-Instruct`) with a new `public/ai-worker-smolvlm.js` worker: a ~270–500 MB image-text-to-text alternative to Gemma 4 Vision (~2–4 GB) / Florence-2 for captioning & visual Q&A on low-end devices
+- **transformers.js upgraded v3.8.1 → v4.2.0** — the in-browser ML runtime for every local model worker; v4 brings a C++-rewritten WebGPU runtime, ~200 architectures, 53% smaller bundles, and 10× faster builds. Verified: Kokoro TTS, Whisper/Voxtral STT, and OCR all work against v4 (kokoro-js declares a v3 peer dep, but the APIs it uses are unchanged — confirmed by a live end-to-end TTS synthesis)
+- Added `M.speechToText.setWhisperTier('moonshine'|'tiny'|'turbo'|null)` to route the WASM worker to a specific tier (resets the worker so the next start reloads)
+
+---
+
+## Summary
+
+A batch of upgrades sourced from the `onnx-community` HuggingFace org, each filtered for browser-runnability (small, permissive, transformers.js-compatible). The headline win is WebGPU acceleration for Kokoro TTS — the audit's #1 audio bottleneck — applied to a model already shipped. Moonshine adds a fast real-time English STT engine; SmolVLM adds a tiny vision option; and the underlying transformers.js runtime is brought to v4, benefiting every local model. The v4 bump was verified with a live Kokoro synthesis (not just a green test suite) because kokoro-js only officially declares v3 support.
+
+---
+
+## 1. Kokoro TTS WebGPU Acceleration
+**Files:** `js/tts-worker.js`
+**What:** Added `detectWebGPU()` and reworked `loadKokoroManual()` to load on `device:'webgpu'` (fp32) when an adapter is available, falling back to `wasm` (q8) if WebGPU is missing or the GPU load throws. The worker reports its device in the ready message ("Kokoro TTS ready (GPU)").
+**Impact:** Multi-speaker/podcast synthesis that took 5–15 s/chunk on CPU now runs faster than real time on WebGPU devices (roughly 10 s of speech per 1 s of compute for Kokoro v1.0).
+
+## 2. Moonshine Fast-English STT
+**Files:** `js/speech-worker.js`, `js/speechToText.js`, `js/ai-docgen.js`
+**What:** Added a `moonshine` entry to the worker's `TIERS` map (onnx-community, MIT, English-only, no org fallback). The transcribe path is tier-aware: Moonshine does a plain one-shot transcription (no `WhisperTextStreamer`, no `language` option — it's not a Whisper model). Surfaced as a 4th engine in the `{{@STT:}}` card, routed via a new `setWhisperTier()` API. Builds on the low-end Whisper tier added in the prior STT PR.
+**Impact:** A fast, real-time English STT engine for users who want speed over multilingual coverage.
+
+## 3. SmolVLM Lightweight Vision
+**Files:** `js/ai-models.js`, `public/ai-worker-smolvlm.js` (new)
+**What:** Registered `smolvlm-256m` and `smolvlm-500m` model entries (`HuggingFaceTB/SmolVLM-256M-Instruct` / `-500M-Instruct`) and added a dedicated worker mirroring the gemma4 worker's message protocol (setModelId/load/generate/process/ping), using `AutoModelForImageTextToText` + `AutoProcessor` with WebGPU/WASM auto-selection and streaming tokens. SmolVLM ships as three ONNX components, so the worker passes a per-component dtype map (`embed_tokens` fp16, `vision_encoder`/`decoder_model_merged` q4) rather than a single dtype string.
+**Impact:** Image captioning and visual Q&A on devices that can't fit the multi-GB Gemma 4 Vision models.
+**Verified:** loaded the 256M model on WebGPU in-browser and captioned a test image end-to-end ("The image displays a red circle…") — confirms the vision encoder + decoder pipeline runs.
+
+## 4. transformers.js v4 Upgrade
+**Files:** `package.json`, `package-lock.json` (+ all workers importing `@huggingface/transformers`: tts/voxtral/speech/florence/docling/glm-ocr)
+**What:** Bumped `@huggingface/transformers` `^3.8.1` → `^4.2.0`. The import paths and APIs the workers use (`pipeline`, `env`, `AutoTokenizer`, `AutoProcessor`, `StyleTextToSpeech2Model`, `Tensor`, `WhisperTextStreamer`) are unchanged across the major. (The CDN-loaded Gemma 4 vision worker was already on v4.0.1.)
+**Impact:** Smaller bundle, faster builds, improved WebGPU runtime for every local model.
+
+---
+
+## Testing
+
+- Vite build: clean on v4.
+- Playwright: **434 passed** (full smoke + feature suite) on v4, plus TTS/STT/speech suites green. Updated the STT engine-selector test (3 → 4 options) for the new Moonshine entry.
+- **Live runtime verification (the important one):** drove a real Kokoro synthesis in the browser against v4 — model downloaded, loaded its 54-voice list, and `speakAsync(...)` resolved with `hasAudio: true`. Confirms kokoro-js works on v4 despite its v3 peer-dep declaration.
+- ESLint: no new errors on changed files.
diff --git a/js/ai-docgen.js b/js/ai-docgen.js
@@ -1026,7 +1026,7 @@
                 var sttLangMatch = prompt.match(/^\s*(?:@lang|Lang):\s*(.+)$/mi);
                 var sttCurrentLang = sttLangMatch ? sttLangMatch[1].trim() : 'en-US';
 
-                // Parse @engine (whisper | voxtral | webspeech)
+                // Parse @engine (whisper | voxtral | webspeech | moonshine)
                 var sttEngineMatch = prompt.match(/^\s*(?:@engine|Engine):\s*(.+)$/mi);
                 var sttEngines = M.speechToText && M.speechToText.getEngines ? M.speechToText.getEngines() : {};
                 var sttDefaultEngine = sttEngines.webGPU ? 'voxtral' : 'whisper';
@@ -1036,6 +1036,7 @@
                 var sttEngineOptions = [
                     { id: 'whisper', name: '🧠 Whisper V3 Turbo', desc: 'WASM · Offline' },
                     { id: 'voxtral', name: '🚀 Voxtral Mini 3B', desc: 'WebGPU · Offline' },
+                    { id: 'moonshine', name: '🌙 Moonshine (EN)', desc: 'WASM · Fast · English' },
                     { id: 'webspeech', name: '🌐 Web Speech API', desc: 'Browser · Online' },
                 ];
                 var sttEngineHtml = '<select class="ai-stt-engine-select" data-ai-index="' + blockIndex + '" title="STT Engine">';
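
Review note: the hunk above shows the `@engine` regex and the option list, but not how a parsed value is validated. A minimal sketch of one way to resolve it follows. `resolveSttEngine` and `STT_ENGINE_IDS` are hypothetical names, and the lowercasing and the fallback for unknown values are assumptions; only the regex and the WebGPU-based default come from the diff.

```javascript
// Hypothetical helper: resolve the engine for an {{@STT:}} card prompt.
// The id list mirrors sttEngineOptions after this PR.
const STT_ENGINE_IDS = ['whisper', 'voxtral', 'moonshine', 'webspeech'];

function resolveSttEngine(prompt, hasWebGPU) {
    // Same pattern as the diff: "@engine: x" or "Engine: x" on its own line.
    const match = prompt.match(/^\s*(?:@engine|Engine):\s*(.+)$/mi);
    // Assumption: ids are compared case-insensitively.
    const requested = match ? match[1].trim().toLowerCase() : '';
    if (STT_ENGINE_IDS.includes(requested)) return requested;
    // Same default as sttDefaultEngine: Voxtral on WebGPU, Whisper otherwise.
    return hasWebGPU ? 'voxtral' : 'whisper';
}

console.log(resolveSttEngine('@engine: moonshine\nTranscribe this', false));
console.log(resolveSttEngine('No directive here', true));
```

Falling back to the default for an unrecognised id keeps a typo in the card from leaving the selector with no selected option.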

diff --git a/js/ai-models.js b/js/ai-models.js
@@ -636,6 +636,44 @@
             requiresWebGPU: true,
         },
 
+        // ── Local: SmolVLM 256M/500M (Hugging Face) — Lightweight Vision ──
+        // Small vision-language models for image understanding on low-end devices —
+        // a far lighter alternative to Gemma 4 Vision (~2–4 GB) / Florence-2 for
+        // basic captioning & visual Q&A. image-text-to-text architecture.
+        'smolvlm-256m': {
+            label: 'SmolVLM 256M · Local',
+            badge: 'SmolVLM 256M · Local',
+            icon: 'bi bi-image',
+            statusReady: 'SmolVLM (256M) · Local',
+            dropdownName: 'SmolVLM (256M)',
+            dropdownDesc: 'Local · Lightweight Vision · ~270 MB',
+            isLocal: true,
+            category: 'local-multimodal',
+            localModelId: 'HuggingFaceTB/SmolVLM-256M-Instruct',
+            workerFile: 'ai-worker-smolvlm.js',
+            downloadSize: '~270 MB',
+            supportsVision: true,
+            architecture: 'smolvlm',
+            dtype: 'q4',
+        },
+
+        'smolvlm-500m': {
+            label: 'SmolVLM 500M · Local',
+            badge: 'SmolVLM 500M · Local',
+            icon: 'bi bi-image',
+            statusReady: 'SmolVLM (500M) · Local',
+            dropdownName: 'SmolVLM (500M)',
+            dropdownDesc: 'Local · Lightweight Vision · ~500 MB',
+            isLocal: true,
+            category: 'local-multimodal',
+            localModelId: 'HuggingFaceTB/SmolVLM-500M-Instruct',
+            workerFile: 'ai-worker-smolvlm.js',
+            downloadSize: '~500 MB',
+            supportsVision: true,
+            architecture: 'smolvlm',
+            dtype: 'q4',
+        },
+
         // ── Local: Kokoro 82M TTS (Text-to-Speech) ────────────
         'kokoro-tts': {
             label: 'Kokoro TTS · Local',
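
Review note: the registry entries above carry a single `dtype: 'q4'`, while the changelog says the worker passes a per-component dtype map. `public/ai-worker-smolvlm.js` is not part of this diff, so the following is only a sketch of what those load options might look like. `smolvlmLoadOptions` is a hypothetical helper; the component names and dtypes are the ones quoted in the changelog.

```javascript
// Hypothetical helper: build from_pretrained() options for SmolVLM.
// SmolVLM ships as three ONNX components, so dtype is a map rather than
// a single string (values taken from the changelog text).
function smolvlmLoadOptions(hasWebGPU, progressCb) {
    return {
        dtype: {
            embed_tokens: 'fp16',
            vision_encoder: 'q4',
            decoder_model_merged: 'q4',
        },
        device: hasWebGPU ? 'webgpu' : 'wasm',
        progress_callback: progressCb,
    };
}

// Assumed call site inside the worker (not shown in this diff):
//   const model = await AutoModelForImageTextToText.from_pretrained(
//       modelId, smolvlmLoadOptions(hasWebGPU, onProgress));
console.log(smolvlmLoadOptions(false, null).device);
```

Keeping the map in one place would also make it easier to reconcile with the registry's `dtype` field if the two are meant to agree.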

diff --git a/js/speech-worker.js b/js/speech-worker.js
@@ -22,11 +22,15 @@ env.remoteHost = MODEL_HOST;
 let transcriber = null;
 let activeTier = 'turbo';
 
-// Tier definitions. dtype/model id chosen per tier; both resolve under the
-// textagent org first, then fall back to onnx-community.
+// Tier definitions. dtype/model id chosen per tier.
+//   turbo/tiny → Whisper (multilingual). `id` is under textagent org first,
+//                falling back to onnx-community (orgFallback: true).
+//   moonshine  → onnx-community/moonshine-base-ONNX (MIT), a fast real-time ENGLISH
+//                model — only hosted on onnx-community, so no org fallback.
 const TIERS = {
-    turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' },
-    tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…' },
+    turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…', orgFallback: true },
+    tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…', orgFallback: true },
+    moonshine: { id: 'onnx-community/moonshine-base-ONNX', dtype: 'q8', label: 'Moonshine Base (EN)', dlMsg: '⏳ Downloading Moonshine Base (fast English, WASM)…', orgFallback: false, englishOnly: true },
 };
 
 // Decide a tier from device capability when the caller doesn't force one.
@@ -46,8 +50,8 @@ self.addEventListener('message', async (e) => {
 
     if (type === 'init') {
         try {
-            // Caller may force a tier ('tiny' | 'turbo'); otherwise probe the device.
-            activeTier = (e.data.tier === 'tiny' || e.data.tier === 'turbo') ? e.data.tier : pickTier();
+            // Caller may force a tier ('tiny' | 'turbo' | 'moonshine'); otherwise probe the device.
+            activeTier = Object.hasOwn(TIERS, e.data.tier) ? e.data.tier : pickTier();
             const tier = TIERS[activeTier];
 
             self.postMessage({ type: 'status', status: 'loading', message: tier.dlMsg });
@@ -82,14 +86,17 @@ self.addEventListener('message', async (e) => {
                 },
             };
 
-            // Try primary org (textagent), fall back to onnx-community
+            // Try primary org (textagent), fall back to onnx-community — but only
+            // for tiers that are mirrored under textagent. Moonshine lives solely
+            // on onnx-community, so skip the fallback path for it.
             try {
                 transcriber = await pipeline(
                     'automatic-speech-recognition',
                     whisperModelId,
                     pipelineOpts,
                 );
             } catch (primaryErr) {
+                if (!tier.orgFallback) throw primaryErr;
                 console.warn(`textagent model failed: ${primaryErr.message}. Falling back to onnx-community…`);
                 self.postMessage({ type: 'status', status: 'loading', message: '⚠️ Falling back to onnx-community models…' });
                 whisperModelId = whisperModelId.replace('textagent/', MODEL_ORG_FALLBACK + '/');
@@ -135,6 +142,17 @@ self.addEventListener('message', async (e) => {
                 }
             }
 
+            // Moonshine is a non-Whisper, English-only ASR model: it has its own
+            // tokenizer (incompatible with WhisperTextStreamer) and takes no
+            // `language` option. The Whisper tiers stream partials and pass language.
+            const isMoonshine = activeTier === 'moonshine';
+
+            if (isMoonshine) {
+                const result = await transcriber(normalizedAudio, { return_timestamps: false });
+                self.postMessage({ type: 'result', text: result.text });
+                return;
+            }
+
             // Use language from caller, default to 'en'
             const lang = e.data.lang || 'en';
 

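Review note: the `orgFallback` flag changes the catch path from "always retry under the fallback org" to "retry only for mirrored tiers". A standalone sketch of that rule follows, with the loader injected so it runs without a network. `loadWithOrgFallback` and `fakeLoad` are hypothetical names, and the value of `MODEL_ORG_FALLBACK` is assumed from the comments in the diff.

```javascript
// Assumption: the worker's MODEL_ORG_FALLBACK constant holds this value.
const MODEL_ORG_FALLBACK = 'onnx-community';

// `load` stands in for pipeline('automatic-speech-recognition', id, opts).
async function loadWithOrgFallback(tier, load) {
    try {
        return await load(tier.id);
    } catch (primaryErr) {
        // Tiers hosted on a single org (Moonshine) surface the original error.
        if (!tier.orgFallback) throw primaryErr;
        const fallbackId = tier.id.replace('textagent/', MODEL_ORG_FALLBACK + '/');
        return await load(fallbackId);
    }
}

// Stand-in loader: pretends every textagent-hosted model is unavailable.
async function fakeLoad(id) {
    if (id.startsWith('textagent/')) throw new Error('404 for ' + id);
    return { loadedFrom: id };
}

loadWithOrgFallback({ id: 'textagent/whisper-tiny', orgFallback: true }, fakeLoad)
    .then((m) => console.log(m.loadedFrom));
```

Without the flag, a failed Moonshine download would be retried under an id that the `replace()` call leaves unchanged, so rethrowing avoids a pointless second attempt.
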
diff --git a/js/speechToText.js b/js/speechToText.js
@@ -33,11 +33,15 @@
         try { updateEngineIndicator(); } catch (_) { /* indicator not built yet */ }
     })();
 
-    // ── Low-end device probe (for the WASM/Whisper fallback only) ──
-    // On low-RAM / few-core devices the ~800 MB Whisper-Large model is impractical,
-    // so we ask speech-worker.js to load multilingual whisper-tiny (~75 MB) instead.
-    // Returns 'tiny' | 'turbo'. WebGPU devices use Voxtral and ignore this.
+    // ── WASM-worker tier selection ──
+    // A caller (e.g. an {{@STT:}} card set to the Moonshine engine) can force a
+    // specific worker tier; otherwise we probe the device. On low-RAM / few-core
+    // devices the ~800 MB Whisper-Large model is impractical, so we load
+    // multilingual whisper-tiny (~75 MB) instead.
+    // Returns 'tiny' | 'turbo' | 'moonshine'. WebGPU devices use Voxtral and ignore this.
+    let forcedTier = null;  // 'moonshine' | 'tiny' | 'turbo' | null (auto)
     function pickWhisperTier() {
+        if (forcedTier) return forcedTier;
         const mem = navigator.deviceMemory;        // GB, Chromium-only
         const cores = navigator.hardwareConcurrency;
         if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) {
@@ -1157,6 +1161,19 @@
         }),
         /** Resolves once WebGPU detection has completed (engine choice is final). */
         ready: () => webGPUPromise.then(() => M.speechToText.getEngines()),
+        /**
+         * Force the WASM worker tier ('moonshine' | 'tiny' | 'turbo'), or null to
+         * auto-pick by device. Used by the {{@STT:}} card's engine selector to
+         * route to Moonshine (fast English). Takes effect on the next worker init;
+         * a worker already built for a different tier is terminated here.
+         */
+        setWhisperTier: (tier) => {
+            const next = (tier === 'moonshine' || tier === 'tiny' || tier === 'turbo') ? tier : null;
+            if (next === forcedTier) return;
+            forcedTier = next;
+            // Drop any worker built for the previous tier so the next start reloads.
+            if (worker) { try { worker.terminate(); } catch (_) {} worker = null; modelReady = false; modelLoading = false; }
+        },
         /** Start recording in card mode — text routes to callbacks instead of editor */
         startForCard: (onText, onInterim) => {
             // Force-stop any active session first (allows re-recording after Clear)
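
Review note: the interaction between `forcedTier`, `pickWhisperTier()` and `setWhisperTier()` can be modelled in isolation. The sketch below is not the module's code: `createTierSelector` is a hypothetical wrapper, `device` stands in for `navigator`, the `'tiny'`/`'turbo'` return values are inferred from the comment since the hunk cuts off before them, and `resets` counts every tier change whereas the real code only tears down a worker if one exists.

```javascript
const VALID_TIERS = ['moonshine', 'tiny', 'turbo'];

function createTierSelector(device) {
    let forcedTier = null; // null means "auto-pick by device"
    let resets = 0;        // how many times a worker would be dropped

    return {
        setWhisperTier(tier) {
            const next = VALID_TIERS.includes(tier) ? tier : null;
            if (next === forcedTier) return; // unchanged: keep the current worker
            forcedTier = next;
            resets += 1;
        },
        pickWhisperTier() {
            if (forcedTier) return forcedTier;
            const mem = device.deviceMemory;
            const cores = device.hardwareConcurrency;
            const lowEnd = (typeof mem === 'number' && mem <= 4)
                || (typeof cores === 'number' && cores <= 4);
            return lowEnd ? 'tiny' : 'turbo';
        },
        get resets() { return resets; },
    };
}

const selector = createTierSelector({ deviceMemory: 8, hardwareConcurrency: 8 });
console.log(selector.pickWhisperTier()); // auto-pick
selector.setWhisperTier('moonshine');
console.log(selector.pickWhisperTier()); // forced tier wins over the probe
selector.setWhisperTier('bogus');        // unknown values clear the override
console.log(selector.pickWhisperTier());
```

The early return on an unchanged tier is what keeps repeated selector events from reloading the model each time.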

diff --git a/js/tts-worker.js b/js/tts-worker.js
@@ -119,20 +119,53 @@ function splitIntoChunks(text, maxLen = 500) {
     return chunks.filter(c => c.length > 0);
 }
 
+// Track which device the loaded model is running on (for status reporting).
+let ttsDevice = 'wasm';
+
+// Probe WebGPU availability inside the worker.
+async function detectWebGPU() {
+    try {
+        if (typeof navigator === 'undefined' || !navigator.gpu) return false;
+        const adapter = await navigator.gpu.requestAdapter();
+        return !!adapter;
+    } catch (_) {
+        return false;
+    }
+}
+
 /**
  * Load model + tokenizer separately, then construct KokoroTTS directly.
  * This avoids the preprocessor_config.json fetch that fails in
  * KokoroTTS.from_pretrained() → StyleTextToSpeech2Model.from_pretrained().
+ *
+ * Runs on WebGPU when available (Kokoro v1.0 supports it — ~10s of speech in
+ * ~1s vs 5–15s/chunk on CPU/WASM), falling back to WASM if WebGPU is missing
+ * or the GPU load fails at runtime. WebGPU prefers fp32 weights; WASM uses q8.
  */
 async function loadKokoroManual(modelId, progressCb) {
-    const model = await StyleTextToSpeech2Model.from_pretrained(modelId, {
-        dtype: 'q8',
-        progress_callback: progressCb,
-    });
-    const tokenizer = await AutoTokenizer.from_pretrained(modelId, {
-        progress_callback: progressCb,
-    });
-    return new KokoroTTS(model, tokenizer);
+    const useGPU = await detectWebGPU();
+
+    async function build(device, dtype) {
+        const model = await StyleTextToSpeech2Model.from_pretrained(modelId, {
+            dtype,
+            device,
+            progress_callback: progressCb,
+        });
+        const tokenizer = await AutoTokenizer.from_pretrained(modelId, {
+            progress_callback: progressCb,
+        });
+        ttsDevice = device;
+        return new KokoroTTS(model, tokenizer);
+    }
+
+    if (useGPU) {
+        try {
+            return await build('webgpu', 'fp32');
+        } catch (gpuErr) {
+            console.warn('[TTS Worker] WebGPU load failed, falling back to WASM:', gpuErr);
+        }
+    }
+    return await build('wasm', 'q8');
 }
 
 /**
@@ -452,7 +485,8 @@ self.addEventListener('message', async (e) => {
             self.postMessage({
                 type: 'status',
                 status: 'ready',
-                message: '🔊 Kokoro TTS ready',
+                message: ttsDevice === 'webgpu' ? '🔊 Kokoro TTS ready (GPU)' : '🔊 Kokoro TTS ready',
+                device: ttsDevice,
                 voices,
             });
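
Review note: the load order in `loadKokoroManual()` (probe, try WebGPU with fp32, fall back to WASM with q8) can be exercised without a GPU or a model by injecting the probe and the builder. This is a sketch of the same control flow, not the worker's code; `loadWithGpuFallback` and `fakeBuild` are hypothetical names.

```javascript
// Try WebGPU first, then WASM. `probe` resolves to true/false; `build(device, dtype)`
// stands in for the from_pretrained() + tokenizer + KokoroTTS construction.
async function loadWithGpuFallback(probe, build) {
    let useGPU = false;
    try {
        useGPU = await probe();
    } catch (_) {
        useGPU = false; // a failing probe is treated as "no WebGPU"
    }
    if (useGPU) {
        try {
            return { device: 'webgpu', model: await build('webgpu', 'fp32') };
        } catch (gpuErr) {
            console.warn('WebGPU load failed, falling back to WASM:', gpuErr.message);
        }
    }
    return { device: 'wasm', model: await build('wasm', 'q8') };
}

// Stand-in builder: simulates a GPU load that throws at runtime.
async function fakeBuild(device, dtype) {
    if (device === 'webgpu') throw new Error('simulated adapter loss');
    return device + ':' + dtype;
}

loadWithGpuFallback(async () => true, fakeBuild)
    .then((r) => console.log(r.device, r.model)); // prints "wasm wasm:q8"
```

Returning the device alongside the model means the reported device always matches the load that succeeded, which is the same guarantee the diff gets by assigning `ttsDevice` only after `from_pretrained()` resolves.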