diff --git a/changelogs/CHANGELOG-onnx-community-upgrades.md b/changelogs/CHANGELOG-onnx-community-upgrades.md
new file mode 100644
index 0000000..c90f589
--- /dev/null
+++ b/changelogs/CHANGELOG-onnx-community-upgrades.md
@@ -0,0 +1,45 @@
+# ONNX-Community Model Upgrades — Kokoro WebGPU, Moonshine STT, SmolVLM, transformers.js v4
+
+- **Kokoro TTS now runs on WebGPU** when available (with WASM fallback) — fixes the CPU-only 5–15 s/chunk bottleneck; Kokoro v1.0 does ~10 s of speech in ~1 s on GPU. The worker probes WebGPU, loads fp32 on GPU / q8 on WASM, and falls back to WASM if the GPU load fails
+- **Moonshine added as a fast English STT engine** (`onnx-community/moonshine-base-ONNX`, MIT) — a new `moonshine` worker tier in `speech-worker.js` plus a "🌙 Moonshine (EN)" option in the `{{@STT:}}` card engine selector; English-only, non-Whisper, so it skips the Whisper streamer/language path and the textagent→onnx-community org fallback
+- **SmolVLM lightweight vision models registered** (`HuggingFaceTB/SmolVLM-256M-Instruct` and `HuggingFaceTB/SmolVLM-500M-Instruct`) with a new `public/ai-worker-smolvlm.js` worker: a ~270–500 MB image-text-to-text alternative to Gemma 4 Vision (~2–4 GB) / Florence-2 for captioning & visual Q&A on low-end devices
+- **transformers.js upgraded v3.8.1 → v4.2.0** — the in-browser ML runtime for every local model worker; v4 brings a C++-rewritten WebGPU runtime, ~200 architectures, 53% smaller bundles, and 10× faster builds. Verified: Kokoro TTS, Whisper/Voxtral STT, and OCR all work against v4 (kokoro-js declares a v3 peer dep, but the APIs it uses are unchanged — confirmed by a live end-to-end TTS synthesis)
+- Added `M.speechToText.setWhisperTier('moonshine'|'tiny'|'turbo'|null)` to route the WASM worker to a specific tier (resets the worker so the next start reloads)
+
+---
+
+## Summary
+
+A batch of upgrades sourced from the `onnx-community` HuggingFace org, each filtered for browser-runnability (small, permissive, transformers.js-compatible). The headline win is WebGPU acceleration for Kokoro TTS — the audit's #1 audio bottleneck — applied to a model already shipped. Moonshine adds a fast real-time English STT engine; SmolVLM adds a tiny vision option; and the underlying transformers.js runtime is brought to v4, benefiting every local model. The v4 bump was verified with a live Kokoro synthesis (not just a green test suite) because kokoro-js only officially declares v3 support.
+
+---
+
+## 1. Kokoro TTS WebGPU Acceleration
+**Files:** `js/tts-worker.js`
+**What:** Added `detectWebGPU()` and reworked `loadKokoroManual()` to load on `device:'webgpu'` (fp32) when an adapter is available, falling back to `wasm` (q8) if WebGPU is missing or the GPU load throws. The worker reports its device in the ready message ("Kokoro TTS ready (GPU)").
+**Impact:** Multi-speaker/podcast synthesis that took 5–15 s/chunk on CPU now runs faster than real time on WebGPU devices (~10 s of speech in ~1 s).
+
+## 2. Moonshine Fast-English STT
+**Files:** `js/speech-worker.js`, `js/speechToText.js`, `js/ai-docgen.js`
+**What:** Added a `moonshine` entry to the worker's `TIERS` map (onnx-community, MIT, English-only, no org fallback). The transcribe path is tier-aware: Moonshine does a plain one-shot transcription (no `WhisperTextStreamer`, no `language` option — it's not a Whisper model). Surfaced as a 4th engine in the `{{@STT:}}` card, routed via a new `setWhisperTier()` API. Builds on the low-end Whisper tier added in the prior STT PR.
+**Impact:** A fast, real-time English STT engine for users who want speed over multilingual coverage.
+
+## 3. SmolVLM Lightweight Vision
+**Files:** `js/ai-models.js`, `public/ai-worker-smolvlm.js` (new)
+**What:** Registered `smolvlm-256m` and `smolvlm-500m` model entries (`HuggingFaceTB/SmolVLM-256M-Instruct` / `-500M-Instruct`) and added a dedicated worker mirroring the gemma4 worker's message protocol (setModelId/load/generate/process/ping), using `AutoModelForImageTextToText` + `AutoProcessor` with WebGPU/WASM auto-selection and streaming tokens. SmolVLM ships as three ONNX components, so the worker passes a per-component dtype map (`embed_tokens` fp16, `decoder_model_merged` q4, `vision_encoder` q4 on WebGPU or fp16 on WASM) rather than a single dtype string.
+**Impact:** Image captioning and visual Q&A on devices that can't fit the multi-GB Gemma 4 Vision models.
+**Verified:** loaded the 256M model on WebGPU in-browser and captioned a test image end-to-end ("The image displays a red circle…") — confirms the vision encoder + decoder pipeline runs.
+
+## 4. transformers.js v4 Upgrade
+**Files:** `package.json`, `package-lock.json` (+ all workers importing `@huggingface/transformers`: tts/voxtral/speech/florence/docling/glm-ocr)
+**What:** Bumped `@huggingface/transformers` `^3.8.1` → `^4.2.0`. The import paths and APIs the workers use (`pipeline`, `env`, `AutoTokenizer`, `AutoProcessor`, `StyleTextToSpeech2Model`, `Tensor`, `WhisperTextStreamer`) are unchanged across the major version. (The CDN-loaded Gemma 4 vision worker was already on v4.0.1, and the new SmolVLM worker loads the same v4.0.1 CDN build.)
+**Impact:** Smaller bundle, faster builds, improved WebGPU runtime for every local model.
+
+---
+
+## Testing
+
+- Vite build: clean on v4.
+- Playwright: **434 passed** (full smoke + feature suite) on v4, plus TTS/STT/speech suites green. Updated the STT engine-selector test (3 → 4 options) for the new Moonshine entry.
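+- **Live runtime verification (the important one):** drove a real Kokoro synthesis in the browser against v4: the model downloaded, loaded its 54-voice list, and `speakAsync(...)` resolved with `hasAudio: true`. The synthesis path works end-to-end after the bump despite kokoro-js's v3 peer-dep declaration. Caveat: `package-lock.json` still installs a nested `@huggingface/transformers@3.8.1` under `node_modules/kokoro-js`, so whether kokoro-js's own imports run on v4 depends on how the bundler resolves them.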
+- ESLint: no new errors on changed files.
diff --git a/js/ai-docgen.js b/js/ai-docgen.js
index de63309..188edcf 100644
--- a/js/ai-docgen.js
+++ b/js/ai-docgen.js
@@ -1026,7 +1026,7 @@
                 var sttLangMatch = prompt.match(/^\s*(?:@lang|Lang):\s*(.+)$/mi);
                 var sttCurrentLang = sttLangMatch ? sttLangMatch[1].trim() : 'en-US';
 
-                // Parse @engine (whisper | voxtral | webspeech)
+                // Parse @engine (whisper | voxtral | webspeech | moonshine)
                 var sttEngineMatch = prompt.match(/^\s*(?:@engine|Engine):\s*(.+)$/mi);
                 var sttEngines = M.speechToText && M.speechToText.getEngines ? M.speechToText.getEngines() : {};
                 var sttDefaultEngine = sttEngines.webGPU ? 'voxtral' : 'whisper';
@@ -1036,6 +1036,7 @@
                 var sttEngineOptions = [
                     { id: 'whisper', name: '🧠 Whisper V3 Turbo', desc: 'WASM · Offline' },
                     { id: 'voxtral', name: '🚀 Voxtral Mini 3B', desc: 'WebGPU · Offline' },
+                    { id: 'moonshine', name: '🌙 Moonshine (EN)', desc: 'WASM · Fast · English' },
                     { id: 'webspeech', name: '🌐 Web Speech API', desc: 'Browser · Online' },
                 ];
                 var sttEngineHtml = '<select class="ai-stt-engine-select" data-ai-index="' + blockIndex + '" title="STT Engine">';
diff --git a/js/ai-models.js b/js/ai-models.js
index 75776e2..01e36a4 100644
--- a/js/ai-models.js
+++ b/js/ai-models.js
@@ -636,6 +636,44 @@
             requiresWebGPU: true,
         },
 
+        // ── Local: SmolVLM 256M/500M (Hugging Face) — Lightweight Vision ──
+        // Small vision-language models for image understanding on low-end devices —
+        // a far lighter alternative to Gemma 4 Vision (~2–4 GB) / Florence-2 for
+        // basic captioning & visual Q&A. image-text-to-text architecture.
+        'smolvlm-256m': {
+            label: 'SmolVLM 256M · Local',
+            badge: 'SmolVLM 256M · Local',
+            icon: 'bi bi-image',
+            statusReady: 'SmolVLM (256M) · Local',
+            dropdownName: 'SmolVLM (256M)',
+            dropdownDesc: 'Local · Lightweight Vision · ~270 MB',
+            isLocal: true,
+            category: 'local-multimodal',
+            localModelId: 'HuggingFaceTB/SmolVLM-256M-Instruct',
+            workerFile: 'ai-worker-smolvlm.js',
+            downloadSize: '~270 MB',
+            supportsVision: true,
+            architecture: 'smolvlm',
+            dtype: 'q4',
+        },
+
+        'smolvlm-500m': {
+            label: 'SmolVLM 500M · Local',
+            badge: 'SmolVLM 500M · Local',
+            icon: 'bi bi-image',
+            statusReady: 'SmolVLM (500M) · Local',
+            dropdownName: 'SmolVLM (500M)',
+            dropdownDesc: 'Local · Lightweight Vision · ~500 MB',
+            isLocal: true,
+            category: 'local-multimodal',
+            localModelId: 'HuggingFaceTB/SmolVLM-500M-Instruct',
+            workerFile: 'ai-worker-smolvlm.js',
+            downloadSize: '~500 MB',
+            supportsVision: true,
+            architecture: 'smolvlm',
+            dtype: 'q4',
+        },
+
         // ── Local: Kokoro 82M TTS (Text-to-Speech) ────────────
         'kokoro-tts': {
             label: 'Kokoro TTS · Local',
diff --git a/js/speech-worker.js b/js/speech-worker.js
index 95c6ac9..ecd090a 100644
--- a/js/speech-worker.js
+++ b/js/speech-worker.js
@@ -22,11 +22,15 @@ env.remoteHost = MODEL_HOST;
 let transcriber = null;
 let activeTier = 'turbo';
 
-// Tier definitions. dtype/model id chosen per tier; both resolve under the
-// textagent org first, then fall back to onnx-community.
+// Tier definitions. dtype/model id chosen per tier.
+//   turbo/tiny → Whisper (multilingual). `id` is under textagent org first,
+//                falling back to onnx-community (orgFallback: true).
+//   moonshine  → onnx-community/moonshine-base (MIT), a fast real-time ENGLISH
+//                model — only hosted on onnx-community, so no org fallback.
 const TIERS = {
-    turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' },
-    tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…' },
+    turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…', orgFallback: true },
+    tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…', orgFallback: true },
+    moonshine: { id: 'onnx-community/moonshine-base-ONNX', dtype: 'q8', label: 'Moonshine Base (EN)', dlMsg: '⏳ Downloading Moonshine Base (fast English, WASM)…', orgFallback: false, englishOnly: true },
 };
 
 // Decide a tier from device capability when the caller doesn't force one.
@@ -46,8 +50,8 @@ self.addEventListener('message', async (e) => {
 
     if (type === 'init') {
         try {
-            // Caller may force a tier ('tiny' | 'turbo'); otherwise probe the device.
-            activeTier = (e.data.tier === 'tiny' || e.data.tier === 'turbo') ? e.data.tier : pickTier();
+            // Caller may force a tier ('tiny' | 'turbo' | 'moonshine'); otherwise probe the device.
+            activeTier = Object.prototype.hasOwnProperty.call(TIERS, e.data.tier) ? e.data.tier : pickTier();
             const tier = TIERS[activeTier];
 
             self.postMessage({ type: 'status', status: 'loading', message: tier.dlMsg });
@@ -82,7 +86,9 @@ self.addEventListener('message', async (e) => {
                 },
             };
 
-            // Try primary org (textagent), fall back to onnx-community
+            // Try primary org (textagent), fall back to onnx-community — but only
+            // for tiers that are mirrored under textagent. Moonshine lives solely
+            // on onnx-community, so skip the fallback path for it.
             try {
                 transcriber = await pipeline(
                     'automatic-speech-recognition',
@@ -90,6 +96,7 @@ self.addEventListener('message', async (e) => {
                     pipelineOpts,
                 );
             } catch (primaryErr) {
+                if (!tier.orgFallback) throw primaryErr;
                 console.warn(`textagent model failed: ${primaryErr.message}. Falling back to onnx-community…`);
                 self.postMessage({ type: 'status', status: 'loading', message: '⚠️ Falling back to onnx-community models…' });
                 whisperModelId = whisperModelId.replace('textagent/', MODEL_ORG_FALLBACK + '/');
@@ -135,6 +142,17 @@ self.addEventListener('message', async (e) => {
                 }
             }
 
+            // Moonshine is a non-Whisper, English-only ASR model: it has its own
+            // tokenizer (incompatible with WhisperTextStreamer) and takes no
+            // `language` option. The Whisper tiers stream partials and pass language.
+            const isMoonshine = activeTier === 'moonshine';
+
+            if (isMoonshine) {
+                const result = await transcriber(normalizedAudio, { return_timestamps: false });
+                self.postMessage({ type: 'result', text: result.text });
+                return;
+            }
+
             // Use language from caller, default to 'en'
             const lang = e.data.lang || 'en';
 
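Review note on the `js/speech-worker.js` hunks above: the tier lookup and the opt-in org fallback can be exercised in isolation. The following is a minimal sketch under stated assumptions, not the worker's actual code. `resolveTier`, `loadWithOrgFallback`, and the injected `loadPipeline` callback are hypothetical names; `loadPipeline` stands in for the real `pipeline(...)` call so the logic runs without downloading a model, and the `'onnx-community'` default is assumed to match `MODEL_ORG_FALLBACK`.

```javascript
// Hypothetical tier table mirroring the shape of TIERS in the diff.
const TIERS = {
    turbo: { id: 'textagent/whisper-large-v3-turbo', orgFallback: true },
    tiny: { id: 'textagent/whisper-tiny', orgFallback: true },
    moonshine: { id: 'onnx-community/moonshine-base-ONNX', orgFallback: false },
};

// Accept only names that are own keys of the table. Inherited names such as
// 'toString' fall through to the automatic pick.
function resolveTier(requested, autoPick) {
    return Object.prototype.hasOwnProperty.call(TIERS, requested) ? requested : autoPick();
}

// Try the primary model id; retry under the fallback org only when the tier
// opts in. `loadPipeline` is an injected stand-in for the real loader.
async function loadWithOrgFallback(tierName, loadPipeline, fallbackOrg = 'onnx-community') {
    const tier = TIERS[tierName];
    try {
        return await loadPipeline(tier.id);
    } catch (primaryErr) {
        if (!tier.orgFallback) throw primaryErr;
        return await loadPipeline(tier.id.replace('textagent/', fallbackOrg + '/'));
    }
}
```

Keeping the fallback decision as data on the tier entry (`orgFallback`) means a new single-org model only needs a table entry, with no extra branch in the loader.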
diff --git a/js/speechToText.js b/js/speechToText.js
index 73c3d63..1a3f67b 100644
--- a/js/speechToText.js
+++ b/js/speechToText.js
@@ -33,11 +33,15 @@
         try { updateEngineIndicator(); } catch (_) { /* indicator not built yet */ }
     })();
 
-    // ── Low-end device probe (for the WASM/Whisper fallback only) ──
-    // On low-RAM / few-core devices the ~800 MB Whisper-Large model is impractical,
-    // so we ask speech-worker.js to load multilingual whisper-tiny (~75 MB) instead.
-    // Returns 'tiny' | 'turbo'. WebGPU devices use Voxtral and ignore this.
+    // ── WASM-worker tier selection ──
+    // A caller (e.g. an {{@STT:}} card set to the Moonshine engine) can force a
+    // specific worker tier; otherwise we probe the device. On low-RAM / few-core
+    // devices the ~800 MB Whisper-Large model is impractical, so we load
+    // multilingual whisper-tiny (~75 MB) instead.
+    // Returns 'tiny' | 'turbo' | 'moonshine'. WebGPU devices use Voxtral and ignore this.
+    let forcedTier = null;  // 'moonshine' | 'tiny' | 'turbo' | null (auto)
     function pickWhisperTier() {
+        if (forcedTier) return forcedTier;
         const mem = navigator.deviceMemory;        // GB, Chromium-only
         const cores = navigator.hardwareConcurrency;
         if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) {
@@ -1157,6 +1161,19 @@
         }),
         /** Resolves once WebGPU detection has completed (engine choice is final). */
         ready: () => webGPUPromise.then(() => M.speechToText.getEngines()),
+        /**
+         * Force the WASM worker tier ('moonshine' | 'tiny' | 'turbo'), or null to
+         * auto-pick by device. Used by the {{@STT:}} card's engine selector to
+         * route to Moonshine (fast English). Takes effect on the next worker init;
+         * a worker already running under a different tier is terminated here.
+         */
+        setWhisperTier: (tier) => {
+            const next = (tier === 'moonshine' || tier === 'tiny' || tier === 'turbo') ? tier : null;
+            if (next === forcedTier) return;
+            forcedTier = next;
+            // Drop any worker built for the previous tier so the next start reloads.
+            if (worker) { try { worker.terminate(); } catch (_) { /* worker already gone */ } worker = null; modelReady = false; modelLoading = false; }
+        },
         /** Start recording in card mode — text routes to callbacks instead of editor */
         startForCard: (onText, onInterim) => {
             // Force-stop any active session first (allows re-recording after Clear)
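Review note on the `js/speechToText.js` hunks above: the forced-tier state and the worker reset can be sketched as a small self-contained controller. This is an illustrative sketch only. `createTierController`, `attach`, and `hasWorker` are hypothetical names that do not exist in the module, and `probeDevice` stands in for the `navigator.deviceMemory` / `hardwareConcurrency` probe.

```javascript
// Hypothetical controller; the worker and the device probe are injected so the
// sketch runs outside a browser.
function createTierController(probeDevice) {
    const VALID = new Set(['moonshine', 'tiny', 'turbo']);
    let forcedTier = null; // null means "auto-pick by device"
    let worker = null;
    return {
        pickTier: () => forcedTier || probeDevice(),
        attach: (w) => { worker = w; },
        hasWorker: () => worker !== null,
        // Returns true when the tier changed (and any worker was dropped).
        setTier(tier) {
            const next = VALID.has(tier) ? tier : null;
            if (next === forcedTier) return false; // no change, keep the worker
            forcedTier = next;
            if (worker) {
                try { worker.terminate(); } catch (_) { /* already stopped */ }
                worker = null; // the next start builds a worker for the new tier
            }
            return true;
        },
    };
}
```

A caller such as the engine selector would call `setTier('moonshine')` before starting a recording and `setTier(null)` to return to the device-based pick.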
diff --git a/js/tts-worker.js b/js/tts-worker.js
index 08e08db..f8a625b 100644
--- a/js/tts-worker.js
+++ b/js/tts-worker.js
@@ -119,20 +119,53 @@ function splitIntoChunks(text, maxLen = 500) {
     return chunks.filter(c => c.length > 0);
 }
 
+// Track which device the loaded model is running on (for status reporting).
+let ttsDevice = 'wasm';
+
+// Probe WebGPU availability inside the worker.
+async function detectWebGPU() {
+    try {
+        if (typeof navigator === 'undefined' || !navigator.gpu) return false;
+        const adapter = await navigator.gpu.requestAdapter();
+        return !!adapter;
+    } catch (_) {
+        return false;
+    }
+}
+
 /**
  * Load model + tokenizer separately, then construct KokoroTTS directly.
  * This avoids the preprocessor_config.json fetch that fails in
  * KokoroTTS.from_pretrained() → StyleTextToSpeech2Model.from_pretrained().
+ *
+ * Runs on WebGPU when available (Kokoro v1.0 supports it — ~10s of speech in
+ * ~1s vs 5–15s/chunk on CPU/WASM), falling back to WASM if WebGPU is missing
+ * or the GPU load fails at runtime. WebGPU prefers fp32 weights; WASM uses q8.
  */
 async function loadKokoroManual(modelId, progressCb) {
-    const model = await StyleTextToSpeech2Model.from_pretrained(modelId, {
-        dtype: 'q8',
-        progress_callback: progressCb,
-    });
-    const tokenizer = await AutoTokenizer.from_pretrained(modelId, {
-        progress_callback: progressCb,
-    });
-    return new KokoroTTS(model, tokenizer);
+    const useGPU = await detectWebGPU();
+
+    async function build(device, dtype) {
+        const model = await StyleTextToSpeech2Model.from_pretrained(modelId, {
+            dtype,
+            device,
+            progress_callback: progressCb,
+        });
+        const tokenizer = await AutoTokenizer.from_pretrained(modelId, {
+            progress_callback: progressCb,
+        });
+        ttsDevice = device;
+        return new KokoroTTS(model, tokenizer);
+    }
+
+    if (useGPU) {
+        try {
+            return await build('webgpu', 'fp32');
+        } catch (gpuErr) {
+            console.warn('[TTS Worker] WebGPU load failed, falling back to WASM:', gpuErr);
+        }
+    }
+    return await build('wasm', 'q8');
 }
 
 /**
@@ -452,7 +485,8 @@ self.addEventListener('message', async (e) => {
             self.postMessage({
                 type: 'status',
                 status: 'ready',
-                message: '🔊 Kokoro TTS ready',
+                message: ttsDevice === 'webgpu' ? '🔊 Kokoro TTS ready (GPU)' : '🔊 Kokoro TTS ready',
+                device: ttsDevice,
                 voices,
             });
 
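Review note on the `js/tts-worker.js` hunks above: the GPU-first load with a WASM fallback follows a general "ordered attempts" pattern. The sketch below is a hedged illustration, not the worker's code. `loadWithDeviceFallback` and the injected `build(device, dtype)` callback are hypothetical, and `nav` is made injectable (an assumption for testability) so the probe can run outside a browser.

```javascript
// Probe for a WebGPU adapter. Any failure is treated as "no GPU".
async function detectWebGPU(nav = globalThis.navigator) {
    try {
        if (!nav || !nav.gpu) return false;
        return !!(await nav.gpu.requestAdapter());
    } catch (_) {
        return false;
    }
}

// Try each { device, dtype } attempt in order and return the first that loads.
// `build` is a hypothetical stand-in for the model + tokenizer construction.
async function loadWithDeviceFallback(build, nav) {
    const attempts = [];
    if (await detectWebGPU(nav)) attempts.push({ device: 'webgpu', dtype: 'fp32' });
    attempts.push({ device: 'wasm', dtype: 'q8' });
    let lastErr;
    for (const { device, dtype } of attempts) {
        try {
            return { device, model: await build(device, dtype) };
        } catch (err) {
            lastErr = err; // remember the failure and try the next backend
        }
    }
    throw lastErr;
}
```

Returning the device alongside the model is one alternative to a module-level variable such as `ttsDevice`: the ready message can read the device from the result of the load that actually succeeded.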
diff --git a/package-lock.json b/package-lock.json
index 58cc92c..38d89ae 100644
--- a/package-lock.json
+++ b/package-lock.json
@@ -10,7 +10,7 @@
       "license": "MIT",
       "dependencies": {
         "@chenglou/pretext": "^0.0.4",
-        "@huggingface/transformers": "^3.8.1",
+        "@huggingface/transformers": "^4.2.0",
         "bootstrap": "5.3.8",
         "bootstrap-icons": "1.13.1",
         "dompurify": "3.0.9",
@@ -737,34 +737,41 @@
         "node": ">=18"
       }
     },
+    "node_modules/@huggingface/tokenizers": {
+      "version": "0.1.3",
+      "resolved": "https://registry.npmjs.org/@huggingface/tokenizers/-/tokenizers-0.1.3.tgz",
+      "integrity": "sha512-8rF/RRT10u+kn7YuUbUg0OF30K8rjTc78aHpxT+qJ1uWSqxT1MHi8+9ltwYfkFYJzT/oS+qw3JVfHtNMGAdqyA==",
+      "license": "Apache-2.0"
+    },
     "node_modules/@huggingface/transformers": {
-      "version": "3.8.1",
-      "resolved": "https://registry.npmjs.org/@huggingface/transformers/-/transformers-3.8.1.tgz",
-      "integrity": "sha512-tsTk4zVjImqdqjS8/AOZg2yNLd1z9S5v+7oUPpXaasDRwEDhB+xnglK1k5cad26lL5/ZIaeREgWWy0bs9y9pPA==",
+      "version": "4.2.0",
+      "resolved": "https://registry.npmjs.org/@huggingface/transformers/-/transformers-4.2.0.tgz",
+      "integrity": "sha512-8BRCoBMH0XsWaEIamuR0LrJGAfftgHAfb2Vrffy0VKlSAE/MnUJ5/h/zTfEP3fDIft+nk7TqB8xXEyABGitBjQ==",
       "license": "Apache-2.0",
       "dependencies": {
-        "@huggingface/jinja": "^0.5.3",
-        "onnxruntime-node": "1.21.0",
-        "onnxruntime-web": "1.22.0-dev.20250409-89f8206ba4",
-        "sharp": "^0.34.1"
+        "@huggingface/jinja": "^0.5.6",
+        "@huggingface/tokenizers": "^0.1.3",
+        "onnxruntime-node": "1.24.3",
+        "onnxruntime-web": "1.26.0-dev.20260416-b7804b056c",
+        "sharp": "^0.34.5"
       }
     },
     "node_modules/@huggingface/transformers/node_modules/onnxruntime-common": {
-      "version": "1.22.0-dev.20250409-89f8206ba4",
-      "resolved": "https://registry.npmjs.org/onnxruntime-common/-/onnxruntime-common-1.22.0-dev.20250409-89f8206ba4.tgz",
-      "integrity": "sha512-vDJMkfCfb0b1A836rgHj+ORuZf4B4+cc2bASQtpeoJLueuFc5DuYwjIZUBrSvx/fO5IrLjLz+oTrB3pcGlhovQ==",
+      "version": "1.24.0-dev.20251116-b39e144322",
+      "resolved": "https://registry.npmjs.org/onnxruntime-common/-/onnxruntime-common-1.24.0-dev.20251116-b39e144322.tgz",
+      "integrity": "sha512-BOoomdHYmNRL5r4iQ4bMvsl2t0/hzVQ3OM3PHD0gxeXu1PmggqBv3puZicEUVOA3AtHHYmqZtjMj9FOfGrATTw==",
       "license": "MIT"
     },
     "node_modules/@huggingface/transformers/node_modules/onnxruntime-web": {
-      "version": "1.22.0-dev.20250409-89f8206ba4",
-      "resolved": "https://registry.npmjs.org/onnxruntime-web/-/onnxruntime-web-1.22.0-dev.20250409-89f8206ba4.tgz",
-      "integrity": "sha512-0uS76OPgH0hWCPrFKlL8kYVV7ckM7t/36HfbgoFw6Nd0CZVVbQC4PkrR8mBX8LtNUFZO25IQBqV2Hx2ho3FlbQ==",
+      "version": "1.26.0-dev.20260416-b7804b056c",
+      "resolved": "https://registry.npmjs.org/onnxruntime-web/-/onnxruntime-web-1.26.0-dev.20260416-b7804b056c.tgz",
+      "integrity": "sha512-MD6Ss4GSpQBo6zqoJzyT9LRbKYs7x/JVN23FT24EcEvlqF4VuzPOeH6X38orZPKHQDbprn7K+SBpu0/mj2CQiw==",
       "license": "MIT",
       "dependencies": {
         "flatbuffers": "^25.1.24",
         "guid-typescript": "^1.0.9",
         "long": "^5.2.3",
-        "onnxruntime-common": "1.22.0-dev.20250409-89f8206ba4",
+        "onnxruntime-common": "1.24.0-dev.20251116-b39e144322",
         "platform": "^1.3.6",
         "protobufjs": "^7.2.4"
       }
@@ -2156,6 +2163,15 @@
         "node": ">=0.8"
       }
     },
+    "node_modules/adm-zip": {
+      "version": "0.5.17",
+      "resolved": "https://registry.npmjs.org/adm-zip/-/adm-zip-0.5.17.tgz",
+      "integrity": "sha512-+Ut8d9LLqwEvHHJl1+PIHqoyDxFgVN847JTVM3Izi3xHDWPE4UtzzXysMZQs64DMcrJfBeS/uoEP4AD3HQHnQQ==",
+      "license": "MIT",
+      "engines": {
+        "node": ">=12.0"
+      }
+    },
     "node_modules/ajv": {
       "version": "6.14.0",
       "resolved": "https://registry.npmjs.org/ajv/-/ajv-6.14.0.tgz",
@@ -3949,6 +3965,61 @@
         "phonemizer": "^1.2.1"
       }
     },
+    "node_modules/kokoro-js/node_modules/@huggingface/transformers": {
+      "version": "3.8.1",
+      "resolved": "https://registry.npmjs.org/@huggingface/transformers/-/transformers-3.8.1.tgz",
+      "integrity": "sha512-tsTk4zVjImqdqjS8/AOZg2yNLd1z9S5v+7oUPpXaasDRwEDhB+xnglK1k5cad26lL5/ZIaeREgWWy0bs9y9pPA==",
+      "license": "Apache-2.0",
+      "dependencies": {
+        "@huggingface/jinja": "^0.5.3",
+        "onnxruntime-node": "1.21.0",
+        "onnxruntime-web": "1.22.0-dev.20250409-89f8206ba4",
+        "sharp": "^0.34.1"
+      }
+    },
+    "node_modules/kokoro-js/node_modules/onnxruntime-common": {
+      "version": "1.21.0",
+      "resolved": "https://registry.npmjs.org/onnxruntime-common/-/onnxruntime-common-1.21.0.tgz",
+      "integrity": "sha512-Q632iLLrtCAVOTO65dh2+mNbQir/QNTVBG3h/QdZBpns7mZ0RYbLRBgGABPbpU9351AgYy7SJf1WaeVwMrBFPQ==",
+      "license": "MIT"
+    },
+    "node_modules/kokoro-js/node_modules/onnxruntime-node": {
+      "version": "1.21.0",
+      "resolved": "https://registry.npmjs.org/onnxruntime-node/-/onnxruntime-node-1.21.0.tgz",
+      "integrity": "sha512-NeaCX6WW2L8cRCSqy3bInlo5ojjQqu2fD3D+9W5qb5irwxhEyWKXeH2vZ8W9r6VxaMPUan+4/7NDwZMtouZxEw==",
+      "hasInstallScript": true,
+      "license": "MIT",
+      "os": [
+        "win32",
+        "darwin",
+        "linux"
+      ],
+      "dependencies": {
+        "global-agent": "^3.0.0",
+        "onnxruntime-common": "1.21.0",
+        "tar": "^7.0.1"
+      }
+    },
+    "node_modules/kokoro-js/node_modules/onnxruntime-web": {
+      "version": "1.22.0-dev.20250409-89f8206ba4",
+      "resolved": "https://registry.npmjs.org/onnxruntime-web/-/onnxruntime-web-1.22.0-dev.20250409-89f8206ba4.tgz",
+      "integrity": "sha512-0uS76OPgH0hWCPrFKlL8kYVV7ckM7t/36HfbgoFw6Nd0CZVVbQC4PkrR8mBX8LtNUFZO25IQBqV2Hx2ho3FlbQ==",
+      "license": "MIT",
+      "dependencies": {
+        "flatbuffers": "^25.1.24",
+        "guid-typescript": "^1.0.9",
+        "long": "^5.2.3",
+        "onnxruntime-common": "1.22.0-dev.20250409-89f8206ba4",
+        "platform": "^1.3.6",
+        "protobufjs": "^7.2.4"
+      }
+    },
+    "node_modules/kokoro-js/node_modules/onnxruntime-web/node_modules/onnxruntime-common": {
+      "version": "1.22.0-dev.20250409-89f8206ba4",
+      "resolved": "https://registry.npmjs.org/onnxruntime-common/-/onnxruntime-common-1.22.0-dev.20250409-89f8206ba4.tgz",
+      "integrity": "sha512-vDJMkfCfb0b1A836rgHj+ORuZf4B4+cc2bASQtpeoJLueuFc5DuYwjIZUBrSvx/fO5IrLjLz+oTrB3pcGlhovQ==",
+      "license": "MIT"
+    },
     "node_modules/kolorist": {
       "version": "1.8.0",
       "resolved": "https://registry.npmjs.org/kolorist/-/kolorist-1.8.0.tgz",
@@ -4287,15 +4358,15 @@
       }
     },
     "node_modules/onnxruntime-common": {
-      "version": "1.21.0",
-      "resolved": "https://registry.npmjs.org/onnxruntime-common/-/onnxruntime-common-1.21.0.tgz",
-      "integrity": "sha512-Q632iLLrtCAVOTO65dh2+mNbQir/QNTVBG3h/QdZBpns7mZ0RYbLRBgGABPbpU9351AgYy7SJf1WaeVwMrBFPQ==",
+      "version": "1.24.3",
+      "resolved": "https://registry.npmjs.org/onnxruntime-common/-/onnxruntime-common-1.24.3.tgz",
+      "integrity": "sha512-GeuPZO6U/LBJXvwdaqHbuUmoXiEdeCjWi/EG7Y1HNnDwJYuk6WUbNXpF6luSUY8yASul3cmUlLGrCCL1ZgVXqA==",
       "license": "MIT"
     },
     "node_modules/onnxruntime-node": {
-      "version": "1.21.0",
-      "resolved": "https://registry.npmjs.org/onnxruntime-node/-/onnxruntime-node-1.21.0.tgz",
-      "integrity": "sha512-NeaCX6WW2L8cRCSqy3bInlo5ojjQqu2fD3D+9W5qb5irwxhEyWKXeH2vZ8W9r6VxaMPUan+4/7NDwZMtouZxEw==",
+      "version": "1.24.3",
+      "resolved": "https://registry.npmjs.org/onnxruntime-node/-/onnxruntime-node-1.24.3.tgz",
+      "integrity": "sha512-JH7+czbc8ALA819vlTgcV+Q214/+VjGeBHDjX81+ZCD0PCVCIFGFNtT0V4sXG/1JXypKPgScQcB3ij/hk3YnTg==",
       "hasInstallScript": true,
       "license": "MIT",
       "os": [
@@ -4304,9 +4375,9 @@
         "linux"
       ],
       "dependencies": {
+        "adm-zip": "^0.5.16",
         "global-agent": "^3.0.0",
-        "onnxruntime-common": "1.21.0",
-        "tar": "^7.0.1"
+        "onnxruntime-common": "1.24.3"
       }
     },
     "node_modules/onnxruntime-web": {
@@ -4323,12 +4394,6 @@
         "protobufjs": "^7.2.4"
       }
     },
-    "node_modules/onnxruntime-web/node_modules/onnxruntime-common": {
-      "version": "1.24.3",
-      "resolved": "https://registry.npmjs.org/onnxruntime-common/-/onnxruntime-common-1.24.3.tgz",
-      "integrity": "sha512-GeuPZO6U/LBJXvwdaqHbuUmoXiEdeCjWi/EG7Y1HNnDwJYuk6WUbNXpF6luSUY8yASul3cmUlLGrCCL1ZgVXqA==",
-      "license": "MIT"
-    },
     "node_modules/option": {
       "version": "0.2.4",
       "resolved": "https://registry.npmjs.org/option/-/option-0.2.4.tgz",
@@ -5001,9 +5066,9 @@
       }
     },
     "node_modules/tar": {
-      "version": "7.5.11",
-      "resolved": "https://registry.npmjs.org/tar/-/tar-7.5.11.tgz",
-      "integrity": "sha512-ChjMH33/KetonMTAtpYdgUFr0tbz69Fp2v7zWxQfYZX4g5ZN2nOBXm1R2xyA+lMIKrLKIoKAwFj93jE/avX9cQ==",
+      "version": "7.5.16",
+      "resolved": "https://registry.npmjs.org/tar/-/tar-7.5.16.tgz",
+      "integrity": "sha512-56adEpPMouktRlBLXiaYFFzZ/3+JXa8P9n7WbR+ibIjtviN55mEaOkiysCnPnWm+7kkui1Dn8J9l+g6zV8731w==",
       "license": "BlueOak-1.0.0",
       "dependencies": {
         "@isaacs/fs-minipass": "^4.0.0",
diff --git a/package.json b/package.json
index 04c810a..60a68c5 100644
--- a/package.json
+++ b/package.json
@@ -33,7 +33,7 @@
   "license": "MIT",
   "dependencies": {
     "@chenglou/pretext": "^0.0.4",
-    "@huggingface/transformers": "^3.8.1",
+    "@huggingface/transformers": "^4.2.0",
     "bootstrap": "5.3.8",
     "bootstrap-icons": "1.13.1",
     "dompurify": "3.0.9",
diff --git a/public/ai-worker-smolvlm.js b/public/ai-worker-smolvlm.js
new file mode 100644
index 0000000..66fef18
--- /dev/null
+++ b/public/ai-worker-smolvlm.js
@@ -0,0 +1,170 @@
+// ============================================
+// ai-worker-smolvlm.js — SmolVLM (256M / 500M) Lightweight Vision Worker
+// Models: HuggingFaceTB/SmolVLM-256M-Instruct, HuggingFaceTB/SmolVLM-500M-Instruct
+// Supports: text + image (image-text-to-text). A far lighter alternative to
+// Gemma 4 Vision (~2–4 GB) and Florence-2 for captioning & visual Q&A.
+//
+// Mirrors the message protocol of ai-worker-gemma4.js (setModelId / load /
+// generate / process / ping) so it drops into the same model-loading pipeline.
+// ============================================
+
+// SmolVLM (Idefics3 architecture) is supported in transformers.js v4.
+const TRANSFORMERS_URL = "https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.1";
+
+let MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct";
+let MODEL_LABEL = "SmolVLM 256M";
+
+// Dynamically imported from transformers.js
+let AutoProcessor, AutoModelForImageTextToText, load_image, TextStreamer;
+
+let model = null;
+let processor = null;
+
+self.postMessage({ type: "status", message: `[SmolVLM] Loading transformers.js ${TRANSFORMERS_URL.split('@').pop()}...` });
+
+function makeProgressCb(label) {
+    return (progress) => {
+        if (progress.status === "progress") {
+            self.postMessage({
+                type: "progress",
+                file: progress.file || label,
+                loaded: progress.loaded || 0,
+                total: progress.total || 0,
+                progress: progress.progress || 0,
+                source: MODEL_ID,
+            });
+        } else if (progress.status === "initiate") {
+            self.postMessage({ type: "status", message: `Loading ${progress.file || label}...`, source: MODEL_ID, loadingPhase: "initiate" });
+        } else if (progress.status === "done") {
+            self.postMessage({ type: "status", message: `Loaded ${progress.file || label} ✓`, source: MODEL_ID, loadingPhase: "done" });
+        }
+    };
+}
+
+async function loadModel() {
+    try {
+        self.postMessage({ type: "status", message: `Initializing ${MODEL_LABEL}...` });
+
+        const transformers = await import(TRANSFORMERS_URL);
+        AutoProcessor = transformers.AutoProcessor;
+        AutoModelForImageTextToText = transformers.AutoModelForImageTextToText;
+        load_image = transformers.load_image;
+        TextStreamer = transformers.TextStreamer;
+
+        // WebGPU when available, WASM otherwise.
+        let device = "wasm";
+        if (typeof navigator !== "undefined" && navigator.gpu) {
+            try {
+                const adapter = await navigator.gpu.requestAdapter();
+                if (adapter) device = "webgpu";
+            } catch (_) { /* keep wasm */ }
+        }
+        self.postMessage({ type: "status", message: `Using ${device.toUpperCase()} backend...` });
+
+        self.postMessage({ type: "status", message: `Loading ${MODEL_LABEL} processor...` });
+        processor = await AutoProcessor.from_pretrained(MODEL_ID, {
+            progress_callback: makeProgressCb("processor"),
+        });
+
+        self.postMessage({ type: "status", message: `Loading ${MODEL_LABEL} model (${device.toUpperCase()})...` });
+        // SmolVLM ships as three ONNX components; transformers.js needs a per-component
+        // dtype map (a single string mis-resolves the merged decoder). This mirrors the
+        // official transformers.js SmolVLM WebGPU example.
+        model = await AutoModelForImageTextToText.from_pretrained(MODEL_ID, {
+            dtype: {
+                embed_tokens: "fp16",
+                vision_encoder: device === "webgpu" ? "q4" : "fp16",
+                decoder_model_merged: "q4",
+            },
+            device: device,
+            progress_callback: makeProgressCb("model"),
+        });
+
+        self.postMessage({ type: "loaded", device: device });
+    } catch (error) {
+        self.postMessage({ type: "error", message: `Failed to load ${MODEL_LABEL}: ${error.message}` });
+    }
+}
+
+async function generate({ userPrompt, prompt, attachments = [], context, messageId, options = {} }) {
+    const userText = userPrompt || prompt || context || "Describe this image.";
+
+    if (!model || !processor) {
+        self.postMessage({ type: "error", message: "SmolVLM not loaded yet. Please wait for the model to finish loading.", messageId });
+        return;
+    }
+
+    try {
+        self.postMessage({ type: "status", message: "Processing...", messageId });
+
+        // Build the user turn: one image token (only the first image is used), then text (Idefics3 order).
+        const userContent = [];
+        let loadedImage = null;
+        const firstImage = (attachments || []).find(a => a.type === "image" && a.data);
+        if (firstImage) {
+            let url = firstImage.data;
+            if (!url.startsWith("data:") && !url.startsWith("http")) {
+                url = `data:${firstImage.mimeType || "image/png"};base64,${url}`;
+            }
+            userContent.push({ type: "image" });
+            loadedImage = await load_image(url);
+        }
+        userContent.push({ type: "text", text: userText });
+
+        const messages = [{ role: "user", content: userContent }];
+
+        const formattedPrompt = processor.apply_chat_template(messages, { add_generation_prompt: true });
+        const inputs = await processor(formattedPrompt, loadedImage || undefined, { add_special_tokens: false });
+
+        let fullText = "";
+        const streamer = new TextStreamer(processor.tokenizer, {
+            skip_prompt: true,
+            skip_special_tokens: true,
+            callback_function: (token) => {
+                fullText += token;
+                self.postMessage({ type: "token", token, messageId });
+            },
+        });
+
+        await model.generate({
+            ...inputs,
+            max_new_tokens: options.maxTokens || 1024,
+            do_sample: false,
+            streamer,
+        });
+
+        self.postMessage({ type: "complete", text: fullText.trim(), messageId });
+    } catch (error) {
+        self.postMessage({ type: "error", message: `SmolVLM generation failed: ${error.message}`, messageId });
+    }
+}
+
+self.addEventListener("message", async (event) => {
+    const { type } = event.data;
+    switch (type) {
+        case "setModelId":
+            MODEL_ID = event.data.modelId || MODEL_ID;
+            MODEL_LABEL = event.data.modelLabel || MODEL_LABEL;
+            break;
+        case "load":
+            await loadModel();
+            break;
+        case "generate":
+            await generate(event.data);
+            break;
+        case "process":
+            await generate({
+                prompt: event.data.prompt || event.data.task,
+                attachments: event.data.attachments || [],
+                context: event.data.context,
+                messageId: event.data.messageId,
+                options: event.data.options || {},
+            });
+            break;
+        case "ping":
+            self.postMessage({ type: "pong" });
+            break;
+        default:
+            console.warn("SmolVLM worker — unknown message type:", type);
+    }
+});
diff --git a/tests/feature/stt-tag.spec.js b/tests/feature/stt-tag.spec.js
index bfef796..8468990 100644
--- a/tests/feature/stt-tag.spec.js
+++ b/tests/feature/stt-tag.spec.js
@@ -73,7 +73,7 @@ test.describe('STT Tag Block', () => {
         expect(card).toBe(true);
     });
 
-    test('STT card has engine selector with 3 options', async ({ page }) => {
+    test('STT card has engine selector with 4 options', async ({ page }) => {
         await page.locator('#markdown-editor').fill('{{@STT:\n  @lang: en-US\n}}');
         await page.waitForTimeout(2000);
 
@@ -83,10 +83,10 @@ test.describe('STT Tag Block', () => {
             const select = preview.querySelector('.ai-stt-engine-select');
             return select ? select.querySelectorAll('option').length : 0;
         });
-        expect(optionCount).toBe(3);
+        expect(optionCount).toBe(4);
     });
 
-    test('STT card engine options include whisper, voxtral, webspeech', async ({ page }) => {
+    test('STT card engine options include whisper, voxtral, moonshine, webspeech', async ({ page }) => {
         await page.locator('#markdown-editor').fill('{{@STT:\n  @lang: en-US\n}}');
         await page.waitForTimeout(2000);
 
@@ -99,6 +99,7 @@ test.describe('STT Tag Block', () => {
         });
         expect(values).toContain('whisper');
         expect(values).toContain('voxtral');
+        expect(values).toContain('moonshine');
         expect(values).toContain('webspeech');
     });