Merged
45 changes: 45 additions & 0 deletions changelogs/CHANGELOG-onnx-community-upgrades.md
@@ -0,0 +1,45 @@
# ONNX-Community Model Upgrades — Kokoro WebGPU, Moonshine STT, SmolVLM, transformers.js v4

- **Kokoro TTS now runs on WebGPU** when available (with WASM fallback) — fixes the CPU-only 5–15 s/chunk bottleneck; Kokoro v1.0 does ~10 s of speech in ~1 s on GPU. The worker probes WebGPU, loads fp32 on GPU / q8 on WASM, and falls back to WASM if the GPU load fails
- **Moonshine added as a fast English STT engine** (`onnx-community/moonshine-base-ONNX`, MIT) — a new `moonshine` worker tier in `speech-worker.js` plus a "🌙 Moonshine (EN)" option in the `{{@STT:}}` card engine selector; English-only, non-Whisper, so it skips the Whisper streamer/language path and the textagent→onnx-community org fallback
- **SmolVLM lightweight vision models registered** (`HuggingFaceTB/SmolVLM-256M-Instruct` / `-500M-Instruct`, the IDs used in `js/ai-models.js`) with a new `public/ai-worker-smolvlm.js` worker: a ~270–500 MB image-text-to-text alternative to Gemma 4 Vision (~2–4 GB) / Florence-2 for captioning & visual Q&A on low-end devices
- **transformers.js upgraded v3.8.1 → v4.2.0** — the in-browser ML runtime for every local model worker; v4 brings a C++-rewritten WebGPU runtime, ~200 architectures, 53% smaller bundles, and 10× faster builds. Verified: Kokoro TTS, Whisper/Voxtral STT, and OCR all work against v4 (kokoro-js declares a v3 peer dep, but the APIs it uses are unchanged — confirmed by a live end-to-end TTS synthesis)
- Added `M.speechToText.setWhisperTier('moonshine'|'tiny'|'turbo'|null)` to route the WASM worker to a specific tier (resets the worker so the next start reloads)

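The tier-forcing behaviour in the last bullet can be sketched in isolation. This is a minimal illustration, not the project's code: `createTierController` and `resetWorker` are hypothetical names, and the real API lives on `M.speechToText`. It shows the two properties the bullet describes: unknown values fall back to `null` (auto), and the worker is only reset when the tier actually changes.

```javascript
// Sketch only: `createTierController` and `resetWorker` are hypothetical
// names used for illustration, not part of the project's API.
const VALID_TIERS = ['moonshine', 'tiny', 'turbo'];

function createTierController(resetWorker) {
  let forcedTier = null; // null means "auto-pick by device"
  return {
    get: () => forcedTier,
    set(tier) {
      // Anything outside the known tiers is treated as "auto".
      const next = VALID_TIERS.includes(tier) ? tier : null;
      if (next === forcedTier) return false; // no change, keep the worker
      forcedTier = next;
      resetWorker(); // drop the old worker so the next start reloads
      return true;
    },
  };
}

// Usage: count how often the worker would be reset.
let resets = 0;
const tiers = createTierController(() => { resets += 1; });
tiers.set('moonshine'); // tier changes, worker is reset once
tiers.set('moonshine'); // same tier, no reset
console.log(tiers.get(), resets); // prints: moonshine 1
```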
---

## Summary

A batch of upgrades sourced from the `onnx-community` Hugging Face org, each filtered for browser-runnability (small, permissive, transformers.js-compatible). The headline win is WebGPU acceleration for Kokoro TTS (the audit's #1 audio bottleneck), applied to a model already shipped. Moonshine adds a fast real-time English STT engine; SmolVLM adds a tiny vision option; and the underlying transformers.js runtime is brought to v4, benefiting every local model. The v4 bump was verified with a live Kokoro synthesis (not just a green test suite) because kokoro-js only officially declares v3 support.

---

## 1. Kokoro TTS WebGPU Acceleration
**Files:** `js/tts-worker.js`
**What:** Added `detectWebGPU()` and reworked `loadKokoroManual()` to load on `device:'webgpu'` (fp32) when an adapter is available, falling back to `wasm` (q8) if WebGPU is missing or the GPU load throws. The worker reports its device in the ready message ("Kokoro TTS ready (GPU)").
**Impact:** Multi-speaker/podcast synthesis that took 5–15 s/chunk on CPU now runs faster than real time on WebGPU devices (Kokoro v1.0 produces ~10 s of speech in ~1 s on GPU).

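The probe-then-fall-back flow described above can be sketched without any model code. This is a simplified illustration under stated assumptions: `loadWithFallback` and the injected `loadModel(device, dtype)` callback are hypothetical stand-ins for the real loader, and `nav` is passed in so the logic can be exercised outside a browser.

```javascript
// Sketch only: `loadWithFallback` and `loadModel` are hypothetical names;
// `loadModel(device, dtype)` stands in for the real model constructor.
async function detectWebGPU(nav = globalThis.navigator) {
  try {
    if (!nav || !nav.gpu) return false;
    const adapter = await nav.gpu.requestAdapter();
    return !!adapter; // requestAdapter() resolves to null when unsupported
  } catch (_) {
    return false;
  }
}

async function loadWithFallback(loadModel, nav = globalThis.navigator) {
  if (await detectWebGPU(nav)) {
    try {
      // Preferred path: full-precision weights on the GPU.
      return { device: 'webgpu', model: await loadModel('webgpu', 'fp32') };
    } catch (gpuErr) {
      console.warn('WebGPU load failed, falling back to WASM:', gpuErr);
    }
  }
  // Either no adapter was found or the GPU load threw: quantized WASM path.
  return { device: 'wasm', model: await loadModel('wasm', 'q8') };
}
```

The returned `device` field plays the role of the worker's status report: the caller can tell which path was taken without inspecting the model.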
## 2. Moonshine Fast-English STT
**Files:** `js/speech-worker.js`, `js/speechToText.js`, `js/ai-docgen.js`
**What:** Added a `moonshine` entry to the worker's `TIERS` map (onnx-community, MIT, English-only, no org fallback). The transcribe path is tier-aware: Moonshine does a plain one-shot transcription (no `WhisperTextStreamer`, no `language` option — it's not a Whisper model). Surfaced as a 4th engine in the `{{@STT:}}` card, routed via a new `setWhisperTier()` API. Builds on the low-end Whisper tier added in the prior STT PR.
**Impact:** A fast, real-time English STT engine for users who want speed over multilingual coverage.

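A rough sketch of the tier-aware behaviour described above, assuming a trimmed-down `TIERS` map. `loadTier`, `transcribeOptions`, and the injected `loadPipeline(id, dtype)` callback are hypothetical names for illustration, and the Whisper options are reduced to a single `language` field.

```javascript
// Sketch only: tier fields follow this changelog; `loadTier`,
// `transcribeOptions`, and `loadPipeline` are hypothetical helpers.
const TIERS = {
  turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', orgFallback: true },
  tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', orgFallback: true },
  moonshine: { id: 'onnx-community/moonshine-base-ONNX', dtype: 'q8', orgFallback: false, englishOnly: true },
};

async function loadTier(name, loadPipeline) {
  const tier = TIERS[name];
  if (!tier) throw new Error(`Unknown tier: ${name}`);
  try {
    return await loadPipeline(tier.id, tier.dtype);
  } catch (primaryErr) {
    // Moonshine is not mirrored under textagent, so there is nothing to retry.
    if (!tier.orgFallback) throw primaryErr;
    const fallbackId = tier.id.replace('textagent/', 'onnx-community/');
    return await loadPipeline(fallbackId, tier.dtype);
  }
}

function transcribeOptions(name, lang) {
  // English-only tiers take no language option; Whisper tiers do.
  if (TIERS[name].englishOnly) return { return_timestamps: false };
  return { language: lang || 'en' };
}
```

Keeping `orgFallback` and `englishOnly` as data on the tier, rather than checking the tier name in each code path, means a future non-Whisper tier only needs a new map entry.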
## 3. SmolVLM Lightweight Vision
**Files:** `js/ai-models.js`, `public/ai-worker-smolvlm.js` (new)
**What:** Registered `smolvlm-256m` and `smolvlm-500m` model entries (`HuggingFaceTB/SmolVLM-256M-Instruct` / `-500M-Instruct`) and added a dedicated worker mirroring the gemma4 worker's message protocol (setModelId/load/generate/process/ping), using `AutoModelForImageTextToText` + `AutoProcessor` with WebGPU/WASM auto-selection and streaming tokens. SmolVLM ships as three ONNX components, so the worker passes a per-component dtype map (`embed_tokens` fp16, `vision_encoder`/`decoder_model_merged` q4) rather than a single dtype string.
**Impact:** Image captioning and visual Q&A on devices that can't fit the multi-GB Gemma 4 Vision models.
**Verified:** loaded the 256M model on WebGPU in-browser and captioned a test image end-to-end ("The image displays a red circle…") — confirms the vision encoder + decoder pipeline runs.

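The per-component dtype map mentioned above might look like the following. This is a sketch, not the worker's actual code: `smolvlmLoadOptions` is a hypothetical helper, and the resulting object is the kind of options value one would pass as the second argument to `AutoModelForImageTextToText.from_pretrained(modelId, options)`.

```javascript
// Sketch only: `smolvlmLoadOptions` is a hypothetical helper name.
// The dtype keys match the three ONNX components named in this section.
function smolvlmLoadOptions(hasWebGPU) {
  return {
    device: hasWebGPU ? 'webgpu' : 'wasm',
    dtype: {
      embed_tokens: 'fp16',       // token embeddings stay at half precision
      vision_encoder: 'q4',       // image encoder, 4-bit quantized
      decoder_model_merged: 'q4', // text decoder, 4-bit quantized
    },
  };
}

// Usage: inspect the options chosen for a device without WebGPU.
console.log(smolvlmLoadOptions(false).device); // prints: wasm
```

A single dtype string would apply the same precision to all three components; the map lets the small embedding table keep more precision while the two larger components are quantized.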
## 4. transformers.js v4 Upgrade
**Files:** `package.json`, `package-lock.json` (+ all workers importing `@huggingface/transformers`: tts/voxtral/speech/florence/docling/glm-ocr)
**What:** Bumped `@huggingface/transformers` `^3.8.1` → `^4.2.0`. The import paths and APIs the workers use (`pipeline`, `env`, `AutoTokenizer`, `AutoProcessor`, `StyleTextToSpeech2Model`, `Tensor`, `WhisperTextStreamer`) are unchanged across the major. (The CDN-loaded Gemma 4 vision worker was already on v4.0.1.)
**Impact:** Smaller bundle, faster builds, improved WebGPU runtime for every local model.

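One cheap complement to the live verification is a check that the names the workers import still exist after a major bump. The sketch below is an assumption about how such a check could be written, not something this PR ships: `missingExports` is a hypothetical helper, and in a real check `mod` would be the result of `await import('@huggingface/transformers')`. It only confirms the names are present, not that their behaviour is unchanged.

```javascript
// Sketch only: `missingExports` is a hypothetical helper. The list mirrors
// the names this section says the workers import.
const REQUIRED_EXPORTS = [
  'pipeline', 'env', 'AutoTokenizer', 'AutoProcessor',
  'StyleTextToSpeech2Model', 'Tensor', 'WhisperTextStreamer',
];

function missingExports(mod, names = REQUIRED_EXPORTS) {
  // A name counts as missing when it is absent or null/undefined.
  return names.filter((name) => mod[name] == null);
}

// Usage with a stand-in module object that only provides two of the names.
const fakeModule = { pipeline() {}, env: {} };
console.log(missingExports(fakeModule).length); // prints: 5
```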
---

## Testing

- Vite build: clean on v4.
- Playwright: **434 passed** (full smoke + feature suite) on v4, plus TTS/STT/speech suites green. Updated the STT engine-selector test (3 → 4 options) for the new Moonshine entry.
- **Live runtime verification (the important one):** drove a real Kokoro synthesis in the browser against v4 — model downloaded, loaded its 54-voice list, and `speakAsync(...)` resolved with `hasAudio: true`. Confirms kokoro-js works on v4 despite its v3 peer-dep declaration.
- ESLint: no new errors on changed files.
3 changes: 2 additions & 1 deletion js/ai-docgen.js
@@ -1026,7 +1026,7 @@
var sttLangMatch = prompt.match(/^\s*(?:@lang|Lang):\s*(.+)$/mi);
var sttCurrentLang = sttLangMatch ? sttLangMatch[1].trim() : 'en-US';

-// Parse @engine (whisper | voxtral | webspeech)
+// Parse @engine (whisper | voxtral | webspeech | moonshine)
var sttEngineMatch = prompt.match(/^\s*(?:@engine|Engine):\s*(.+)$/mi);
var sttEngines = M.speechToText && M.speechToText.getEngines ? M.speechToText.getEngines() : {};
var sttDefaultEngine = sttEngines.webGPU ? 'voxtral' : 'whisper';
@@ -1036,6 +1036,7 @@
var sttEngineOptions = [
{ id: 'whisper', name: '🧠 Whisper V3 Turbo', desc: 'WASM · Offline' },
{ id: 'voxtral', name: '🚀 Voxtral Mini 3B', desc: 'WebGPU · Offline' },
+{ id: 'moonshine', name: '🌙 Moonshine (EN)', desc: 'WASM · Fast · English' },
{ id: 'webspeech', name: '🌐 Web Speech API', desc: 'Browser · Online' },
];
var sttEngineHtml = '<select class="ai-stt-engine-select" data-ai-index="' + blockIndex + '" title="STT Engine">';
38 changes: 38 additions & 0 deletions js/ai-models.js
@@ -636,6 +636,44 @@
requiresWebGPU: true,
},

+// ── Local: SmolVLM 256M/500M (Hugging Face) — Lightweight Vision ──
+// Small vision-language models for image understanding on low-end devices —
+// a far lighter alternative to Gemma 4 Vision (~2–4 GB) / Florence-2 for
+// basic captioning & visual Q&A. image-text-to-text architecture.
+'smolvlm-256m': {
+label: 'SmolVLM 256M · Local',
+badge: 'SmolVLM 256M · Local',
+icon: 'bi bi-image',
+statusReady: 'SmolVLM (256M) · Local',
+dropdownName: 'SmolVLM (256M)',
+dropdownDesc: 'Local · Lightweight Vision · ~270 MB',
+isLocal: true,
+category: 'local-multimodal',
+localModelId: 'HuggingFaceTB/SmolVLM-256M-Instruct',
+workerFile: 'ai-worker-smolvlm.js',
+downloadSize: '~270 MB',
+supportsVision: true,
+architecture: 'smolvlm',
+dtype: 'q4',
+},
+
+'smolvlm-500m': {
+label: 'SmolVLM 500M · Local',
+badge: 'SmolVLM 500M · Local',
+icon: 'bi bi-image',
+statusReady: 'SmolVLM (500M) · Local',
+dropdownName: 'SmolVLM (500M)',
+dropdownDesc: 'Local · Lightweight Vision · ~500 MB',
+isLocal: true,
+category: 'local-multimodal',
+localModelId: 'HuggingFaceTB/SmolVLM-500M-Instruct',
+workerFile: 'ai-worker-smolvlm.js',
+downloadSize: '~500 MB',
+supportsVision: true,
+architecture: 'smolvlm',
+dtype: 'q4',
+},
+
// ── Local: Kokoro 82M TTS (Text-to-Speech) ────────────
'kokoro-tts': {
label: 'Kokoro TTS · Local',
32 changes: 25 additions & 7 deletions js/speech-worker.js
@@ -22,11 +22,15 @@ env.remoteHost = MODEL_HOST;
let transcriber = null;
let activeTier = 'turbo';

-// Tier definitions. dtype/model id chosen per tier; both resolve under the
-// textagent org first, then fall back to onnx-community.
+// Tier definitions. dtype/model id chosen per tier.
+// turbo/tiny → Whisper (multilingual). `id` is under textagent org first,
+// falling back to onnx-community (orgFallback: true).
+// moonshine → onnx-community/moonshine-base (MIT), a fast real-time ENGLISH
+// model — only hosted on onnx-community, so no org fallback.
const TIERS = {
-turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' },
-tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…' },
+turbo: { id: 'textagent/whisper-large-v3-turbo', dtype: 'q8', label: 'Whisper V3 Turbo', dlMsg: '⏳ Downloading Whisper Large V3 Turbo (WASM)…', orgFallback: true },
+tiny: { id: 'textagent/whisper-tiny', dtype: 'q4', label: 'Whisper Tiny', dlMsg: '⏳ Downloading Whisper Tiny (low-end, WASM)…', orgFallback: true },
+moonshine: { id: 'onnx-community/moonshine-base-ONNX', dtype: 'q8', label: 'Moonshine Base (EN)', dlMsg: '⏳ Downloading Moonshine Base (fast English, WASM)…', orgFallback: false, englishOnly: true },
};

// Decide a tier from device capability when the caller doesn't force one.
@@ -46,8 +50,8 @@ self.addEventListener('message', async (e) => {

if (type === 'init') {
try {
-// Caller may force a tier ('tiny' | 'turbo'); otherwise probe the device.
-activeTier = (e.data.tier === 'tiny' || e.data.tier === 'turbo') ? e.data.tier : pickTier();
+// Caller may force a tier ('tiny' | 'turbo' | 'moonshine'); otherwise probe the device.
+activeTier = TIERS[e.data.tier] ? e.data.tier : pickTier();
const tier = TIERS[activeTier];

self.postMessage({ type: 'status', status: 'loading', message: tier.dlMsg });
@@ -82,14 +86,17 @@ self.addEventListener('message', async (e) => {
},
};

-// Try primary org (textagent), fall back to onnx-community
+// Try primary org (textagent), fall back to onnx-community — but only
+// for tiers that are mirrored under textagent. Moonshine lives solely
+// on onnx-community, so skip the fallback path for it.
try {
transcriber = await pipeline(
'automatic-speech-recognition',
whisperModelId,
pipelineOpts,
);
} catch (primaryErr) {
+if (!tier.orgFallback) throw primaryErr;
console.warn(`textagent model failed: ${primaryErr.message}. Falling back to onnx-community…`);
self.postMessage({ type: 'status', status: 'loading', message: '⚠️ Falling back to onnx-community models…' });
whisperModelId = whisperModelId.replace('textagent/', MODEL_ORG_FALLBACK + '/');
@@ -135,6 +142,17 @@ self.addEventListener('message', async (e) => {
}
}

+// Moonshine is a non-Whisper, English-only ASR model: it has its own
+// tokenizer (incompatible with WhisperTextStreamer) and takes no
+// `language` option. The Whisper tiers stream partials and pass language.
+const isMoonshine = activeTier === 'moonshine';
+
+if (isMoonshine) {
+const result = await transcriber(normalizedAudio, { return_timestamps: false });
+self.postMessage({ type: 'result', text: result.text });
+return;
+}
+
// Use language from caller, default to 'en'
const lang = e.data.lang || 'en';

25 changes: 21 additions & 4 deletions js/speechToText.js
@@ -33,11 +33,15 @@
try { updateEngineIndicator(); } catch (_) { /* indicator not built yet */ }
})();

-// ── Low-end device probe (for the WASM/Whisper fallback only) ──
-// On low-RAM / few-core devices the ~800 MB Whisper-Large model is impractical,
-// so we ask speech-worker.js to load multilingual whisper-tiny (~75 MB) instead.
-// Returns 'tiny' | 'turbo'. WebGPU devices use Voxtral and ignore this.
+// ── WASM-worker tier selection ──
+// A caller (e.g. an {{@STT:}} card set to the Moonshine engine) can force a
+// specific worker tier; otherwise we probe the device. On low-RAM / few-core
+// devices the ~800 MB Whisper-Large model is impractical, so we load
+// multilingual whisper-tiny (~75 MB) instead.
+// Returns 'tiny' | 'turbo' | 'moonshine'. WebGPU devices use Voxtral and ignore this.
+let forcedTier = null; // 'moonshine' | 'tiny' | 'turbo' | null (auto)
function pickWhisperTier() {
+if (forcedTier) return forcedTier;
const mem = navigator.deviceMemory; // GB, Chromium-only
const cores = navigator.hardwareConcurrency;
if ((typeof mem === 'number' && mem <= 4) || (typeof cores === 'number' && cores <= 4)) {
@@ -1157,6 +1161,19 @@
}),
/** Resolves once WebGPU detection has completed (engine choice is final). */
ready: () => webGPUPromise.then(() => M.speechToText.getEngines()),
+/**
+* Force the WASM worker tier ('moonshine' | 'tiny' | 'turbo'), or null to
+* auto-pick by device. Used by the {{@STT:}} card's engine selector to
+* route to Moonshine (fast English). Takes effect on the next worker init,
+* so reset the worker if one is already running with a different tier.
+*/
+setWhisperTier: (tier) => {
+const next = (tier === 'moonshine' || tier === 'tiny' || tier === 'turbo') ? tier : null;
+if (next === forcedTier) return;
+forcedTier = next;
+// Drop any worker built for the previous tier so the next start reloads.
+if (worker) { try { worker.terminate(); } catch (_) {} worker = null; modelReady = false; modelLoading = false; }
+},
/** Start recording in card mode — text routes to callbacks instead of editor */
startForCard: (onText, onInterim) => {
// Force-stop any active session first (allows re-recording after Clear)
52 changes: 43 additions & 9 deletions js/tts-worker.js
@@ -119,20 +119,53 @@ function splitIntoChunks(text, maxLen = 500) {
return chunks.filter(c => c.length > 0);
}

+// Track which device the loaded model is running on (for status reporting).
+let ttsDevice = 'wasm';
+
+// Probe WebGPU availability inside the worker.
+async function detectWebGPU() {
+try {
+if (typeof navigator === 'undefined' || !navigator.gpu) return false;
+const adapter = await navigator.gpu.requestAdapter();
+return !!adapter;
+} catch (_) {
+return false;
+}
+}
+
/**
* Load model + tokenizer separately, then construct KokoroTTS directly.
* This avoids the preprocessor_config.json fetch that fails in
* KokoroTTS.from_pretrained() → StyleTextToSpeech2Model.from_pretrained().
+*
+* Runs on WebGPU when available (Kokoro v1.0 supports it — ~10s of speech in
+* ~1s vs 5–15s/chunk on CPU/WASM), falling back to WASM if WebGPU is missing
+* or the GPU load fails at runtime. WebGPU prefers fp32 weights; WASM uses q8.
*/
async function loadKokoroManual(modelId, progressCb) {
-const model = await StyleTextToSpeech2Model.from_pretrained(modelId, {
-dtype: 'q8',
-progress_callback: progressCb,
-});
-const tokenizer = await AutoTokenizer.from_pretrained(modelId, {
-progress_callback: progressCb,
-});
-return new KokoroTTS(model, tokenizer);
+const useGPU = await detectWebGPU();
+
+async function build(device, dtype) {
+const model = await StyleTextToSpeech2Model.from_pretrained(modelId, {
+dtype,
+device,
+progress_callback: progressCb,
+});
+const tokenizer = await AutoTokenizer.from_pretrained(modelId, {
+progress_callback: progressCb,
+});
+ttsDevice = device;
+return new KokoroTTS(model, tokenizer);
+}
+
+if (useGPU) {
+try {
+return await build('webgpu', 'fp32');
+} catch (gpuErr) {
+console.warn('[TTS Worker] WebGPU load failed, falling back to WASM:', gpuErr);
+}
+}
+return await build('wasm', 'q8');
}

/**
@@ -452,7 +485,8 @@ self.addEventListener('message', async (e) => {
self.postMessage({
type: 'status',
status: 'ready',
-message: '🔊 Kokoro TTS ready',
+message: ttsDevice === 'webgpu' ? '🔊 Kokoro TTS ready (GPU)' : '🔊 Kokoro TTS ready',
+device: ttsDevice,
voices,
});
