feat: onnx-community upgrades — Kokoro WebGPU, Moonshine STT, SmolVLM, transformers.js v4 #6

Merged
ijbo merged 2 commits into main from feat/onnx-community-upgrades-v2 on Jun 22, 2026

ijbo commented Jun 22, 2026

What

Four upgrades sourced from the onnx-community HuggingFace org, each filtered for browser-runnability (small, permissive, transformers.js-compatible).

Base branch: this is stacked on top of fix/stt-audio-improvements (PR #5) — it extends the STT tier system added there. Merge #5 first, or set this PR's base to fix/stt-audio-improvements.

Changes

🔊 1. Kokoro TTS on WebGPU (tts-worker.js)

Kokoro now loads on device:'webgpu' (fp32) when an adapter is available, falling back to WASM (q8). Fixes the audit's #1 audio bottleneck — CPU-only synthesis at 5–15 s/chunk; on GPU, Kokoro v1.0 does ~10 s of speech in ~1 s. You already ship this model; it's a runtime change.
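
A minimal sketch of the backend choice described above, assuming the worker probes `navigator.gpu` before loading the model. The helper name `pickKokoroBackend` and the shape of the returned object are illustrative and are not taken from tts-worker.js; only the `webgpu`/fp32 and `wasm`/q8 pairing comes from this PR.

```js
// Hypothetical helper: choose Kokoro load options from WebGPU availability.
// `nav` is injected so the function can be exercised outside a browser.
async function pickKokoroBackend(nav = globalThis.navigator) {
  try {
    // navigator.gpu is undefined where WebGPU is unsupported, and
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = nav?.gpu ? await nav.gpu.requestAdapter() : null;
    if (adapter) return { device: 'webgpu', dtype: 'fp32' };
  } catch (err) {
    // Treat a failed probe as "no GPU" and fall through to WASM.
  }
  return { device: 'wasm', dtype: 'q8' };
}
```

The result would presumably be passed as the options object when the model is loaded (kokoro-js exposes `KokoroTTS.from_pretrained(modelId, options)`); how the worker actually wires this up is not shown in the PR.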

🌙 2. Moonshine fast-English STT (speech-worker.js, ai-docgen.js)

onnx-community/moonshine-base-ONNX (MIT) added as a moonshine worker tier and a 4th {{@STT:}} engine option. English-only, non-Whisper — so it skips the Whisper streamer/language path and the org fallback. Routed via a new setWhisperTier() API.
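
One way to picture the routing described above is a tier table that gates the Whisper-only features. This is a sketch under assumptions: `STT_TIERS`, `buildSttOptions`, and the option names `useStreamer` and `allowOrgFallback` are hypothetical, and the real `setWhisperTier()` signature is not shown in this PR.

```js
// Hypothetical tier table. Only the Moonshine repo ID and its English-only,
// non-Whisper nature come from the PR; every other field is illustrative.
const STT_TIERS = {
  whisper: { family: 'whisper', englishOnly: false },
  moonshine: {
    family: 'moonshine',
    englishOnly: true,
    modelId: 'onnx-community/moonshine-base-ONNX',
  },
};

// Build per-request options for a tier. Whisper-specific features
// (streamer, language hint, org fallback) are enabled only for Whisper.
function buildSttOptions(tierName, language) {
  const tier = STT_TIERS[tierName];
  if (!tier) throw new Error(`unknown STT tier: ${tierName}`);
  const isWhisper = tier.family === 'whisper';
  const options = { useStreamer: isWhisper, allowOrgFallback: isWhisper };
  if (isWhisper && language) options.language = language;
  return options;
}
```

With this shape, `buildSttOptions('moonshine', 'fr')` ignores the language hint, which matches the English-only behaviour described above.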

🖼️ 3. SmolVLM lightweight vision (ai-models.js, ai-worker-smolvlm.js)

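SmolVLM-256M-Instruct and SmolVLM-500M-Instruct registered with a new worker (image-text-to-text, WebGPU/WASM, streaming) as a ~270–500 MB alternative to Gemma 4 Vision (~2–4 GB) / Florence-2 for low-end devices. The ONNX exports load from the HuggingFaceTB repos; the onnx-community/SmolVLM-*-ONNX IDs used initially returned 404 (see the follow-up commit).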

⚙️ 4. transformers.js v3.8.1 → v4.2.0 (package.json)

The in-browser ML runtime for every local model worker. v4 brings a C++ WebGPU runtime, ~200 architectures, 53% smaller bundles, 10× faster builds. The APIs the workers use are unchanged across the major.

Testing

  • Vite build: clean on v4.
  • Playwright: 434 passed (full smoke + feature suite) on v4; TTS/STT/speech suites green. Updated STT engine-selector test (3 → 4 options).
  • Live runtime verification (the important one for v4): drove a real Kokoro synthesis in the browser against v4 — the model downloaded, loaded its 54-voice list, and speakAsync(...) resolved with hasAudio: true. This confirms kokoro-js works on v4 despite declaring a v3 peer dep.
  • ESLint: no new errors.

Notes

  • kokoro-js@1.2.1 (latest) declares transformers@^3.5.1 as a peer dep — it doesn't officially support v4, but the APIs it calls are unchanged and a live synthesis confirms it works. Flagging for awareness.
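
To make the peer-dependency mismatch concrete, here is a small caret-range check. It is a simplified sketch that handles only `^MAJOR.MINOR.PATCH` ranges with a non-zero major, not a replacement for a real semver library; the function names are illustrative.

```js
// Split "3.8.1" into [3, 8, 1].
function parseVersion(v) {
  return v.split('.').map(Number);
}

// Simplified caret check: same major, and at least the stated minor.patch.
function satisfiesCaret(version, range) {
  if (!range.startsWith('^')) throw new Error('only caret ranges are handled');
  const [major, minor, patch] = parseVersion(version);
  const [rMajor, rMinor, rPatch] = parseVersion(range.slice(1));
  if (rMajor === 0) throw new Error('0.x caret ranges are not handled in this sketch');
  if (major !== rMajor) return false;
  if (minor !== rMinor) return minor > rMinor;
  return patch >= rPatch;
}

console.log(satisfiesCaret('3.8.1', '^3.5.1')); // true: the old version is in range
console.log(satisfiesCaret('4.2.0', '^3.5.1')); // false: v4 is outside the declared range
```

Because 4.2.0 falls outside `^3.5.1`, a package manager may warn or refuse to resolve the peer dependency depending on its settings, even though the live synthesis shows the calls work.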

🤖 Generated with Claude Code

ijbo and others added 2 commits June 22, 2026 19:24
…, transformers.js v4

- Kokoro TTS runs on WebGPU when available (WASM fallback) — fixes the CPU-only
  5-15s/chunk bottleneck (tts-worker.js)
- Moonshine added as a fast English STT engine (onnx-community/moonshine-base,
  MIT): new worker tier + {{@stt:}} card option + setWhisperTier() API
- SmolVLM 256M/500M registered with a new public/ai-worker-smolvlm.js worker —
  a lightweight image-text-to-text alternative to Gemma 4 Vision / Florence-2
- transformers.js bumped 3.8.1 -> 4.2.0 (every local model worker). APIs used
  are unchanged; verified Kokoro TTS synthesizes live on v4 despite kokoro-js's
  v3 peer-dep declaration.

Verified: vite build clean; 434 Playwright tests pass on v4; live Kokoro
synthesis resolved with audio. STT engine-selector test updated 3 -> 4 options.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e map

Live testing caught wrong repo IDs (onnx-community/SmolVLM-*-ONNX 404'd with
"Unauthorized access"). The ONNX exports live under HuggingFaceTB/SmolVLM-256M-Instruct
and -500M-Instruct. Also switched to a per-component dtype map (embed_tokens fp16,
vision_encoder/decoder_model_merged q4) since SmolVLM ships as three ONNX files.

Verified end-to-end: loaded 256M on WebGPU and captioned a test image correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
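
The per-component dtype map from the commit above could be expressed roughly as follows. The registry keys and the `smolvlmLoadArgs` helper are hypothetical, and the 500M repo ID is expanded from the commit's "-500M-Instruct" shorthand; only the three component dtypes and the HuggingFaceTB org come from the commit message.

```js
// Per-component dtypes as described in the commit message.
const SMOLVLM_DTYPE = {
  embed_tokens: 'fp16',
  vision_encoder: 'q4',
  decoder_model_merged: 'q4',
};

// Hypothetical registry keys mapped to the corrected repo IDs.
const SMOLVLM_MODELS = {
  'smolvlm-256m': 'HuggingFaceTB/SmolVLM-256M-Instruct',
  'smolvlm-500m': 'HuggingFaceTB/SmolVLM-500M-Instruct',
};

// Build [modelId, options] for a from_pretrained-style loader.
function smolvlmLoadArgs(key, hasWebGPU) {
  const modelId = SMOLVLM_MODELS[key];
  if (!modelId) throw new Error(`unknown SmolVLM variant: ${key}`);
  return [
    modelId,
    { dtype: { ...SMOLVLM_DTYPE }, device: hasWebGPU ? 'webgpu' : 'wasm' },
  ];
}
```

A worker might then call something like `AutoModelForVision2Seq.from_pretrained(...smolvlmLoadArgs('smolvlm-256m', true))`; the exact loader class used in ai-worker-smolvlm.js is an assumption here, since the PR does not show it.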
ijbo changed the base branch from fix/stt-audio-improvements to main June 22, 2026 13:08
ijbo merged commit b0d86df into main Jun 22, 2026