Famous Vision Language Models and Their Architectures
Updated Jan 11, 2026 - Markdown
ComfyUI-QwenVL custom node: Integrates the Qwen-VL series, including Qwen2.5-VL and the latest Qwen3-VL, with GGUF support for advanced multimodal AI in text generation, image understanding, and video analysis.
A collection and survey of frontier vision-language model papers and model GitHub repositories. Continuously updated.
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Fast GPU OCR server. 270 img/s on FUNSD. TensorRT FP16, PP-OCRv5, HTTP + gRPC.
Reinforcement Learning of Vision Language Models with Self Visual Perception Reward
Mark web pages for use with vision-language models
Local Video RAG Engine. A FastAPI microservice for video understanding: Scene Detection + Whisper ASR + Qwen3-VL. Optimized for Apple Silicon (MLX) & Windows/Linux (Llama.cpp).
An AI agent that can control your screen to complete any task
Self-evolving agentic reward framework for image-editing evaluation — 47.4% on EditReward-Bench from only 100 preference demos, no reward-model training. arXiv 2605.08703.
Give DeepSeek eyes: an MCP Server plus Qwen-VL (Tongyi Qianwen VL) that passes clipboard images to a vision model and returns a text description
🎬 Extract AI prompts from video using Vision LLM (llama.cpp API) — Gradio WebUI + CLI
DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
A robotic sequential grasping system integrating YOLO detection and Qwen-VLM fine-tuning, enabling a full loop from manual teaching to LLM-based logical manipulation.
Qwen-VL base model for use with Autodistill.
Run Qwen3-8B at ~18 tok/s on a Core Ultra laptop iGPU. Local-only LLM + VLM + ReAct agent stack with $0 token cost. Drop-in Claude Code backend.
🤖 The Next-Gen AI Agent. Unlike normal agents, it goes beyond text and can control your Desktop & Android.
Enable local integration of Qwen3.5 models with ComfyUI for text generation and multimodal visual tasks, featuring automatic model management and precision control.
Creates text from video and audio using Qwen-VL and Whisper