Flexible KV cache reuse for knowledge-intensive LLM serving
Built on vLLM and LMCache. Designed for compute-network-aware knowledge injection across LLM systems.
CacheRoute is a lightweight LLM scheduling framework built on vLLM and LMCache to enable flexible KV cache reuse across LLM systems. It targets knowledge-intensive LLM services, such as browser AI and knowledge QA systems, where many requests repeatedly use the same external knowledge. Existing systems usually prepend long knowledge texts to the user question and send the whole prompt to the model for recomputation. Although this approach helps reduce model hallucination and improve answer quality, it introduces heavy prefill overhead and causes redundant computation when the same knowledge appears across many requests.
CacheRoute addresses this problem by using Knowledge Delivery Network (KDN) servers to store KVCache blocks for popular knowledge. For each request, CacheRoute dynamically chooses between text-based injection and KVCache-based injection according to task queues, compute load, and network load. In this way, CacheRoute shifts knowledge injection cost between compute and network resources, improving task latency and system throughput.
- 🚀 Less redundant prefill computation: reuse repeated knowledge through KV cache instead of recomputing long prompts.
- 🔁 Cross-system KV cache reuse: share reusable knowledge across LLM systems through KDN servers.
- 🌐 Compute-network coordination: dynamically choose between recomputation and KV cache injection based on real-time resource load.
CacheRoute reduces average time to first token (TTFT), improves system throughput, and enables more effective KVCache reuse under knowledge-intensive workloads.
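The compute-versus-network trade-off described above can be illustrated with a minimal sketch. This is not CacheRoute's actual predictor: the linear cost model, the `LoadSnapshot` fields, and the function name `choose_injection` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """Hypothetical view of current resource load (every field is an assumption)."""
    prefill_tokens_per_s: float  # effective prefill throughput of the LLM system
    compute_queue_s: float       # estimated wait in the compute queue
    network_bytes_per_s: float   # available bandwidth from the KDN server
    network_queue_s: float       # estimated wait in the transfer queue


def choose_injection(knowledge_tokens: int, kv_bytes: int, load: LoadSnapshot) -> str:
    """Return "text" or "kvcache", whichever has the lower estimated cost."""
    # Text injection pays for queueing plus recomputing the knowledge prefill.
    text_cost = load.compute_queue_s + knowledge_tokens / load.prefill_tokens_per_s
    # KVCache injection pays for queueing plus moving the KV blocks over the network.
    kv_cost = load.network_queue_s + kv_bytes / load.network_bytes_per_s
    return "kvcache" if kv_cost < text_cost else "text"


busy_gpu = LoadSnapshot(5000.0, 0.5, 1e9, 0.0)
busy_network = LoadSnapshot(20000.0, 0.0, 1e8, 1.0)
print(choose_injection(4000, 500_000_000, busy_gpu))      # kvcache
print(choose_injection(4000, 500_000_000, busy_network))  # text
```

With a loaded GPU and an idle network the sketch prefers KVCache injection; when the network is congested it falls back to text recomputation.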
| Feature | Description |
|---|---|
| ⚙️ Compute-network-aware knowledge injection | CacheRoute dynamically chooses between text recomputation and KVCache reuse. It predicts task cost at the proxy and selects the injection strategy based on current task queues, compute load, and network load. |
| 🧭 Knowledge-oriented cross-system routing | CacheRoute parses the knowledge requirement before resource-pool scheduling. The scheduler jointly considers knowledge availability, system load, and topology information, and routes requests to the LLM system that can serve the required knowledge more efficiently. |
| 🗂️ KDN-based KV cache management | CacheRoute follows the Knowledge Delivery Network (KDN) idea, using dedicated KDN servers to register, store, query, and inject KV cache blocks for reusable knowledge. This allows external knowledge to be reused across LLM systems instead of being repeatedly recomputed. |
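As a rough illustration of knowledge-oriented routing, the sketch below scores each candidate LLM system by knowledge coverage, load, and topology distance. The `PoolState` fields, the weights, and the linear scoring rule are hypothetical; the real Scheduler policy is documented in `scheduler/README.md`.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical per-pool state visible to a scheduler."""
    name: str
    known_ids: frozenset[str]  # knowledge IDs whose KV cache this pool can reach
    load: float                # 0.0 (idle) to 1.0 (saturated)
    hops: int                  # topology distance from the scheduler


def route(required_ids: set[str], pools: list[PoolState],
          w_cov: float = 1.0, w_load: float = 0.5, w_hop: float = 0.1) -> str:
    """Pick the pool with the best coverage/load/topology score (assumed weights)."""
    def score(pool: PoolState) -> float:
        if required_ids:
            coverage = len(required_ids & pool.known_ids) / len(required_ids)
        else:
            coverage = 0.0  # no knowledge needed: only load and topology matter
        return w_cov * coverage - w_load * pool.load - w_hop * pool.hops

    return max(pools, key=score).name


pools = [
    PoolState("pool-a", frozenset({"k1", "k2"}), load=0.6, hops=1),
    PoolState("pool-b", frozenset(), load=0.1, hops=1),
]
print(route({"k1", "k2"}, pools))  # pool-a: full coverage outweighs its higher load
print(route(set(), pools))         # pool-b: no knowledge needed, so the idle pool wins
```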
CacheRoute separates global routing, local injection decision, and KV cache management into Scheduler, Proxy, Instance, and KDN Server.
- Scheduler: performs global resource-pool selection and knowledge-oriented task routing.
- Proxy: manages local task queues and selects the knowledge injection strategy.
- Instance: connects CacheRoute with vLLM + LMCache and handles execution signaling.
- KDN Server: stores reusable knowledge and injects KVCache blocks when needed.
| Component | Service Plane | Control Plane |
|---|---|---|
| Scheduler | 7001 | 7002 |
| Proxy | 8001 | 8002 |
| Instance | 9001 | - |
| vLLM | 8000 | - |
| KDN Server | 9101 | - |
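For scripts that need to address the components, the port table can be captured in a small lookup. The dictionary layout and the helper name `endpoint` are illustrative only; the port numbers are the defaults from the table, and the loopback host matches the single-machine setup.

```python
# Default ports from the table above; None means the component has no control plane.
PORTS = {
    "scheduler": {"service": 7001, "control": 7002},
    "proxy": {"service": 8001, "control": 8002},
    "instance": {"service": 9001, "control": None},
    "vllm": {"service": 8000, "control": None},
    "kdn_server": {"service": 9101, "control": None},
}


def endpoint(component: str, plane: str = "service", host: str = "127.0.0.1"):
    """Return the base URL for a component's plane, or None if it has no such plane."""
    port = PORTS[component][plane]
    return None if port is None else f"http://{host}:{port}"


print(endpoint("scheduler"))         # http://127.0.0.1:7001
print(endpoint("proxy", "control"))  # http://127.0.0.1:8002
print(endpoint("vllm", "control"))   # None
```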
- The Client sends an OpenAI-compatible request to the Scheduler.
- The Scheduler analyzes the knowledge requirement and selects a target resource pool.
- The Proxy predicts the cost of text-based and KVCache-based injection.
- The KDN Server injects reusable KVCache blocks when KVCache reuse is selected.
- The Instance forwards the request to vLLM + LMCache and returns the response.
CacheRoute has been tested with the following core environment:
| Component | Version |
|---|---|
| Python | 3.12.11 |
| vLLM | 0.13.x |
| LMCache | 0.3.x |
| PyTorch | 2.9.x |
| Redis | 7 |
| CUDA GPUs | Required for full LLM serving |
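A quick way to compare an existing environment with the tested versions is sketched below. It assumes the packages are installed under the distribution names `vllm`, `lmcache`, and `torch`, and it only compares version prefixes; it is a convenience check, not part of CacheRoute.

```python
from importlib import metadata

# Version prefixes taken from the table above; distribution names are assumptions.
TESTED = {"vllm": "0.13.", "lmcache": "0.3.", "torch": "2.9."}


def check_environment(lookup=metadata.version) -> dict:
    """Map each package to "ok", "mismatch (<found>)", or "missing"."""
    report = {}
    for package, prefix in TESTED.items():
        try:
            found = lookup(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        report[package] = "ok" if found.startswith(prefix) else f"mismatch ({found})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```

The `lookup` parameter exists only so the check can be exercised without the packages installed.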
Install Python dependencies with:

    pip install -r requirements.txt

CacheRoute provides two ways to get started.
Use the demo scripts to understand the CacheRoute scheduling workflow. Run each demo in its own terminal, since the components keep running and wait for requests.

    cd test
    python3 demo_scheduler.py --cacheroute
    python3 demo_kdn.py
    python3 demo_proxy.py --strategy round_robin --injection-strategy iws --ready-release-policy text_bypass
    python3 demo_instance.py --port 9001 --host 127.0.0.1
    python3 demo_client.py --with-ui

For full deployment with vLLM, LMCache, Redis, KDN warm-up, and KVCache injection, see:

- `env/README.md` for environment setup.
- `kdn_server/README.md` for KDN registration and KVCache injection.
- `core/README.md` for multi-machine configuration.
Full single-machine deployment guide
- Place the whole CacheRoute project under `/workspace/`.

- Create a new container that supports vLLM. The required image is `cacheroute:vllm0.13-lmcache3.11-pytorch2.9.1`, built from source. If you are unsure how to set up the CacheRoute environment or download models, see `/env/README.md`.

      sudo docker run --gpus all -it --name CacheRoute --network host --ipc=host \
        --shm-size=64g --ulimit memlock=-1 --ulimit stack=67108864 \
        --memory=0 --memory-swap=0 -p 8000:8000 \
        -v /llm-stack:/workspace/llm-stack \
        cacheroute:vllm0.13-lmcache3.11-pytorch2.9.1 bash

- Start and enter the container. This is useful when you need to open multiple container terminals.

      sudo docker start CacheRoute
      sudo docker exec -it CacheRoute bash

  First, start a Redis container, which later serves as the KVCache store for `LMcache_connector`.

      sudo docker run -d --name lmcache-redis --network host redis:7 \
        redis-server --bind 0.0.0.0 --protected-mode no --save "" --appendonly no \
        --maxmemory 200gb --maxmemory-policy allkeys-lru

- Configure the required parameters in `core/config.py` according to the actual model download paths. The Scheduler strongly depends on the embedding model, tokenizer, and LLM.

      DEFAULT_MODEL: Path of the LLM to run
      DEFAULT_MODEL_SHORTNAME: Short name of the LLM, used by later vLLM startup commands
      SCHEDULER/PROXY/INSTANCE/KDN_LOG_FILE: Log output paths of Scheduler/proxy/instance/kdn, <path-to-Cacheroute/log/**>
      EMBEDDING_MODEL: Actual path of the locally downloaded embedding model, <path-to-Cacheroute/model/embedder/**>
      DEFAULT_EMBED_MODEL: Embedding model name, used to download from Hugging Face when EMBEDDING_MODEL is not configured
      ...

  There are many other parameters. See `core/config.py` for detailed descriptions, and see `test/demo_***` for usage examples.

  4.2 To enable KVCache reuse across containers, CacheRoute replaces the unstable `builtin+SEED` key generation method with `sha256_cbor`. Because the output format of the two methods differs, CacheRoute patches `token_database.py`. You therefore need to replace `lmcache/v1/token_database.py` and `lmcache/v1/memory_management.py` in the LMCache source code with `CacheRoute/env/token_database.py` and `CacheRoute/env/memory_management.py`.

  4.3 CacheRoute supports interconnection and scheduling across multi-level inference resource pools. For a quick demo on a single device, this tutorial uses a single-machine setup. It connects `scheduler`, `proxy`, `instance`, and `kdn_server` through loopback addresses and separates modules by ports. For multi-machine experiments, you need to modify the related configurations in `config.py` and `demo`. See `core/README.md` for details.

- To enable the TTFT predictor in the Proxy, you need to complete offline regression in advance: profile the model performance under different batch sizes and sequence lengths, and then configure the predictor parameters. See `/instance/TTFT_predictor/README.md` for quickly collecting model regression data. See `proxy/metric` for Proxy predictor regression.

- Start the vLLM 0.13 + LMCache 3.11 service without PD disaggregation. The following command starts a Llama 3 70B model with tensor parallelism 8 (TP8). Adjust it according to your needs. Also make sure that `USE_MOCK = False` in `CacheRoute/core/config.py`.

      export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
      export PYTORCH_ALLOC_CONF=expandable_segments:True
      export MODEL_DIR=/workspace/llm-stack/models/LLM-Research/Meta-Llama-3-70B-Instruct
      export LMCACHE_CONFIG_FILE=/workspace/llm-stack/config/lmcache_with_redis.yaml
      export PYTHONHASHSEED=0
      export OMP_NUM_THREADS=8
      # Stop any previous vLLM server before starting a new one.
      pkill -f vllm || true
      pkill -f api_server || true
      python3 -m vllm.entrypoints.openai.api_server \
        --model "$MODEL_DIR" \
        --served-model-name llama3-70b \
        --host 0.0.0.0 --port 8000 \
        --tensor-parallel-size 8 \
        --gpu-memory-utilization 0.75 \
        --dtype auto \
        --max-model-len 4096 \
        --max-num-seqs 8 \
        --max-num-batched-tokens 16384 \
        --kv-offloading-backend lmcache \
        --kv-offloading-size 64 \
        --disable-hybrid-kv-cache-manager \
        --kv-cache-metrics

  (5.1) Note that `LMCACHE_CONFIG_FILE` affects LMCache caching. CacheRoute needs Redis-server-based KV caching to be enabled. The current `lmcache.yaml` configuration is:

      chunk_size: 256
      pre_caching_hash_algorithm: "sha256_cbor"
      local_cpu: true
      max_local_cpu_size: 80.0
      remote_url: "redis://127.0.0.1:6379"
      remote_serde: "cachegen"
      local_disk: null
      max_local_disk_size: 0
      save_decode_cache: false
      cache_policy: "LRU"
      numa_mode: null

- Test whether the vLLM service starts correctly. Open a new container terminal and run the following command. Note that the URL depends on the listening port and network interface of the vLLM instance.

      curl http://127.0.0.1:8000/v1/models

- Prepare the environment and warm up the Scheduler knowledge list. First, install the dependencies in `requirements.txt` with `python -m pip install -r requirements.txt`.

- Enter the `test` directory and start the CacheRoute Scheduler. See `/scheduler/README.md` for parameter options.

      python3 demo_scheduler.py --cacheroute --kdn-pending-overload-th 8 --kdn-active-overload-th 4 --kdn-queue-ms-overload-th 30 --cacheroute-log-decision 1

- Warm up the KDN server. Run `demo_kdn.py` to start the KDN server through `kdn_api`. Then open a new terminal and run `kdn_register_cli.py` under `kdn_server`. This is a packaged interactive interface: it takes knowledge block texts as input, registers the text and KVCache blocks, and builds the knowledge base. See `kdn_server/README.md` for details.

- After KDN warm-up, start the Proxy, Instance, and Client demos in order. For local IDE debugging, you can directly use `demo_run`. Note: the startup order matters. The KDN server and the Proxy register with the Scheduler after startup and then exchange resource information with it. The Instance registers with the Proxy in the same way. A wrong startup order may make the resource pool unstable. The safest startup order is `[Scheduler]-[KDN_Server]-[Proxy]-[Instance]`. Also, the default Proxy injection strategy is `text`. After the `iws` strategy is enabled, the Proxy takes over injection strategy selection. In this case, the `Injection-type` sent by the client is overwritten and has no effect.

      python3 demo_proxy.py --strategy round_robin --injection-strategy iws --ready-release-policy text_bypass
      python3 demo_instance.py --port <default 9001> --host <xxx>
      python3 demo_client.py --with-ui   # recommended: starts the UI version and supports automatic request validation
      # or, without the UI: python3 demo_client.py
Note: If an import error occurs, add the project path to the container environment, then open a new shell or run `source ~/.bashrc` for the change to take effect:

    echo 'export PYTHONPATH=/workspace/llm-stack/CacheRoute' >> ~/.bashrc
- After the Scheduler, Proxy, and Instance start, they publish INFO logs and wait for requests. After all components are ready, switch to the client. When `<client>` is shown, you can enter HTTP requests for a quick demo. Note that the URL must be the listening address and port of the Scheduler, so that HTTP requests can be parsed and forwarded to the Scheduler. The `curl` requests to `/v1/chat/completions` and `/v1/completions` in this document can serve as local test requests.
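The recommended startup order `[Scheduler]-[KDN_Server]-[Proxy]-[Instance]` can be scripted. The sketch below launches the demo commands from this guide one by one and waits until each component's service port accepts connections before starting the next. Treating "port is open" as "component is ready and registered" is an assumption, as are the helper names `port_open` and `start_all`; the script is meant to be run from the `test` directory.

```python
import socket
import subprocess
import time

# (component, service port to wait for, command), in the recommended startup order.
STARTUP_PLAN = [
    ("scheduler", 7001, ["python3", "demo_scheduler.py", "--cacheroute"]),
    ("kdn_server", 9101, ["python3", "demo_kdn.py"]),
    ("proxy", 8001, ["python3", "demo_proxy.py", "--strategy", "round_robin",
                     "--injection-strategy", "iws",
                     "--ready-release-policy", "text_bypass"]),
    ("instance", 9001, ["python3", "demo_instance.py",
                        "--port", "9001", "--host", "127.0.0.1"]),
]


def port_open(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=0.5):
            return True
    except OSError:
        return False


def start_all(launch=subprocess.Popen, probe=port_open,
              timeout_s: float = 60.0, poll_s: float = 0.5) -> list:
    """Launch each component only after the previous one accepts connections."""
    started = []
    for name, port, command in STARTUP_PLAN:
        process = launch(command)
        deadline = time.monotonic() + timeout_s
        while not probe(port):
            if time.monotonic() > deadline:
                raise TimeoutError(f"{name} did not open port {port}")
            time.sleep(poll_s)
        started.append((name, process))
    return started
```

Calling `start_all()` with the defaults starts the four demos as child processes. The `launch` and `probe` parameters are there so the ordering logic can be checked without starting real servers.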
CacheRoute exposes OpenAI-compatible API endpoints through the Scheduler.
| Endpoint | Mode |
|---|---|
| `/v1/chat/completions` | Chat completion |
| `/v1/completions` | Completion |
    curl http://127.0.0.1:7001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3-70b",
        "messages": [{"role": "user", "content": "What is DeepSeek"}],
        "max_tokens": 64,
        "stream": false,
        "RAG": true
      }'

    curl http://127.0.0.1:7001/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3-70b",
        "prompt": "What is DeepSeek",
        "max_tokens": 64,
        "RAG": true
      }'

| Option | Required | Description |
|---|---|---|
| `model` | Yes | Model name served by vLLM. |
| `messages` / `prompt` | Yes | Input content for chat or completion mode. |
| `max_tokens` | No | Maximum number of generated tokens. |
| `stream` | No | Whether to enable streaming responses. |
| `RAG` | No | Whether to enable knowledge injection. |
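The same chat request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the `curl` example: the helper names `build_chat_request` and `ask` are illustrative, and the default URL assumes the Scheduler is listening on `127.0.0.1:7001`.

```python
import json
import urllib.request


def build_chat_request(question: str, model: str = "llama3-70b", rag: bool = True,
                       max_tokens: int = 64,
                       base_url: str = "http://127.0.0.1:7001") -> urllib.request.Request:
    """Build a POST request for the Scheduler's chat completion endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "stream": False,
        "RAG": rag,  # enables knowledge injection, as in the options table
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, **kwargs) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_chat_request(question, **kwargs), timeout=120) as resp:
        return json.load(resp)
```

With all components running, `ask("What is DeepSeek")` returns the response body as a dictionary. If the response follows the usual OpenAI chat format, the answer text would be under `["choices"][0]["message"]["content"]`, but that layout is not confirmed here.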
View runtime screenshots
The Scheduler selects KDN and Proxy according to knowledge coverage, topology, and current load.

The Proxy maintains local task queues and prepares requests for instance-level execution.

The Proxy dynamically chooses between text-based injection and KVCache-based injection.

The instance reuses injected KVCache blocks through LMCache.

The client receives OpenAI-compatible responses through the Scheduler endpoint.

CacheRoute is under active development. The current release supports:
- Scheduler-side knowledge-oriented routing.
- KDN selection based on knowledge coverage and overload filtering.
- Proxy selection based on topology, load safety window, and knowledge history.
- Proxy-side dynamic injection strategy selection.
- KDN-based text registration and KVCache registration.
- Debugging APIs such as `/debug/status` and `/debug/strategy`.
Suggested minimum validation commands:

    cd test
    python3 demo_scheduler.py --cacheroute
    curl -s http://127.0.0.1:7001/debug/status
    curl -s http://127.0.0.1:7001/debug/strategy

- Scheduler-side knowledge-oriented routing
- Proxy-side dynamic injection strategy selection
- KDN-based text and KVCache registration
- OpenAI-compatible request forwarding
- More deployment examples
- Benchmark scripts and reproducible evaluation
- More KV cache placement policies
- Paper and citation release
| Document | Description |
|---|---|
| `env/README.md` | Environment setup and vLLM + LMCache installation. |
| `kdn_server/README.md` | KDN server, knowledge registration, and KVCache injection. |
| `core/README.md` | Core configuration and multi-machine setup. |
| `scheduler/README.md` | Scheduler parameters and routing strategies. |
| `doc/blog` | Development logs and update notes. |
| `doc/integrations/lmcache.md` | CacheRoute integration with LMCache. |

