GitHub - BJTU-ANT/CacheRoute: CacheRoute is an innovative LLM scheduling scheme dedicated to enabling flexible KV cache reuse across LLM systems, improving task performance and system efficiency.

Flexible KV cache reuse for knowledge-intensive LLM serving

Built on vLLM and LMCache. Designed for compute-network-aware knowledge injection across LLM systems.


CacheRoute

CacheRoute is a lightweight LLM scheduling framework built on vLLM and LMCache to enable flexible KV cache reuse across LLM systems. It targets knowledge-intensive LLM services, such as browser AI and knowledge QA systems, where many requests repeatedly use the same external knowledge. Existing systems usually prepend long knowledge texts to the user question and send the whole prompt to the model for recomputation. Although this approach helps reduce model hallucination and improve answer quality, it introduces heavy prefill overhead and causes redundant computation when the same knowledge appears across many requests.

CacheRoute addresses this problem by using KDN servers to store KVCache blocks for popular knowledge. For each request, CacheRoute dynamically chooses between text-based injection and KVCache-based injection according to task queues, compute load, and network load. In this way, CacheRoute shifts knowledge injection cost between compute and network resources, improving task latency and system throughput.
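The trade-off described above can be illustrated with a minimal sketch. It is not CacheRoute's actual cost model: the `LoadSnapshot` fields, the two cost functions, and the simple "queue plus own work divided by rate" estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """Hypothetical view of the load a Proxy might observe (illustrative only)."""
    queued_prefill_tokens: int    # tokens already waiting for prefill
    prefill_tokens_per_s: float   # assumed prefill throughput of the instance
    queued_transfer_bytes: int    # bytes already waiting on the KDN link
    link_bytes_per_s: float       # assumed KDN-to-instance bandwidth


def text_injection_cost(knowledge_tokens: int, load: LoadSnapshot) -> float:
    """Estimated seconds to recompute the knowledge as prompt text (compute-bound)."""
    return (load.queued_prefill_tokens + knowledge_tokens) / load.prefill_tokens_per_s


def kv_injection_cost(kv_bytes: int, load: LoadSnapshot) -> float:
    """Estimated seconds to ship precomputed KVCache blocks (network-bound)."""
    return (load.queued_transfer_bytes + kv_bytes) / load.link_bytes_per_s


def choose_injection(knowledge_tokens: int, kv_bytes: int, load: LoadSnapshot) -> str:
    """Pick the cheaper path; ties fall back to text-based injection."""
    text_cost = text_injection_cost(knowledge_tokens, load)
    kv_cost = kv_injection_cost(kv_bytes, load)
    return "kvcache" if kv_cost < text_cost else "text"
```

With a busy prefill queue and an idle link the sketch picks `"kvcache"`; with an idle GPU and a congested link it picks `"text"`, which mirrors the idea of shifting injection cost between compute and network.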

Why CacheRoute?

🚀 Less redundant prefill computation: reuse repeated knowledge through KV cache instead of recomputing long prompts.
🔁 Cross-system KV cache reuse: share reusable knowledge across LLM systems through KDN servers.
🌐 Compute-network coordination: dynamically choose between recomputation and KV cache injection based on real-time resource load.

CacheRoute reduces average TTFT, improves system throughput, and enables more effective KVCache reuse under knowledge-intensive workloads.

Key Features

| Feature | Description |
| --- | --- |
| ⚙️ Compute-network-aware knowledge injection | CacheRoute dynamically chooses between text recomputation and KVCache reuse. It predicts task cost at the proxy and selects the injection strategy based on current task queues, compute load, and network load. |
| 🧭 Knowledge-oriented cross-system routing | CacheRoute parses the knowledge requirement before resource-pool scheduling. The scheduler jointly considers knowledge availability, system load, and topology information, and routes each request to the LLM system that can serve the required knowledge most efficiently. |
| 🗂️ KDN-based KV cache management | CacheRoute follows the Knowledge Delivery Network (KDN) idea, using dedicated KDN servers to register, store, query, and inject KV cache blocks for reusable knowledge. This lets external knowledge be reused across LLM systems instead of being repeatedly recomputed. |

Architecture

CacheRoute separates global routing, local injection decision, and KV cache management into Scheduler, Proxy, Instance, and KDN Server.

Scheduler: performs global resource-pool selection and knowledge-oriented task routing.
Proxy: manages local task queues and selects the knowledge injection strategy.
Instance: connects CacheRoute with vLLM + LMCache and handles execution signaling.
KDN Server: stores reusable knowledge and injects KVCache blocks when needed.
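To make the Scheduler's role more concrete, here is a minimal sketch of knowledge-oriented routing. The `PoolState` type, the weights, and the linear scoring formula are hypothetical and only illustrate how knowledge coverage, load, and topology could be combined; the real routing strategy is documented in scheduler/README.md.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical summary of one resource pool as seen by the Scheduler."""
    name: str
    cached_knowledge: frozenset[str]  # knowledge IDs with KVCache available
    load: float                       # 0.0 (idle) .. 1.0 (saturated), assumed scale
    hops: int                         # assumed topology distance to the client


def route(required: set[str], pools: list[PoolState],
          w_cov: float = 1.0, w_load: float = 0.5, w_hop: float = 0.1) -> str:
    """Return the name of the pool with the best illustrative score."""
    def score(pool: PoolState) -> float:
        # Fraction of the required knowledge this pool can already serve.
        coverage = len(required & pool.cached_knowledge) / len(required) if required else 0.0
        return w_cov * coverage - w_load * pool.load - w_hop * pool.hops

    return max(pools, key=score).name
```

In this sketch a loaded pool that already holds the required knowledge can still beat an idle pool that would have to recompute it, and the weights decide where that balance tips.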

Default ports

| Component | Service Plane | Control Plane |
| --- | --- | --- |
| Scheduler | 7001 | 7002 |
| Proxy | 8001 | 8002 |
| Instance | 9001 | - |
| vLLM | 8000 | - |
| KDN Server | 9101 | - |
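When debugging a single-machine setup, it can help to check which of these ports are actually listening. The following is a small standard-library sketch; the `DEFAULT_PORTS` mapping simply restates the table above, and the helper names are made up for this example.

```python
import socket

# Ports copied from the default-ports table; the key names are illustrative.
DEFAULT_PORTS = {
    "scheduler": 7001,
    "scheduler-control": 7002,
    "proxy": 8001,
    "proxy-control": 8002,
    "instance": 9001,
    "vllm": 8000,
    "kdn": 9101,
}


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host: str = "127.0.0.1", ports: dict[str, int] = DEFAULT_PORTS) -> dict[str, bool]:
    """Map each component name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in ports.items()}
```

Calling `report()` after starting the demos returns a dictionary such as `{"scheduler": True, ...}`, which makes a missing component easy to spot. A successful TCP connect only shows that something is listening, not that the component is healthy.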

System Workflow

The Client sends an OpenAI-compatible request to the Scheduler.
The Scheduler analyzes the knowledge requirement and selects a target resource pool.
The Proxy predicts the cost of text-based and KVCache-based injection.
The KDN Server injects reusable KVCache blocks when KVCache reuse is selected.
The Instance forwards the request to vLLM + LMCache and returns the response.

Requirements

CacheRoute has been tested with the following core environment:

| Component | Version |
| --- | --- |
| Python | 3.12.11 |
| vLLM | 0.13.x |
| LMCache | 0.3.x |
| PyTorch | 2.9.x |
| Redis | 7 |
| CUDA GPUs | Required for full LLM serving |

Install Python dependencies with:

pip install -r requirements.txt

Quick Start

CacheRoute provides two ways to get started.

Option 1: Lightweight Demo

Use the demo scripts to understand the CacheRoute scheduling workflow.

cd test

python3 demo_scheduler.py --cacheroute
python3 demo_kdn.py
python3 demo_proxy.py --strategy round_robin --injection-strategy iws --ready-release-policy text_bypass
python3 demo_instance.py --port 9001 --host 127.0.0.1
python3 demo_client.py --with-ui

Option 2: Full CacheRoute Deployment

For full deployment with vLLM, LMCache, Redis, KDN warm-up, and KVCache injection, see:

env/README.md for environment setup.
kdn_server/README.md for KDN registration and KVCache injection.
core/README.md for multi-machine configuration.

Full single-machine deployment guide

Place the whole CacheRoute project under /workspace/.

Create a new container that supports vLLM. The required image is cacheroute:vllm0.13-lmcache3.11-pytorch2.9.1, built from source. For help with deploying the CacheRoute environment or downloading models, see env/README.md.

sudo docker run --gpus all -it --name CacheRoute --network host --ipc=host --shm-size=64g --ulimit memlock=-1 --ulimit stack=67108864 --memory=0 --memory-swap=0 -p 8000:8000 -v /llm-stack:/workspace/llm-stack cacheroute:vllm0.13-lmcache3.11-pytorch2.9.1 bash

Start and enter the container. This is useful when you need to open multiple container terminals.

sudo docker start CacheRoute 
sudo docker exec -it CacheRoute bash

Start a Redis container first. It will later serve as the KVCache store for LMcache_connector.

sudo docker run -d --name lmcache-redis --network host redis:7 redis-server --bind 0.0.0.0 --protected-mode no --save "" --appendonly no --maxmemory 200gb --maxmemory-policy allkeys-lru

Set the required parameters in core/config.py to match the paths where the models were actually downloaded. The Scheduler depends on the embedding model, the tokenizer, and the LLM.
```
DEFAULT_MODEL:                               Path of the LLM to run
DEFAULT_MODEL_SHORTNAME:                     Short name of the LLM, used by later vLLM startup commands
SCHEDULER/PROXY/INSTANCE/KDN_LOG_FILE:       Log output paths of Scheduler/proxy/instance/kdn, <path-to-Cacheroute/log/**>
EMBEDDING_MODEL:                             Actual path of the locally downloaded embedding model, <path-to-Cacheroute/model/embedder/**>
DEFAULT_EMBED_MODEL:                         Embedding model name, used to download from Hugging Face when EMBEDDING_MODEL is not configured
...
```
Many other parameters are available. See core/config.py for detailed descriptions, and see test/demo_*** for usage examples.

To enable KVCache reuse across containers, CacheRoute replaces the unstable builtin+SEED key generation method with sha256_cbor. Because the output format differs, CacheRoute patches token_database.py. You therefore need to replace lmcache/v1/token_database.py and lmcache/v1/memory_management.py in the LMCache source code with CacheRoute/env/token_database.py and CacheRoute/env/memory_management.py.

CacheRoute supports interconnection and scheduling across multi-level inference resource pools. For a quick demo on a single device, this tutorial uses a single-machine setup: the scheduler, proxy, instance, and kdn_server are connected through loopback addresses and separated by ports. For multi-machine experiments, modify the related configuration in config.py and in the demo scripts. See core/README.md for details.

To enable the TTFT predictor in the Proxy, complete the offline regression in advance: profile the model's performance under different batch sizes and sequence lengths, then configure the predictor parameters. See instance/TTFT_predictor/README.md for how to collect model regression data quickly, and see proxy/metric for the Proxy predictor regression.

Start the vLLM 0.13 + LMCache 3.11 service without prefill-decode (PD) disaggregation. The following command starts Meta-Llama-3-70B-Instruct with tensor parallelism 8 (TP8). Adjust it to your hardware and model. Also make sure that USE_MOCK = False in CacheRoute/core/config.py.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTORCH_ALLOC_CONF=expandable_segments:True
export MODEL_DIR=/workspace/llm-stack/models/LLM-Research/Meta-Llama-3-70B-Instruct
export LMCACHE_CONFIG_FILE=/workspace/llm-stack/config/lmcache_with_redis.yaml
export PYTHONHASHSEED=0
export OMP_NUM_THREADS=8

pkill -f vllm || true
pkill -f api_server || true

python3 -m vllm.entrypoints.openai.api_server \
 --model "$MODEL_DIR" \
 --served-model-name llama3-70b \
 --host 0.0.0.0 --port 8000 \
 --tensor-parallel-size 8 \
 --gpu-memory-utilization 0.75 \
 --dtype auto \
 --max-model-len 4096 \
 --max-num-seqs 8 \
 --max-num-batched-tokens 16384 \
 --kv-offloading-backend lmcache \
 --kv-offloading-size 64 \
 --disable-hybrid-kv-cache-manager \
 --kv-cache-metrics

Note that LMCACHE_CONFIG_FILE controls how LMCache caches KV data. CacheRoute requires Redis-server-based KV caching to be enabled. The current LMCache configuration (the file that LMCACHE_CONFIG_FILE points to) is:

chunk_size: 256
pre_caching_hash_algorithm: "sha256_cbor"

local_cpu: true
max_local_cpu_size: 80.0

remote_url: "redis://127.0.0.1:6379"
remote_serde: "cachegen"

local_disk: null
max_local_disk_size: 0

save_decode_cache: false
cache_policy: "LRU"
numa_mode: null
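The `chunk_size` and `pre_caching_hash_algorithm` settings matter for cross-container reuse because every process must derive the same key for the same token chunk. Python's built-in `hash()` for strings and bytes is salted per process unless `PYTHONHASHSEED` is fixed, whereas a SHA-256 digest over a canonical serialization is stable everywhere. The sketch below illustrates chained chunk keys under stated assumptions: it uses JSON from the standard library as a stand-in for CBOR, hashes only full chunks, and is not LMCache's actual implementation.

```python
import hashlib
import json

CHUNK_SIZE = 256  # matches chunk_size in the configuration above


def chunk_keys(token_ids: list[int], chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return one hex key per full chunk; each key also depends on all earlier chunks."""
    keys = []
    prefix = b""  # digest of the previous chunk, empty for the first one
    for start in range(0, len(token_ids) - chunk_size + 1, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Canonical serialization (JSON here; the real setting names CBOR).
        payload = json.dumps(chunk, separators=(",", ":")).encode("utf-8")
        digest = hashlib.sha256(prefix + payload).digest()
        keys.append(digest.hex())
        prefix = digest
    return keys
```

Two requests that share the same leading knowledge tokens produce the same leading keys, regardless of which container computes them, so KVCache blocks stored under those keys in Redis can be found again by any instance.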

Test whether the vLLM service starts correctly. Open a new container terminal and run the following command. Note that the URL depends on the listening port and network interface of the vLLM instance.
```
curl http://127.0.0.1:8000/v1/models
```
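The same check can be scripted in Python, which is convenient when a startup script needs to wait for vLLM. This is a minimal sketch using only the standard library; it assumes the OpenAI-style response shape (`{"object": "list", "data": [{"id": ...}]}`), and the function names are made up for this example.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [model["id"] for model in payload.get("data", [])]


def fetch_models(base_url: str = "http://127.0.0.1:8000", timeout: float = 5.0) -> list[str]:
    """Query the vLLM server and return the names it serves."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

If the server was started with `--served-model-name llama3-70b`, `fetch_models()` should include that name; a connection error means the service is not reachable at the given address.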
Prepare the environment and warm up the Scheduler knowledge list. First, install the dependencies in requirements.txt with python -m pip install -r requirements.txt.

Enter the test directory and start the CacheRoute Scheduler. See /scheduler/README.md for parameter options.

python3 demo_scheduler.py --cacheroute --kdn-pending-overload-th 8 --kdn-active-overload-th 4 --kdn-queue-ms-overload-th 30 --cacheroute-log-decision 1
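The three `--kdn-*-overload-th` flags suggest that the Scheduler filters out overloaded KDN servers before choosing one. The sketch below shows one plausible reading of that filter. The `KdnStats` fields, the `>=` comparisons, the OR combination of the thresholds, and the coverage-based tie-break are assumptions; see scheduler/README.md for the actual semantics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KdnStats:
    """Hypothetical per-KDN metrics reported to the Scheduler."""
    name: str
    pending: int      # requests waiting at the KDN
    active: int       # requests currently being served
    queue_ms: float   # estimated queueing delay in milliseconds
    coverage: float   # fraction of required knowledge held, 0.0 .. 1.0


def is_overloaded(stats: KdnStats, pending_th: int = 8,
                  active_th: int = 4, queue_ms_th: float = 30.0) -> bool:
    """Treat a KDN as overloaded if any metric reaches its threshold (assumed rule)."""
    return (stats.pending >= pending_th
            or stats.active >= active_th
            or stats.queue_ms >= queue_ms_th)


def pick_kdn(candidates: list[KdnStats], **thresholds) -> Optional[str]:
    """Return the best-covering KDN that is not overloaded, or None if all are."""
    healthy = [s for s in candidates if not is_overloaded(s, **thresholds)]
    if not healthy:
        return None
    return max(healthy, key=lambda s: s.coverage).name
```

The default thresholds mirror the values in the command above (8, 4, and 30). Returning `None` when every KDN is overloaded is a design choice of this sketch; a caller could treat it as a signal to fall back to text-based injection.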

Warm up the KDN server. Run demo_kdn.py to start the KDN server through kdn_api. Then open a new terminal and run kdn_register_cli.py under kdn_server. This is a packaged interactive interface. It registers text and KVCache blocks by taking knowledge block texts as input, and then builds the knowledge base. See kdn_server/README.md for details.
After KDN warm-up, start the proxy, instance, and client demos in order. For local IDE debugging, you can use demo_run directly.

Note: The startup order matters. After startup, the KDN server and the Proxy register with the Scheduler and then exchange resource information. The Instance registers with the Proxy in the same way. A wrong startup order may make the resource pool unstable. The safest startup order is [Scheduler]-[KDN_Server]-[Proxy]-[Instance].

Also, the default Proxy injection strategy is text. Once the iws strategy is enabled, the Proxy takes over injection strategy selection, so the Injection-type sent by the client is overwritten and has no effect.
```
python3 demo_proxy.py --strategy round_robin --injection-strategy iws --ready-release-policy text_bypass
python3 demo_instance.py --port <default 9001> --host <xxx>
# Plain client:
python3 demo_client.py
# Or the UI version (recommended; supports automatic request validation):
python3 demo_client.py --with-ui
```

Note: If an import error occurs, add the project path to the container environment: echo 'export PYTHONPATH=/workspace/llm-stack/CacheRoute' >> ~/.bashrc

After the Scheduler, Proxy, and Instance start, they print INFO logs and wait for requests. Once all components are ready, switch to the client. When the <client> prompt appears, you can enter HTTP requests for a quick demo. The request URL must be the listening address and port of the Scheduler, so that the Scheduler can parse and forward the requests. The following section gives local test request examples.

API Usage

CacheRoute exposes OpenAI-compatible API endpoints through the Scheduler.

| Endpoint | Mode |
| --- | --- |
| `/v1/chat/completions` | Chat completion |
| `/v1/completions` | Completion |

Chat Completion

curl http://127.0.0.1:7001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-70b",
    "messages": [{"role": "user", "content": "What is DeepSeek"}],
    "max_tokens": 64,
    "stream": false,
    "RAG": true
  }'

Completion

curl http://127.0.0.1:7001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-70b",
    "prompt": "What is DeepSeek",
    "max_tokens": 64,
    "RAG": true
  }'

Request Options

| Option | Required | Description |
| --- | --- | --- |
| `model` | Yes | Model name served by vLLM. |
| `messages` / `prompt` | Yes | Input content for chat or completion mode. |
| `max_tokens` | No | Maximum number of generated tokens. |
| `stream` | No | Whether to enable streaming responses. |
| `RAG` | No | Whether to enable knowledge injection. |
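For scripted tests, the chat request can also be built in Python with the standard library. This is a minimal sketch, assuming the Scheduler listens on its default service port 7001 and returns an OpenAI-style, non-streaming chat completion; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(model: str, question: str, *, max_tokens: int = 64,
                       stream: bool = False, rag: bool = True,
                       base_url: str = "http://127.0.0.1:7001") -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions with the options listed above."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "stream": stream,
        "RAG": rag,  # CacheRoute-specific flag: enable knowledge injection
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, question: str, **options) -> str:
    """Send a non-streaming request and return the assistant's reply text."""
    request = build_chat_request(model, question, **options)
    with urllib.request.urlopen(request, timeout=60) as resp:
        reply = json.load(resp)
    # Assumes the OpenAI-style response layout for non-streaming chat completions.
    return reply["choices"][0]["message"]["content"]
```

For example, `ask("llama3-70b", "What is DeepSeek")` sends the same payload as the curl command in the Chat Completion section. The `ask` helper does not handle `stream=True`, which would return server-sent events instead of a single JSON body.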

Demo Screenshots


Scheduler task scheduling

The Scheduler selects KDN and Proxy according to knowledge coverage, topology, and current load.

Proxy task scheduling

The Proxy maintains local task queues and prepares requests for instance-level execution.

Injection strategy selection

The Proxy dynamically chooses between text-based injection and KVCache-based injection.

vLLM + LMCache reuse

The instance reuses injected KVCache blocks through LMCache.

Client response

The client receives OpenAI-compatible responses through the Scheduler endpoint.

Current Status

CacheRoute is under active development. The current release supports:

Scheduler-side knowledge-oriented routing.
KDN selection based on knowledge coverage and overload filtering.
Proxy selection based on topology, load safety window, and knowledge history.
Proxy-side dynamic injection strategy selection.
KDN-based text registration and KVCache registration.
Debugging APIs such as /debug/status and /debug/strategy.

Suggested minimum validation commands:

cd test
python3 demo_scheduler.py --cacheroute
curl -s http://127.0.0.1:7001/debug/status
curl -s http://127.0.0.1:7001/debug/strategy

Roadmap

Scheduler-side knowledge-oriented routing
Proxy-side dynamic injection strategy selection
KDN-based text and KVCache registration
OpenAI-compatible request forwarding
More deployment examples
Benchmark scripts and reproducible evaluation
More KV cache placement policies
Paper and citation release

Documentation

| Document | Description |
| --- | --- |
| `env/README.md` | Environment setup and vLLM + LMCache installation. |
| `kdn_server/README.md` | KDN server, knowledge registration, and KVCache injection. |
| `core/README.md` | Core configuration and multi-machine setup. |
| `scheduler/README.md` | Scheduler parameters and routing strategies. |
| `doc/blog` | Development logs and update notes. |
| `doc/integrations/lmcache.md` | CacheRoute integration with LMCache. |
