kv-agent-lab
KV-cache-aware agent serving lab for shared-prompt, multi-session workloads.
- Chinese README: README.zh-CN.md
- Chinese learning guides: docs/code-walkthrough.zh-CN.md, docs/design-principles.zh-CN.md, docs/learning-labs/lab-01-compare-prefix-reuse.zh-CN.md
This project is designed to study KV cache as a cross-layer systems problem:
- model-layer KV cache growth
- serving-layer prefix reuse and offloading
- agent-layer planner and sub-agent prompt patterns
- session continuity and observability
The goal is not just better benchmark numbers. The goal is to build a realistic
story about how KV cache affects production AI systems and agent platforms.
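To make the model-layer growth concrete, here is a minimal sketch of the standard KV cache size estimate: two tensors (key and value) per layer, per KV head, per token. The `ModelShape` class and the example dimensions are illustrative assumptions, not values taken from this repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Illustrative model dimensions (assumed for this example)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.dtype_bytes


def kv_bytes(shape: ModelShape, tokens_per_session: int, sessions: int) -> int:
    # KV cache grows linearly with both context length and concurrent sessions.
    return kv_bytes_per_token(shape) * tokens_per_session * sessions


shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
per_token = kv_bytes_per_token(shape)
total = kv_bytes(shape, tokens_per_session=4096, sessions=16)
print(f"{per_token} bytes per token, {total / 2**30:.1f} GiB total")
```

With these assumed dimensions, each token costs 128 KiB of KV cache, and 16 sessions of 4096 tokens each need 8 GiB before any reuse or offloading is applied.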
Reading Order
| Order | Start Here | What to Focus On | Related Files |
|---|---|---|---|
| 1 | Project focus | Understand why KV cache is treated as a model, serving, and agent systems problem | README.md, README.zh-CN.md |
| 2 | Reading guide | Build the vocabulary before looking at reports or benchmark output | docs/reading-guide.en.md, docs/glossary.en.md |
| 3 | Workload model | See why shared prompts, planners, sub-agents, and sessions are the main workload shapes | docs/workload-model.md, workloads/ |
| 4 | Synthetic harness | Run baseline and prefix-reuse reports locally without a GPU | scripts/run_workload.py, scripts/compare_results.py |
| 5 | vLLM integration | Understand how the same workload shape maps to a live OpenAI-compatible server | docs/vllm-integration.md, scripts/run_live_workload.py |
| 6 | Remote GPU runbook | Move from local workflow validation to real Linux + NVIDIA GPU runs | docs/vllm-setup.md, docs/remote-linux-gpu-runbook.md |
| 7 | Policy and reporting | Connect retention, eviction, offloading, and report artifacts into a platform story | docs/kv-policy.md, reports/, charts/ |
Focus
The lab focuses on four workload shapes:
- Shared system prompt chat
- Planner-style requests
- Sub-agent role templates
- Session continuity
And three optimization modes:
- Baseline
- Prefix reuse
- CPU offloading
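As a rough illustration of why prefix reuse matters for shared-prompt workloads, the sketch below counts how many tokens must be prefilled in each mode. The `Request` class and its fields are hypothetical and do not reflect the repo's workload schema; the mode names mirror the `--mode` values used in Quick Start. It assumes an unbounded cache and exact prefix matches, so it gives an upper bound on the savings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prefix_id: str      # hypothetical key identifying the shared prompt
    prefix_tokens: int  # tokens in the shared prefix
    suffix_tokens: int  # tokens unique to this request


def prefill_tokens(requests: list[Request], mode: str) -> int:
    """Count tokens that must be prefilled under the given mode."""
    if mode not in {"baseline", "prefix_reuse"}:
        raise ValueError(f"unknown mode: {mode}")
    seen: set[str] = set()
    total = 0
    for req in requests:
        # A prefix is only reusable after some earlier request has computed it.
        cached = mode == "prefix_reuse" and req.prefix_id in seen
        total += req.suffix_tokens + (0 if cached else req.prefix_tokens)
        seen.add(req.prefix_id)
    return total


# Four planner-style requests sharing one 1200-token system prompt.
planner = [Request("planner-v1", prefix_tokens=1200, suffix_tokens=150) for _ in range(4)]
print(prefill_tokens(planner, "baseline"))      # 5400
print(prefill_tokens(planner, "prefix_reuse"))  # 1800
```

In this toy case the shared prefix is prefilled once instead of four times, which is the effect the prefix-reuse mode is meant to expose per workload class.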
Initial Roadmap
Phase 1
- Define realistic workloads
- Run baseline synthetic benchmark harness
- Capture TTFT and throughput estimates
Phase 2
- Add prefix reuse and session continuity analysis
- Compare workload classes instead of only aggregate metrics
Phase 3
- Add CPU offloading scenarios
- Add policy notes for retention and eviction
Phase 4
- Add dashboard-ready reports
- Turn the results into platform and interview assets
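The Phase 3 items (CPU offloading, retention, eviction) can be pictured with a toy two-tier tracker: least-recently-used entries leave the GPU tier and spill into a bounded CPU tier instead of being dropped immediately. This is a sketch under simplifying assumptions (capacity counted in entries rather than bytes, one entry per session or prefix); the class name `TwoTierKVCache` is hypothetical and is not part of this repo or of vLLM.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy tracker: LRU on the GPU tier, evictions spill to a CPU tier."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int) -> None:
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        # OrderedDicts used as ordered sets: oldest entry first.
        self.gpu = OrderedDict()
        self.cpu = OrderedDict()

    def access(self, key: str) -> str:
        """Return 'gpu_hit', 'cpu_hit', or 'miss' and update both tiers."""
        if key in self.gpu:
            self.gpu.move_to_end(key)  # refresh recency
            return "gpu_hit"
        outcome = "cpu_hit" if key in self.cpu else "miss"
        self.cpu.pop(key, None)  # promote from CPU if it was offloaded
        self.gpu[key] = None
        self._spill()
        return outcome

    def _spill(self) -> None:
        # Offload the least recently used GPU entries to the CPU tier.
        while len(self.gpu) > self.gpu_capacity:
            old_key, _ = self.gpu.popitem(last=False)
            self.cpu[old_key] = None
            # The CPU tier is bounded too; its oldest entries are dropped.
            while len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)


cache = TwoTierKVCache(gpu_capacity=2, cpu_capacity=1)
for session in ["a", "b", "c", "a", "b", "d", "c"]:
    print(session, cache.access(session))
```

A `cpu_hit` stands for a session whose KV state could be restored from host memory rather than recomputed; a `miss` stands for a full prefill. Real policies would also weigh entry size and transfer cost, which this sketch ignores.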
Repo Structure
kv-agent-lab/
├── docs/
├── experiments/
├── scripts/
├── workloads/
├── reports/
├── charts/
└── src/kv_agent_lab/
Reading Materials
Recommended bilingual reading support:
- docs/reading-guide.en.md
- docs/reading-guide.zh-CN.md
- docs/glossary.en.md
- docs/glossary.zh-CN.md
Quick Start
Create a baseline report:

    cd /Volumes/ExtaData/newcode/kv-agent-lab
    python3 scripts/run_workload.py \
      --workload workloads/planner.jsonl \
      --mode baseline \
      --report reports/planner-baseline.json

Create a prefix reuse report:

    python3 scripts/run_workload.py \
      --workload workloads/planner.jsonl \
      --mode prefix_reuse \
      --report reports/planner-prefix-reuse.json

Compare two reports:

    python3 scripts/compare_results.py \
      --baseline reports/planner-baseline.json \
      --candidate reports/planner-prefix-reuse.json

Run the same workload against a live vLLM server:

    MODEL=Qwen/Qwen2.5-3B-Instruct ./scripts/run_server.sh
    python3 scripts/run_live_workload.py \
      --workload workloads/planner.jsonl \
      --report reports/planner-live.json \
      --endpoint http://127.0.0.1:8000/v1 \
      --model Qwen/Qwen2.5-3B-Instruct

What Exists Today
The current harness is intentionally lightweight. It does not run a live model
yet. It gives a clean place to:
- define workload classes
- estimate KV-cache-sensitive tradeoffs
- build reports and comparison habits
This keeps the first milestone small while preserving the long-term direction.
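As an illustration of the comparison habit, the sketch below computes the relative change for every numeric metric that two reports share. The field names (`ttft_ms`, `throughput_tokens_per_s`) and all numbers are invented for this example; they are not the schema or output of `scripts/compare_results.py`.

```python
import json


def compare_reports(baseline: dict, candidate: dict) -> dict:
    """Percent change (candidate vs. baseline) for shared numeric metrics."""
    deltas = {}
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(base_value, bool) or isinstance(cand_value, bool):
            continue
        if not isinstance(base_value, (int, float)):
            continue
        if not isinstance(cand_value, (int, float)):
            continue
        if base_value == 0:
            continue  # relative change is undefined for a zero baseline
        deltas[name] = round(100.0 * (cand_value - base_value) / base_value, 2)
    return deltas


# Invented numbers, for illustration only.
baseline = {"mode": "baseline", "ttft_ms": 420.0, "throughput_tokens_per_s": 950.0}
candidate = {"mode": "prefix_reuse", "ttft_ms": 180.0, "throughput_tokens_per_s": 1400.0}
print(json.dumps(compare_reports(baseline, candidate), indent=2))
```

Reporting percent deltas per metric, rather than a single aggregate score, makes it easier to see when a mode improves TTFT for one workload class while leaving throughput unchanged.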
Next Milestone
The next implementation milestone is to plug a real serving backend into the
same workload and reporting shape, starting with vLLM.
See:
- docs/vllm-integration.md
- docs/vllm-setup.md
- docs/remote-linux-gpu-runbook.md
Current Environment Reality
This repo is currently being developed on:
- macOS
- Apple Silicon
- no NVIDIA GPU
So the near-term practical split is:
- use local development for workflow and report validation
- use remote Linux GPU for real KV-cache benchmark runs
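One way to encode this split is a small guard that picks the synthetic path unless the host looks like a Linux machine with NVIDIA tooling. This is a heuristic sketch: the function names and the `"live"` / `"synthetic"` labels are hypothetical, and finding `nvidia-smi` on `PATH` only suggests, but does not guarantee, a usable GPU.

```python
import platform
import shutil


def pick_run_target(system: str, has_nvidia_smi: bool) -> str:
    """Choose 'live' only for Linux hosts that expose nvidia-smi."""
    if system == "Linux" and has_nvidia_smi:
        return "live"
    return "synthetic"


def detect_run_target() -> str:
    # platform.system() returns e.g. "Linux" or "Darwin" (macOS).
    return pick_run_target(platform.system(), shutil.which("nvidia-smi") is not None)


print(detect_run_target())
```

Keeping the decision in a pure function (`pick_run_target`) makes it easy to check without access to a GPU host.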