Project dossier

kv-agent-lab

A lab for key-value state and durable memory patterns in agent systems.

Python · 0 stars / 0 forks · Active Apr 28, 2026 · GitHub

What this repository answers

kv-agent-lab KV-cache-aware agent serving lab for shared-prompt, multi-session workloads. Chinese README: README.zh-CN.md Chinese learning guides: docs/code-walkthrough.zh-CN.md, docs/design-principles.zh-CN.md, docs/learning-labs/lab-01-compare-prefix-reus...

Workflow replaced: Overcomplicated memory abstractions before basic state primitives are understood.

Why it matters: Durability often depends on state design more than model choice. This project keeps attention on that lower layer.

How it gets used

Explore durable state patterns for agents before betting on a heavier architecture.
Clarify where persistence should live in an AI workflow.

README

kv-agent-lab

KV-cache-aware agent serving lab for shared-prompt, multi-session workloads.

Chinese README:

README.zh-CN.md
Chinese learning guides: docs/code-walkthrough.zh-CN.md, docs/design-principles.zh-CN.md, docs/learning-labs/lab-01-compare-prefix-reuse.zh-CN.md

This project is designed to study KV cache as a cross-layer systems problem:

model-layer KV cache growth
serving-layer prefix reuse and offloading
agent-layer planner and sub-agent prompt patterns
session continuity and observability

The goal is not just better benchmark numbers. The goal is to build a realistic
story about how KV cache affects production AI systems and agent platforms.

Reading Order

Order	Start Here	What to Focus On	Related Files
1	Project focus	Understand why KV cache is treated as a model, serving, and agent systems problem	`README.md`, `README.zh-CN.md`
2	Reading guide	Build the vocabulary before looking at reports or benchmark output	`docs/reading-guide.en.md`, `docs/glossary.en.md`
3	Workload model	See why shared prompts, planners, sub-agents, and sessions are the main workload shapes	`docs/workload-model.md`, `workloads/`
4	Synthetic harness	Run baseline and prefix-reuse reports locally without a GPU	`scripts/run_workload.py`, `scripts/compare_results.py`
5	vLLM integration	Understand how the same workload shape maps to a live OpenAI-compatible server	`docs/vllm-integration.md`, `scripts/run_live_workload.py`
6	Remote GPU runbook	Move from local workflow validation to real Linux + NVIDIA GPU runs	`docs/vllm-setup.md`, `docs/remote-linux-gpu-runbook.md`
7	Policy and reporting	Connect retention, eviction, offloading, and report artifacts into a platform story	`docs/kv-policy.md`, `reports/`, `charts/`

Focus

The lab focuses on four workload shapes:

Shared system prompt chat
Planner-style requests
Sub-agent role templates
Session continuity

And three optimization modes:

Baseline
Prefix reuse
CPU offloading

Initial Roadmap

Phase 1

Define realistic workloads
Run baseline synthetic benchmark harness
Capture TTFT and throughput estimates

Phase 2

Add prefix reuse and session continuity analysis
Compare workload classes instead of only aggregate metrics

Phase 3

Add CPU offloading scenarios
Add policy notes for retention and eviction

Phase 4

Add dashboard-ready reports
Turn the results into platform and interview assets

Repo Structure

kv-agent-lab/
├── docs/
├── experiments/
├── scripts/
├── workloads/
├── reports/
├── charts/
└── src/kv_agent_lab/

Reading Materials

Quick Start

Create a baseline report:

cd /Volumes/ExtaData/newcode/kv-agent-lab
python3 scripts/run_workload.py \
  --workload workloads/planner.jsonl \
  --mode baseline \
  --report reports/planner-baseline.json

Create a prefix reuse report:

python3 scripts/run_workload.py \
  --workload workloads/planner.jsonl \
  --mode prefix_reuse \
  --report reports/planner-prefix-reuse.json

Compare two reports:

python3 scripts/compare_results.py \
  --baseline reports/planner-baseline.json \
  --candidate reports/planner-prefix-reuse.json

Run the same workload against a live vLLM server:

MODEL=Qwen/Qwen2.5-3B-Instruct ./scripts/run_server.sh
python3 scripts/run_live_workload.py \
  --workload workloads/planner.jsonl \
  --report reports/planner-live.json \
  --endpoint http://127.0.0.1:8000/v1 \
  --model Qwen/Qwen2.5-3B-Instruct

What Exists Today

The current harness is intentionally lightweight. It does not run a live model
yet. It gives a clean place to:

define workload classes
estimate KV-cache-sensitive tradeoffs
build reports and comparison habits

This keeps the first milestone small while preserving the long-term direction.

Next Milestone

The next implementation milestone is to plug a real serving backend into the
same workload and reporting shape, starting with vLLM.

See:

docs/vllm-integration.md
docs/vllm-setup.md
docs/remote-linux-gpu-runbook.md

Current Environment Reality

This repo is currently being developed on:

macOS
Apple Silicon
no NVIDIA GPU

So the near-term practical split is:

use local development for workflow and report validation
use remote Linux GPU for real KV-cache benchmark runs