Brad Zhang / public archive

Brad Zhang

AI product notes, agent workflow writing, and open-source dossiers for founders and early technical teams.

Open source/kv-agent-lab

Project dossier / README

kv-agent-lab

ccc7574/kv-agent-lab

A lab for key-value state and durable memory patterns in agent systems.

Durability often depends on state design more than model choice. This project keeps attention on that lower layer.

Repository reading order

Understand the workflow it replaces before deciding whether to stay in the README, open a topic path, or move toward collaboration.

01 / Workflow

Overcomplicated memory abstractions before basic state primitives are understood.

Start by naming the old workflow this repo is replacing.

02 / README

kv-agent-lab KV-cache-aware agent serving lab for shared-prompt, multi-session workloads. Chinese README: README.zh-CN.md Chinese learning guides: docs/code-walkthrough.zh-CN.md, docs/design-principles.zh-CN.md, docs/learning-labs/lab-01-compare-prefix-reus...

Then use the README and site interpretation to confirm the project boundary and delivery surface.

03 / Proof graph

Harness engineering for agentic AI products

Finally, place the repository back into a bigger market question and working conversation.

github.com/ccc7574/kv-agent-lab

kv-agent-lab

KV-cache-aware agent serving lab for shared-prompt, multi-session workloads.

Chinese README:

  • README.zh-CN.md
  • Chinese learning guides: docs/code-walkthrough.zh-CN.md, docs/design-principles.zh-CN.md, docs/learning-labs/lab-01-compare-prefix-reuse.zh-CN.md

This project is designed to study KV cache as a cross-layer systems problem:

  • model-layer KV cache growth
  • serving-layer prefix reuse and offloading
  • agent-layer planner and sub-agent prompt patterns
  • session continuity and observability

The goal is not just better benchmark numbers. The goal is to build a realistic
story about how KV cache affects production AI systems and agent platforms.

Reading Order

Order Start Here What to Focus On Related Files
1 Project focus Understand why KV cache is treated as a model, serving, and agent systems problem README.md, README.zh-CN.md
2 Reading guide Build the vocabulary before looking at reports or benchmark output docs/reading-guide.en.md, docs/glossary.en.md
3 Workload model See why shared prompts, planners, sub-agents, and sessions are the main workload shapes docs/workload-model.md, workloads/
4 Synthetic harness Run baseline and prefix-reuse reports locally without a GPU scripts/run_workload.py, scripts/compare_results.py
5 vLLM integration Understand how the same workload shape maps to a live OpenAI-compatible server docs/vllm-integration.md, scripts/run_live_workload.py
6 Remote GPU runbook Move from local workflow validation to real Linux + NVIDIA GPU runs docs/vllm-setup.md, docs/remote-linux-gpu-runbook.md
7 Policy and reporting Connect retention, eviction, offloading, and report artifacts into a platform story docs/kv-policy.md, reports/, charts/

Focus

The lab focuses on four workload shapes:

  1. Shared system prompt chat
  2. Planner-style requests
  3. Sub-agent role templates
  4. Session continuity

And three optimization modes:

  1. Baseline
  2. Prefix reuse
  3. CPU offloading

Initial Roadmap

Phase 1

  • Define realistic workloads
  • Run baseline synthetic benchmark harness
  • Capture TTFT and throughput estimates

Phase 2

  • Add prefix reuse and session continuity analysis
  • Compare workload classes instead of only aggregate metrics

Phase 3

  • Add CPU offloading scenarios
  • Add policy notes for retention and eviction

Phase 4

  • Add dashboard-ready reports
  • Turn the results into platform and interview assets

Repo Structure

kv-agent-lab/
├── docs/
├── experiments/
├── scripts/
├── workloads/
├── reports/
├── charts/
└── src/kv_agent_lab/

Reading Materials

Recommended bilingual reading support:

  • docs/reading-guide.en.md
  • docs/reading-guide.zh-CN.md
  • docs/glossary.en.md
  • docs/glossary.zh-CN.md

Quick Start

Create a baseline report:

cd /Volumes/ExtaData/newcode/kv-agent-lab
python3 scripts/run_workload.py \
  --workload workloads/planner.jsonl \
  --mode baseline \
  --report reports/planner-baseline.json

Create a prefix reuse report:

python3 scripts/run_workload.py \
  --workload workloads/planner.jsonl \
  --mode prefix_reuse \
  --report reports/planner-prefix-reuse.json

Compare two reports:

python3 scripts/compare_results.py \
  --baseline reports/planner-baseline.json \
  --candidate reports/planner-prefix-reuse.json

Run the same workload against a live vLLM server:

MODEL=Qwen/Qwen2.5-3B-Instruct ./scripts/run_server.sh
python3 scripts/run_live_workload.py \
  --workload workloads/planner.jsonl \
  --report reports/planner-live.json \
  --endpoint http://127.0.0.1:8000/v1 \
  --model Qwen/Qwen2.5-3B-Instruct

What Exists Today

The current harness is intentionally lightweight. It does not run a live model
yet. It gives a clean place to:

  • define workload classes
  • estimate KV-cache-sensitive tradeoffs
  • build reports and comparison habits

This keeps the first milestone small while preserving the long-term direction.

Next Milestone

The next implementation milestone is to plug a real serving backend into the
same workload and reporting shape, starting with vLLM.

See:

  • docs/vllm-integration.md
  • docs/vllm-setup.md
  • docs/remote-linux-gpu-runbook.md

Current Environment Reality

This repo is currently being developed on:

  • macOS
  • Apple Silicon
  • no NVIDIA GPU

So the near-term practical split is:

  • use local development for workflow and report validation
  • use remote Linux GPU for real KV-cache benchmark runs

Related clusters

A project page should also return to the bigger market question.

Commercial outcomes

Founding engineer fit

Design partner sprint

Developer-facing proof work