Prefill Once, Fan Out: KV Snapshot Sharing for Multi-Agent LLM Pipelines

A humorous-but-real tour of SwarmKV: KV-snapshot fan-out, copy-on-fork host buffers, and how to make a two-agent analytical pipeline ~1.95× faster (and the second branch’s activation latency ~52× faster) by being mildly mean to llama.cpp.

Part 1 of the “Production-Grade Agentic Inference” series. Every part removes one kind of redundant work from an agentic LLM pipeline. Part 1 (this post) kills redundant prefill. Part 2 tackles redundant waiting: how 50 micro-agents share one GPU via time-slicing. Part 3 keeps RAG retrieval on the GPU with a custom CUDA Top-K kernel. Part 4 persists agent state across hand-offs so the next agent never has the cold-start problem.

Key takeaways

The problem: when multiple agents read the same document, a default serving stack makes each one rerun the exact same prefill. That redundant dense attention pass is pure waste.

The fix: run prefill once, serialize the KV cache to a host buffer, memcpy it per branch, and restore it before decoding. “Compute once, fan out.”

The receipts: on a seven-year-old GTX 1080, a two-agent pipeline cut its end-to-end wall clock by 48.69% (~1.95×) and the second agent’s activation latency dropped 98.09% (~52×), eliminating 8,685 ms of redundant compute.

The kicker: this isn’t a new algorithm. It’s systems engineering, and it’s the same “broadcast shared state once” decision a 5G cell tower has made every 80 ms since LTE.

TL;DR: Standard LLM serving makes every analytical agent re-prefill the same shared document. Your GPU dutifully re-executes billions of redundant prefix-prefill multiplications. The same bytes. The same weights. The same quantization. All to recalculate a state it already finished calculating four seconds ago. SwarmKV runs the prefill once, serializes the resulting KV state to a host buffer via llama_state_get_data, memcpys that buffer into a per-branch allocation, and lets each branch restore the snapshot with llama_state_seq_set_data before decoding from where the document left off. Yes, it’s a real round trip (serialize, copy, restore), but because the attention part of redundant prefill compute scales quadratically with document length while the KV state transfer scales linearly, moving the data across a constrained Pascal memory bus is still vastly cheaper than recalculating the attention matrices from scratch. That shows in the result on a seven-year-old GTX 1080: a 48.69% end-to-end wall-clock reduction on a two-agent pipeline (~1.95×), a 98.09% reduction in branch-2 activation latency (~52×), 8,685 ms of redundant dense compute eliminated, zero new transformer tricks. Just the systems-engineering position that “compute once, fan out” beats “compute N times, hope nobody notices.”

GitHub repo: https://github.com/AnubhabBanerjee/swarmkv

(Quick confession before we start: I came at this from a 5G/6G RAN engineering background. As it turns out, fanning out a shared computation to many downstream consumers is shockingly close to what a cell tower has been doing every 80 ms since LTE when it broadcasts SIB1. There’s a whole section on that below, section 8, but it’s also why I’m writing this in the first place.)

Architecture mental model: keep this open while you read.

Document → PrefillNode → llama_state_get_data → host KV buffer → memcpy per branch → llama_state_seq_set_data → AnalyticalNode decode (RoPE continues at prefix_seq_len)

Everything below is just commentary on one part of that line.

1. A confession: most of your second agent’s “work” is a rerun

If you have ever pointed two analytical agents at the same document via vanilla llama.cpp, here is what actually happens (with a little bit of intentional dramatization):

You: “Please provide an overview of this 3,500-token spec, and separately, list its license obligations.”

llama.cpp (Agent 1): “Sure. Loading model. Prefilling document. Decoding answer.”

GPU spends 4,346 ms on dense attention

llama.cpp (Agent 1): “Done. Here’s a 6-token answer.”

You: “Great. Now Agent 2.”

llama.cpp (Agent 2): “Sure. Loading model. Prefilling document...”

You: “Wait, you literally just did that.”

llama.cpp (Agent 2): “I’m an independent llama_context. I have no memory of Agent 1. I have no memory of anything. I am a beautiful, stateless newborn.” 🫡

GPU spends another 4,339 ms on bit-for-bit identical attention math.

Your GPU thermal sensor: gets a good workout;
Your AWS bill: develops a sense of humor;
Your second agent’s TTFT: 4.3 seconds before it can answer a 4-token question.

That’s the joke. That’s the dirty secret of every “agentic” pipeline that fans out from a shared document. Each branch starts from a blank slate and rebuilds the same KV cache the previous branch just finished building. The longer the document, the worse the tax. At 3,500 tokens on a Pascal GPU, the vast majority of the second agent’s perceived latency isn’t the answer; it’s reading the document again.

SwarmKV is what happens when you decide the second reading is optional and you would rather write 1,500 lines of C++ than let each agent build the same KV cache over and over.

Now consider: the toy demo in this repo is about two agents over a summary/license check. The real shape of the workload it’s built for is N specialized evaluators over one dense technical document. Picture an AI patent and prior-art pipeline: one 50,000-token technical specification at the root, and fifty concurrent branches evaluating novelty, mapping claims, retrieving prior art, checking freedom-to-operate, assessing ethical compliance, and translating into jurisdiction-specific language. The baseline cost of that pipeline on a default serving stack is fifty full prefills of the same spec. The SwarmKV cost is one prefill plus fifty memcpys. That asymmetry is intentional, and it is the entire reason the repo exists. I’ve written separately about detecting AI in invention reporting; this is the infrastructure half of that work. Inventor’s-notebook problems are exactly what SwarmKV is built around.
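To put rough numbers on that asymmetry, here is a minimal back-of-the-envelope sketch. The struct and function names are hypothetical (they are not from the repo), it counts prefill-side cost only, and it assumes branches run one after another so their costs simply add. The sample inputs in the note below reuse the 3,501-token timings reported later in this article; a 50,000-token spec would have very different constants, so treat the output as a shape, not a prediction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost model; none of these names come from the SwarmKV repo.
struct FanoutCost {
    std::int64_t baseline_ms;  // every branch pays a full document prefill
    std::int64_t swarmkv_ms;   // one prefill, then one cheap fork+restore+decode per branch
};

// Assumes sequential branches (costs add) and identical per-branch cost.
inline FanoutCost estimate_fanout_cost(std::int64_t n_branches,
                                       std::int64_t prefill_ms,
                                       std::int64_t branch_ms) {
    FanoutCost cost{};
    cost.baseline_ms = n_branches * prefill_ms;
    cost.swarmkv_ms  = prefill_ms + n_branches * branch_ms;
    return cost;
}
```

Calling estimate_fanout_cost(50, 4339, 83) under these assumptions gives 216,950 ms for the baseline against 8,489 ms for the fan-out path; the gap widens with every extra branch because only the small per-branch term repeats.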

2. Why does prefill exist at all? (a one-minute crash course)

Skip this if you already know. For everyone else, here is the short version.

An autoregressive LLM serves a request in two phases. Prefill is the dense pass that pushes every prompt token through every transformer layer once and populates the per-layer key/value (KV) cache. Decode then runs token by token, attending to the prefilled KV cache and growing it incrementally.

Prefill cost grows at least linearly with prompt length, and its attention term grows quadratically. Decode, by comparison, is cheap per token. On a Pascal-class GTX 1080 running Qwen2.5-7B Q4_K_M, prefilling a ~3,500-token document takes about 4.3 seconds; decoding a short branch prompt takes tens of milliseconds (about 77 ms in the runs reported in section 5), since it is dominated by setup, not arithmetic. That time difference between prefill and decode is exactly the leverage SwarmKV uses.

Mainstream serving stacks (vLLM, TGI, SGLang, llama.cpp’s own server) treat every request as an independent context. Some of them have prefix caching, but it is usually request-scoped or session-scoped, not graph-scoped. They are built to maximize throughput across many independent user prompts, not to share state within one analytical pipeline that fans out from a single shared document. For that DAG-shaped workload (one root, many leaves, same data), every public stack I tried made me pay for the root once per leaf.

SwarmKV serves as the explicit orchestration layer, written in C++ to bypass runtime abstractions, guarantee deterministic pointer lifetimes, and get hardware-level memcpy efficiency.

3. The “just snapshot the KV” lightbulb (and why it’s harder than it sounds)

The pitch is simple:

1. Run prefill once on the shared document under sequence id kSwarmkvPrefixSeqId.
2. Serialize the resulting KV state into a host buffer via llama_state_get_data.
3. For each downstream branch, memcpy that buffer into a per-branch allocation.
4. Spin up a fresh llama_context, call llama_state_seq_set_data to install the snapshot, then decode the branch prompt with RoPE positions continuing from prefix_seq_len.

This is the “compute once, fan out” paradigm. The only reason it takes more than a 30-line llama.cpp patch is that three tedious edge cases immediately break the naive approach. The concept is beautifully simple and should be an easy weekend project, but low-level hardware and systems realities turn the implementation into a real engineering challenge.

Problem A: How big is the KV?

An easy answer: n_layers × n_head_kv × n_ctx × head_dim × sizeof(dtype) × 2. The trouble is that this hand-derived number drifts every time the quantization format changes, every time the GQA ratio changes, or every time the engine adds a new state field. The only honest number is the one the engine tells you under the current build.
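For a feel of what the hand-derived formula produces, here is a small sketch. The shape values (28 layers, 4 KV heads, head_dim 128, a 2-byte f16 cache, n_ctx 4096) are my assumptions for a Qwen2.5-7B-class model, not numbers read from the repo, and the function name is illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Hand-derived KV footprint: one K and one V tensor per layer, KV head, and position.
// Every argument is an assumption the caller must keep in sync with the actual model.
constexpr std::size_t naive_kv_bytes(std::size_t n_layers, std::size_t n_head_kv,
                                     std::size_t n_ctx, std::size_t head_dim,
                                     std::size_t bytes_per_element) {
    return n_layers * n_head_kv * n_ctx * head_dim * bytes_per_element * 2;
}

// Assumed shape: 28 layers, 4 KV heads, n_ctx 4096, head_dim 128, f16 (2 bytes).
constexpr std::size_t kExampleKvBytes = naive_kv_bytes(28, 4, 4096, 128, 2);
static_assert(kExampleKvBytes == 234881024, "224 MiB under the assumed shape");
```

Change any one of those five inputs and the figure moves, and a serialized state blob carries more than the raw tensors anyway, so this is a ballpark at best.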

So MemoryPool spins up a disposable llama_context just to ask:

size_t MemoryPool::get_required_kv_size(uint32_t n_ctx) {
    // Start from library defaults so fields we don't care about stay sane.
    llama_context_params params = llama_context_default_params();
    params.n_ctx = n_ctx;

    // Construct a disposable context only to query the serialized state footprint.
    llama_context * ctx = llama_init_from_model(model_ref, params);
    if (!ctx) {
        throw std::runtime_error("MemoryPool::get_required_kv_size: llama_init_from_model failed.");
    }

    // Ask llama.cpp how many bytes a full state blob would occupy for this ctx.
    const size_t sz = llama_state_get_size(ctx);
    llama_free(ctx);

    // If the engine reports zero, fall back to a small non-zero allocation so tests
    // still exercise the registry without pretending we know exact tensor layouts.
    if (sz == 0) {
        return size_t{1} << 20;
    }
    return sz;
}

The logic here is simple: ask the engine politely instead of trusting a PDF or a mathematical formula. Spin up a context, ask for the size, and allocate exactly that much. A simple but very successful recipe.

Problem B: llama.cpp is a picky eater about concurrent decode

Under the pinned upstream llama.cpp revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe. The exact behaviour depends on the backend, the version, and the graph scheduler (in newer revisions or with isolated streams it may behave better), but in our setup the failure modes were one of: (a) crash, (b) corrupted KV, or (c) a ten-minute hang while you Google whether ggml has a thread-local arena yet. Spoiler: in the pinned upstream, not really.

The robust answer is to serialize the llama API surface at the boundary:

#include <mutex>

namespace swarmkv {

// llama.cpp CUDA paths are not safe for concurrent decode from multiple threads
// on one GPU without external serialization. All node execute() bodies must
// hold this mutex around llama_init / llama_decode / llama_free / state I/O.
inline std::mutex & llama_api_mutex() {
    static std::mutex m;
    return m;
}

} // namespace swarmkv

// RAII guard: constructing it takes the global lock, destruction releases it.
struct LlamaGuard {
    std::lock_guard<std::mutex> lock;
    LlamaGuard() : lock(swarmkv::llama_api_mutex()) {}
};

A simple 20-line header defines the entire concurrency policy. Every node’s execute() body holds this around llama_init_from_model / llama_decode / llama_state_seq_set_data / llama_free. The DAG-level concurrency is real (futures, dependencies, fanout); the GPU compute interleaves under a global lock. Pedants will correctly note this leaves perf on the floor compared to a hypothetical concurrent-decode upstream. Hold that thought: it’s the exact bottleneck Part 2 of this series goes after.
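If you want to convince yourself that a guard like this really keeps only one worker inside the engine at a time, here is a minimal stand-alone sketch. It uses no llama.cpp calls at all: the demo namespace, ApiGuard, and max_concurrent_in_guard are illustrative names of mine, and the guarded region is an empty stand-in for the real API window.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

namespace demo {

// Process-wide mutex, using the same function-local-static pattern as the header above.
inline std::mutex & api_mutex() {
    static std::mutex m;
    return m;
}

// RAII guard: the lock is held for exactly the lifetime of the object.
struct ApiGuard {
    std::lock_guard<std::mutex> lock;
    ApiGuard() : lock(api_mutex()) {}
};

// Runs n_threads workers that each enter the guarded region `iters` times and
// returns the largest number of workers ever observed inside it at once.
inline int max_concurrent_in_guard(int n_threads, int iters) {
    std::atomic<int> inside{0};
    std::atomic<int> max_seen{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                ApiGuard guard;  // stand-in for the engine call window
                const int now = ++inside;
                int prev = max_seen.load();
                while (prev < now && !max_seen.compare_exchange_weak(prev, now)) {
                }
                --inside;
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return max_seen.load();
}

} // namespace demo
```

With the guard in place the observed maximum stays at one no matter how many threads you launch; that is the whole policy, and also the whole cost.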

Problem C: There is no stable external KV bind API

The aesthetically perfect implementation would be to allocate one contiguous KV buffer, attach it to the new context directly, and skip the memcpy entirely. Upstream llama.cpp exposes llama_memory_t and graph decode paths, but the public header pinned in this repo doesn’t ship a stable, exported llama_kv_cache_bind-style symbol.

So SwarmKV does the next-best thing: it keeps the call site, names it honestly, and builds this path on top of llama_state_set_data instead.

void KVHandoff::bind_contiguous_cache(llama_context * ctx, ggml_backend_buffer_t cache) {
    // Validate arguments so misuse fails fast during bring-up and CI smoke runs.
    // A null context can't decode; a null cache handle is a configuration bug.
    if (!ctx || !cache) {
        throw std::invalid_argument("KVHandoff::bind_contiguous_cache: null context or buffer.");
    }

    // Explicitly mark both parameters as intentionally unused in this revision.
    // This prevents -Wunused-parameter warnings under strict warning flags.
    (void) ctx;
    (void) cache;

    // No stable bind call is issued here; see file-level comment above.
    // When upstream adds a supported attachment API, implement it only in this function.
}

I know, I know. It’s a function that does nothing. It has full argument validation, a docstring twice the length of the body, and a stable place in the call graph. It’s patiently waiting for the day upstream lets it actually do its job. I’ve written more honest code in my life, I just can’t remember when!

This is also the part where careful readers go “wait, if bind_contiguous_cache is a no-op, what’s the MemoryPool buffer even for?” Excellent question. It’s the staging area: the canonical buffer where PrefillNode writes its llama_state_get_data blob, and the source that each branch memcpys from. Decode itself uses the context’s internally managed KV. Pool buffer = host-side fan-out scratch; context KV = the engine’s own thing. Two memory regions, one snapshot, zero magic.

4. The five-step pipeline (the actually-cool part)

Step 0:  Validate document + max_branch + 128 ≤ n_ctx     (context_budget.h, fail-fast)
Step 1:  Build the DAG; DFS-check for cycles              (Orchestrator)
Step 2:  Spawn std::async workers; gate on futures        (Orchestrator)
Step 3:  Prefill once, serialize KV to host buffer        (PrefillNode + MemoryPool)
Step 4:  memcpy snapshot → branch buffer → decode         (AnalyticalNode + KVHandoff)

Let’s walk through each with the actual code. The snippets are kept short deliberately; the full files are tiny and worth reading.

Step 0: Fail-fast context budget

A few lines that save you from a 3 AM Slack message from your past self:

const int32_t required = prefix_tokens + max_branch + generation_headroom;
if (required > limit) {
    throw std::runtime_error(
        "Context budget exceeded: prefix_tokens=" + std::to_string(prefix_tokens) +
        " max_branch_tokens=" + std::to_string(max_branch) +
        " headroom=" + std::to_string(generation_headroom) +
        " required=" + std::to_string(required) + " n_ctx=" + std::to_string(limit));
}

This runs before any context is built, any pool buffer is allocated, or any GPU memory is touched. If you ask SwarmKV to prefill 4,000 tokens into an n_ctx=4096 context with two branches and 128 tokens of decode headroom, it tells you the math doesn’t work and goes to sleep. The kindest thing you can do for your future self is to reject impossible configurations before allocating the first byte.
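Here is that 4,000-token scenario as a stand-alone sketch. The function names and signatures below are hypothetical (the real check lives in context_budget.h and may look different), and the 64-token branch prompt is an assumed value purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-alone version of the budget check; the repo's signature may differ.
inline void validate_context_budget(std::int32_t prefix_tokens, std::int32_t max_branch,
                                    std::int32_t generation_headroom, std::int32_t n_ctx) {
    const std::int32_t required = prefix_tokens + max_branch + generation_headroom;
    if (required > n_ctx) {
        throw std::runtime_error("Context budget exceeded: required=" +
                                 std::to_string(required) +
                                 " n_ctx=" + std::to_string(n_ctx));
    }
}

// Convenience wrapper for callers that prefer a boolean over an exception.
inline bool budget_fits(std::int32_t prefix_tokens, std::int32_t max_branch,
                        std::int32_t generation_headroom, std::int32_t n_ctx) {
    try {
        validate_context_budget(prefix_tokens, max_branch, generation_headroom, n_ctx);
        return true;
    } catch (const std::runtime_error &) {
        return false;
    }
}
```

With a 64-token branch prompt, 4,000 + 64 + 128 = 4,192 does not fit in 4,096, while the article’s 3,501-token document needs 3,693 and passes.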

Step 1: DAG cycle detection

The orchestrator does a standard 3-color DFS on the dependency adjacency list:

// dfs lambda walks adjacency lists and throws when a back-edge indicates a cycle.
auto dfs = [&](auto self, const std::string & u) -> void {
    // Mark node u as currently on the recursion stack (visiting).
    state[u] = 1;
    // Explore all outgoing dependency edges from u to downstream nodes v.
    for (const auto & v : adj[u]) {
        // If v is visiting, we found a cycle u -> v and must abort pipeline setup.
        if (state[v] == 1) {
            // Throw with edge names so graph misconfiguration is easy to diagnose.
            throw std::runtime_error("Dependency cycle detected: " + u + " -> " + v);
        }
        // Recurse only when v has not been fully processed yet.
        if (state[v] == 0) {
            // Continue DFS from child node v.
            self(self, v);
        }
    }
    // Mark u as fully processed (the third color) so later visits skip it.
    state[u] = 2;
};

I know it’s boring, but trust me, it’s necessary. It’s the algorithmic equivalent of checking your shoelaces before running. Skip it once and your pipeline will spend the rest of its short life waiting on itself. The error message includes the offending edge, so you can find the typo without grepping.
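For anyone who wants to poke at the idea outside the orchestrator, here is a self-contained sketch of a three-color DFS over a name-keyed adjacency list. Graph, check_acyclic, and has_cycle are illustrative names of mine, not the repo’s API.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

// Throws std::runtime_error naming the offending edge if the graph has a cycle.
inline void check_acyclic(const Graph & adj) {
    // 0 = unvisited, 1 = on the recursion stack, 2 = fully processed.
    std::map<std::string, int> state;
    auto dfs = [&](auto && self, const std::string & u) -> void {
        state[u] = 1;
        const auto it = adj.find(u);
        if (it != adj.end()) {
            for (const auto & v : it->second) {
                if (state[v] == 1) {
                    throw std::runtime_error("Dependency cycle detected: " + u + " -> " + v);
                }
                if (state[v] == 0) {
                    self(self, v);
                }
            }
        }
        state[u] = 2;
    };
    // Start a DFS from every node that no earlier walk has reached.
    for (const auto & entry : adj) {
        if (state[entry.first] == 0) {
            dfs(dfs, entry.first);
        }
    }
}

// Boolean wrapper around the throwing check.
inline bool has_cycle(const Graph & adj) {
    try {
        check_acyclic(adj);
        return false;
    } catch (const std::runtime_error &) {
        return true;
    }
}
```

A prefill node fanning out to two branches passes; two nodes that depend on each other do not.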

Step 2: One `std::async` per node, gated on shared futures

worker_tasks.push_back(std::async(
            std::launch::async,
            [this, name, state, dependencies, &completion_promises, &completion_futures]() {
                // Read this node's watermark requirement once for dependency gating decisions.
                const int32_t req = nodes.at(name)->required_prefix_tokens();
                // Wait for each upstream dependency according to V2 watermark rules.
                for (const auto & dep_name : dependencies) {
                    // Resolve upstream node pointer for prefill provider detection.
                    ExecutionNode * dep = nodes.at(dep_name).get();
                    // If upstream is the prefiller and this branch uses watermark gating, wait on the watermark.
                    if (dep->is_prefill_provider() && req >= 0) {
                        // Block until PipelineState watermark >= required_prefix_tokens (speculative start).
                        state->wait_for_watermark(req);
                    } else {
                        // Otherwise preserve V1 behavior: wait until the upstream node thread completes.
                        completion_futures.at(dep_name).wait();
                    }
                }
                // Build llama_context_params with the orchestrator default n_ctx budget.
                llama_context_params params = llama_context_default_params();
                // Raise n_ctx to the SwarmKV default pipeline context for multi-k token documents.
                params.n_ctx = kSwarmkvDefaultPipelineCtx;
                // Package model/pool/name into OrchestratorContext for node execute().
                OrchestratorContext ctx = {
                    this->memory_pool->get_model(),
                    params,
                    this->memory_pool,
                    name.c_str(),
                };
                // Run node logic and fulfill the promise so dependents can proceed.
                try {
                    // Dispatch to the PrefillNode or AnalyticalNode implementation.
                    nodes.at(name)->execute(state, &ctx);
                    // Signal successful completion to shared_future waiters.
                    completion_promises.at(name).set_value();
                } catch (...) {
                    if (req > 0) {
                        state->signal_milestone_consumed(req);
                    }
                    try {
                        completion_promises.at(name).set_exception(std::current_exception());
                    } catch (...) {
                        // The promise may already be satisfied; ignore and rethrow the original error.
                    }
                    throw;
                }
            }));

One std::promise per node, with a std::shared_future so multiple downstream branches can wait on the same upstream completion without playing pass-the-future. The failure path always sets the exception, so dependents don’t wait forever. We have all debugged the alternative, and oh boy did we not enjoy it!

Notice what’s not in this loop: any logic about prefill, KV, or branches. The orchestrator doesn’t know what a PrefillNode is. It knows about names, edges, and promises. The node-specific work lives in execute() and is fully polymorphic behind the ExecutionNode virtual interface. Only one responsibility for each child, not overwhelming at all!
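The promise-plus-shared_future pattern is easy to try in isolation. The sketch below uses only the standard library; fan_out_completion is an illustrative helper of mine, not something from the repo, and the "upstream node" is reduced to a flag that decides whether it succeeds or throws.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <stdexcept>
#include <vector>

// Fans one upstream completion out to n_waiters downstream tasks.
// Returns how many waiters observed success; if the upstream fails,
// every waiter receives the exception and the count stays at zero.
inline int fan_out_completion(int n_waiters, bool upstream_fails) {
    std::promise<void> done;
    std::shared_future<void> done_future = done.get_future().share();

    std::vector<std::future<bool>> waiters;
    for (int i = 0; i < n_waiters; ++i) {
        waiters.push_back(std::async(std::launch::async, [done_future] {
            try {
                done_future.get();  // blocks until the upstream settles
                return true;
            } catch (const std::exception &) {
                return false;       // failure is delivered instead of a hang
            }
        }));
    }

    // Upstream "node": always settle the promise, on success or on failure.
    try {
        if (upstream_fails) {
            throw std::runtime_error("upstream node failed");
        }
        done.set_value();
    } catch (...) {
        done.set_exception(std::current_exception());
    }

    int ok = 0;
    for (auto & w : waiters) {
        ok += w.get() ? 1 : 0;
    }
    return ok;
}
```

Each waiter holds its own copy of the shared_future, which is what lets several branches block on a single upstream without handing one future around.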

Step 3: Prefill once, export KV

PrefillNode does four things in the following sequence:

1. Read the document text from examples/base_doc.txt.
2. Tokenize it (with the resize-on-negative-return llama idiom).
3. Decode the tokens in chunks bounded by llama_n_batch(lctx), on sequence lane kSwarmkvPrefixSeqId, with absolute RoPE positions matching the absolute token index:

// Absolute RoPE position equals the index in the full document token stream.
batch.pos[i] = cur + i;
// Each token belongs to exactly one sequence id list.
batch.n_seq_id[i] = 1;
// Bind all document tokens to the shared prefix sequence lane constant.
batch.seq_id[i][0] = kSwarmkvPrefixSeqId;
// Disable logits during prefill; every token keeps a zero flag here.
batch.logits[i] = 0;

4. Export the prefix-sequence KV into the canonical host buffer and stamp the watermark for the branches.

KVHandoff::bind_contiguous_cache(lctx, state->materialized_branch_buffer);
// Mark prefill_complete so branches using kSwarmkvWaitForPrefillComplete can proceed.
state->mark_prefill_complete();

That’s the whole point of the article in two lines. Everything else in this repo (the orchestrator, the LlamaGuard, the budget check, the documented no-op) exists to feed those two lines and to deliver their output to the branches in a single memcpy with no extra round trips.
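The chunking in item 3 is where absolute positions are easiest to get wrong, so here is a minimal sketch of just the bookkeeping. ChunkPlan, plan_prefill_chunks, and absolute_pos are illustrative names of mine, the struct is a stand-in rather than the llama_batch type, and the batch size of 512 used in the note below is an assumed value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for one decode call's worth of tokens; not the llama_batch type.
struct ChunkPlan {
    std::int32_t first_pos;  // absolute position of the chunk's first token
    std::int32_t n_tokens;   // tokens in this chunk, never more than n_batch
};

// Splits n_doc_tokens into chunks of at most n_batch tokens, keeping absolute positions.
inline std::vector<ChunkPlan> plan_prefill_chunks(std::int32_t n_doc_tokens,
                                                  std::int32_t n_batch) {
    std::vector<ChunkPlan> chunks;
    if (n_batch <= 0) {
        return chunks;  // nothing sensible to plan without a positive batch size
    }
    for (std::int32_t cur = 0; cur < n_doc_tokens; cur += n_batch) {
        const std::int32_t n = std::min(n_batch, n_doc_tokens - cur);
        chunks.push_back({cur, n});
    }
    return chunks;
}

// Position of token i inside a chunk, mirroring batch.pos[i] = cur + i.
inline std::int32_t absolute_pos(const ChunkPlan & chunk, std::int32_t i) {
    return chunk.first_pos + i;
}
```

With 3,501 tokens and an assumed batch of 512, the plan has seven chunks, the last one starts at position 3,072, and its final token lands on position 3,500, exactly where a single giant batch would have put it.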

Step 4: Branch decode under LlamaGuard

1. Allocate a per-branch buffer sized for the same n_ctx as the prefix:

// Allocate a branch buffer sized for n_ctx so later decode has headroom under the same blob policy.
branch_buf = ctx->memory_pool->allocate_branch_cache(static_cast<size_t>(ctx->ctx_params.n_ctx));

2. Copy the canonical snapshot into the branch buffer, a.k.a. the famous “copy-on-fork”:

// Full-prefill path copies from canonical staging into the branch allocation.
KVHandoff::materialize_branch_cache(
    state->materialized_branch_buffer,
    branch_buf,
    fork_kv_bytes);

…which, when you click through, becomes

std::memcpy(dst_ptr, src_ptr, ncopy);

That’s it. That’s “copy-on-fork at the storage layer”, which literally is memcpy. It’s the primitive. Everything fancy you have read about prefix sharing (RadixAttention’s reference counting, paged attention’s block table indirection) is sitting on top of the same idea: don’t recompute, copy the bytes.

3. Spin up a fresh llama_context and restore the snapshot:

// Restore only the prefix sequence lane so branch decode stays isolated on seq 0.
const size_t n = llama_state_seq_set_data(
    lctx,
    static_cast<const uint8_t *>(base),
    fork_kv_bytes,
    kSwarmkvPrefixSeqId);
// Verify llama consumed exactly the number of bytes we copied into the branch buffer.
if (n != fork_kv_bytes) {
    // Free the context before throwing to avoid leaking VRAM on failure paths.
    llama_free(lctx);
    // Throw with a clear message so operators can debug size mismatches quickly.
    throw std::runtime_error("AnalyticalNode: llama_state_seq_set_data size mismatch.");
}

4. Build a single llama_batch for the short branch prompt with RoPE positions continuing from where the prefix ended:

for (int i = 0; i < batch.n_tokens; ++i) {
    // Copy the i-th branch token id into the batch slot.
    batch.token[i] = tokens[static_cast<size_t>(i)];
    // Place branch tokens immediately after the forked prefix positions for correct RoPE.
    batch.pos[i] = static_cast<llama_pos>(fork_prefix_len) + static_cast<llama_pos>(i);
    // Each token participates in exactly one sequence id list entry.
    batch.n_seq_id[i] = 1;
    // Bind all branch tokens to the shared prefix sequence lane constant.
    batch.seq_id[i][0] = kSwarmkvPrefixSeqId;
    // Disable logits for all tokens except the last one in this branch step.
    batch.logits[i] = 0;
}

This is the bit everyone gets wrong the first time. If you forget the offset and start branch positions at zero, rotary embeddings silently go sideways and the model decodes from a position the prefix was never trained for. The symptom is confidently coherent nonsense. Welcome to the worst kind of debugging hell; please leave a tip on your way out.

5. Finally, one llama_decode. Then write a diagnostic string into PipelineState::node_outputs, record timings_ms[name], and free the batch and the context.

Three lines of business logic per branch. Two contexts in flight at decode time. One memcpy each. One global lock. Everything else is plumbing.
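The two mechanical pieces of this step, the bounded copy and the position offset, can be sketched without any engine at all. fork_snapshot and branch_positions below are illustrative helpers of mine operating on plain std::vector buffers; they are not the repo’s KVHandoff or llama_batch code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Copies the first n_bytes of the canonical snapshot into a branch buffer.
// Refuses to read or write past either buffer.
inline void fork_snapshot(const std::vector<std::uint8_t> & canonical,
                          std::vector<std::uint8_t> & branch, std::size_t n_bytes) {
    if (n_bytes > canonical.size() || n_bytes > branch.size()) {
        throw std::length_error("fork_snapshot: copy exceeds a buffer");
    }
    if (n_bytes > 0) {
        std::memcpy(branch.data(), canonical.data(), n_bytes);
    }
}

// Positions for the branch prompt: they continue right after the prefix
// instead of restarting at zero.
inline std::vector<std::int32_t> branch_positions(std::int32_t prefix_len,
                                                  std::int32_t n_branch_tokens) {
    std::vector<std::int32_t> pos;
    for (std::int32_t i = 0; i < n_branch_tokens; ++i) {
        pos.push_back(prefix_len + i);
    }
    return pos;
}
```

Because every branch owns its copy, scribbling on one branch buffer leaves the canonical snapshot and the other branches untouched, and a four-token prompt after a 3,501-token prefix occupies positions 3,501 through 3,504.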

5. The receipts (i.e., the numbers)

Now is the time to evaluate it against the baseline and see whether it was worth all this hassle. All numbers come from examples/example-run-results/.

Quick note on methodology before anyone reaches for the rocks: every comparison below runs the same model (Qwen2.5-7B-Instruct-Q4_K_M.gguf), the same document (a deterministic 3,501-token synthetic document generated by repeating “The quick brown fox jumps over the lazy dog. ” until the token target is hit; see examples/base_doc.txt), the same GPU (GTX 1080, 8 GiB, Pascal sm_61), the same n_ctx=4096, and the same dtype. Baseline = two sequential llama_context instances, each prefilling the full document then decoding its branch prompt. SwarmKV = PrefillNode once + two AnalyticalNode branches over the snapshot. Workload type: prefill-dominated document analysis (RAG-style), not autoregressive chat. Three trials run back-to-back with a GPU-idle wait between them; the best is selected by 2·TTFT_pct + E2E_pct.

One metric definition before the table, because it matters: we use “Branch-2 activation latency (TTFT proxy)”, not the textbook serving-literature “request-arrival → first-output-token” TTFT. We mean the time the second branch spends in branch-specific work: its activation latency after the shared prefill is amortized across all branches. In a fan-out pipeline the cost the downstream consumer perceives is exactly this number, because the upstream prefill is paid once for the whole pipeline by design. The baseline value for this metric is the redundant document prefill that the second llama_context is forced to redo before it can answer; the SwarmKV value is the fork + restore + short-prompt decode.

Headline: GTX 1080, Qwen2.5-7B Q4_K_M, 3,501-token document, two branches

Metric	Baseline (HF-style)	SwarmKV	Delta
End-to-end wall clock	10,275 ms	5,272 ms	−48.69 % (~1.95×)
Branch-2 activation latency (TTFT proxy)	4,339 ms	83 ms	−98.09 % (~52.3×)
Baseline Agent-1 prefill	4,346 ms	–	–
Baseline Agent-2 prefill	4,339 ms	–	–
SwarmKV per-branch decode (avg)	–	77 ms	–
Redundant prefill eliminated	–	–	8,685 ms

Translation: the baseline spent 4,339 ms of the second agent’s perceived latency redoing the dense attention pass it had just finished four seconds earlier on the same bytes. SwarmKV looks at that and says “what if we didn’t?” and ships an 83-millisecond answer. The cleanest single-number measurement of “how expensive was that prefill?” is just the ratio of those two timings; everything else in the branch is a rounding error.
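If you would like to re-derive the Delta column from the raw timings, the arithmetic is two one-liners. The helper names are mine; the inputs are the millisecond values from the table above.

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction going from `before` to `after` (same unit on both).
inline double reduction_pct(double before, double after) {
    return (before - after) / before * 100.0;
}

// Speedup factor: how many times shorter `after` is than `before`.
inline double speedup(double before, double after) {
    return before / after;
}
```

Plugging in the table: 10,275 ms to 5,272 ms is a 48.69% reduction (about 1.95×), 4,339 ms to 83 ms is a 98.09% reduction (about 52.3×), and the 8,685 ms figure equals the two baseline prefill timings added together (4,346 + 4,339).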

Where the per-branch ~83 ms goes

The thesis of this whole article rests on one inequality: per-branch restore + decode is far, far cheaper than a redundant document prefill. The harness measures this directly at the aggregate level: the per-branch wall clock (allocate + copy + restore + decode, end to end) is 71–83 ms depending on which branch we look at, against a redundant prefill cost of ~4,339 ms. A ~52× ratio at the aggregate level is what makes everything else in this article work.

For more results and numbers, I recommend looking directly at the example run report.

6. “OK, but how is this different from vLLM / prefix caching / SGLang RadixAttention?”

A very reasonable question, and worth answering directly, because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.

vLLM / continuous batching / paged attention. Optimized for multi-tenant decode-time serving: many concurrent requests at different decode steps, scheduling the next token across them under streaming load. Headline primitive: paged attention. Unit of work: a streaming firehose of independent user prompts.
TGI / vLLM prefix caching. Excellent if your shared prefix is request-scoped or session-scoped. Not designed to expose KV snapshots as first-class objects you can hand to a different llama_context running a different downstream task in the same process.
SGLang RadixAttention. Tree-shaped prefix sharing inside a serving runtime. It is the closest cousin, but it is a server, not a single-process orchestration primitive.
llama.cpp’s own state save/restore. Exists, per-context. SwarmKV is the pipeline-level glue: a DAG, a host-buffer arena sized by the engine itself, a memcpy fan-out, a LlamaGuard policy, and a documented no-op patiently awaiting an upstream bind API.

7. So… how do I actually try it?

Well, I already posted the GitHub link at the start of the article. If you have come this far down, please work hard one more time and scroll back up to the top.

Artifacts land under examples/example-run-results/: best_run.json, all_trials.csv, plots/*.png, and a narrative final_result.docx that walks through methodology and limitations.

Requirements: Linux, CUDA toolkit, an NVIDIA GPU (Pascal or newer; consumer or datacenter both work), a GGUF model that fits in your VRAM, and the patience to read a CMake file once.

8. Plot twist: this is just SIB broadcast in a transformer costume

I should probably confess at this point: I’m not a “GPU person” by training. I came up through telecom (5G NR, with a foot creeping firmly into 6G research), and I started looking at LLM inference infrastructure because every problem in this codebase felt strangely familiar.

One-sentence decoder ring for readers without a 3GPP background: in a 5G network, the cell tower doesn’t unicast network configuration to every phone individually. It broadcasts a small set of System Information Blocks (SIB1, SIB2, …) once on a shared channel, every phone in range reads the same broadcast, and per-user data rides on top of that shared context on a dedicated channel. The acronyms in the table below are just names for the channels and identifiers that separate “shared, computed once” from “unique per consumer”: MIB (Master Information Block, the very first thing every phone reads), PBCH and PDSCH (the shared broadcast and downlink data channels), HARQ (the receiver’s “keep what we already decoded, only re-send what was missing” retransmission protocol), and RNTI (the temporary ID that distinguishes one phone’s traffic from another’s). That distinction is the whole analogy.

Look at this side-by-side and tell me with a straight face these are different problems:

5G NR cell broadcast (on the gNB)	SwarmKV (on the GPU)
One MIB on PBCH per SS burst	One shared document tokenized once
Repeated SIBs (SIB1, SIB2, …) on PDSCH	Serialized KV snapshot in MemoryPool
Every camped UE in the cell reads the same SI	Every analytical branch reads the same snapshot
UE-specific dedicated PDSCH for unicast user data	Per-branch `llama_context` decoding the branch prompt
RNTI per UE distinguishes unicast streams	Per-branch buffer + sequence id distinguishes branch state
HARQ soft-buffer retained across retransmissions	KV snapshot retained across branches
Skip broadcast → every UE forces unicast SI → air interface melts	Skip snapshot → every branch re-prefills the document → GPU melts

A quick aside to two very different audiences

To my HPC and CUDA-first friends reading this: I know. KV reuse isn’t a new idea. vLLM has prefix caching, SGLang has RadixAttention, llama.cpp itself exposes state save/restore. SwarmKV’s contribution isn’t the primitive; it’s the single-process orchestration shape: a tiny C++ DAG runtime that exposes “prefill once, fan out N branches” as a first-class operation, sized for one 8 GiB consumer GPU, with the safety rails (LlamaGuard, swarmkv_validate_context_budget, the documented bind no-op) that a researcher actually needs to ship a demo on a Tuesday. Please put the pitchforks down.

To my telecom friends: if “KV cache” sounded like a foreign language until ten minutes ago, you aren’t behind; you’re early. For twenty years our world was FPGAs, ASICs, and PRBs. We optimized spectrum, not silicon. Then AI-RAN, NWDAF, NVIDIA Aerial, the AI-RAN Alliance, and the 3GPP Rel-20 study items all happened in roughly the same eighteen months, and the next decade of telecom careers now demands being bilingual between spectrum-world and GPU-world. The intuition translates cleanly. You have been fanning out shared computation to many users since the first CRS pilot. Same animal, just a new zoo.

9. Honest caveats (because the comments are coming)

If you came here to find what’s wrong with the project: congratulations, the project found its first reader. From the limitations section of final_result.docx and the inline comments in the source:

KV staging is host-side. MemoryPool allocates ggml_backend_buffer_t from the CPU device (ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU)). Branch decode still runs on the GPU; only the snapshot transit is host-staged via llama_state_get_data → memcpy → llama_state_seq_set_data. A device-aware materialize lives on the roadmap, blocked on the same upstream KV bind API that bind_contiguous_cache is waiting for.
Shared decode mutex (under the pinned upstream revision). LlamaGuard serializes every llama_* call from worker threads. Under the llama.cpp revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe. The exact behaviour depends on backend, version, and graph scheduling, but in our setup the conservative choice was a global lock. The DAG-level concurrency is real, but per-request GPU compute stays sequential. This is the single largest performance limitation in V1, and it’s exactly where Part 2 of this series picks up.
SwarmKV_Prefill_Ms reports 0. Known instrumentation gap in how OrchestratorContext::node_name is consumed inside PrefillNode. The prefill ran (you see its cost in End_To_End_Ms and the derived effective shared-prefill cost); it’s just not being keyed correctly into timings_ms. The effective shared prefill is calculated as SwarmKV_End_To_End_Ms − max(SwarmKV_AgentA_Ms, SwarmKV_AgentB_Ms) ≈ 5,189 ms. Reporting bug, not correctness bug. Logged.
Synthetic document. The benchmark builds a deterministic 3,501-token document by repeating “The quick brown fox jumps over the lazy dog. ” until the token target is hit. This isolates the performance signal from content effects and keeps trials reproducible bit-for-bit. Real documents will produce noisier per-trial absolute timings; the structural ratios will not move.
Single GPU class. All numbers in the report come from one Pascal-class GTX 1080. Newer GPUs (Ada, Hopper) prefill much faster, so the absolute ms numbers will shrink, but the structural ratio between full-prefill cost and short-decode cost (which is what SwarmKV exploits) does not.
bind_contiguous_cache is a documented no-op. Yes, still. Until upstream lands a stable external-KV attachment API, the function validates its arguments, casts them to void, and goes home.
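To make the host-staged transit from the first caveat concrete, here is a minimal sketch of just the memcpy fan-out step. It deliberately does not call llama.cpp: `Snapshot` and `fan_out` are hypothetical names used for illustration, and the snapshot is treated as an opaque byte buffer standing in for whatever the engine serialized. It is a sketch under those assumptions, not SwarmKV's actual MemoryPool code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a serialized KV snapshot held in host memory.
// In the real pipeline these bytes would come from the engine's state export.
using Snapshot = std::vector<std::uint8_t>;

// Copy one shared snapshot into n_branches independent host buffers, so each
// branch can restore (and later diverge from) its own private copy.
std::vector<Snapshot> fan_out(const Snapshot& shared, std::size_t n_branches) {
    std::vector<Snapshot> branches;
    branches.reserve(n_branches);
    for (std::size_t i = 0; i < n_branches; ++i) {
        Snapshot copy(shared.size());
        if (!shared.empty()) {
            std::memcpy(copy.data(), shared.data(), shared.size());
        }
        assert(copy.size() == shared.size());
        branches.push_back(std::move(copy));
    }
    return branches;
}
```

Each branch owns its bytes, so writing to one branch's buffer cannot disturb the shared snapshot or a sibling. The price is one linear copy per branch, which is the cost the article argues is far cheaper than repeating the prefill.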

But don’t worry, everything on this list is on the roadmap. None of it changes the headline result, though. The point of putting it in writing is that you should not have to dig for it, and the moment a benchmark blog post hides its caveats is the moment its numbers stop being trustworthy.

10. The V1 ceiling (and the setup for Part 2)

SwarmKV proves that you can stop re-prefilling. But if you reread caveat #2, you have already spotted the next ceiling: the GPU compute itself is still serialized.

Here is what actually happens on the wall clock. The DAG-level concurrency is real: branches are real std::async workers with real dependency gating. But every branch’s llama_decode runs inside LlamaGuard, a single global mutex. So while the orchestration fans out, the GPU work lines up single file. Two branches take turns. Fifty branches take fifty turns. The GPU is never truly shared; it is time-multiplexed by hand, one lock at a time, with no fairness guarantee and no way to measure who is starving whom.
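The single-file behaviour described above can be reproduced in a few lines. The sketch below is an assumption-laden toy, not SwarmKV's code: `g_engine_mutex` plays the role of LlamaGuard, a short sleep stands in for llama_decode, and `run_branches` and `SerializationReport` are hypothetical names. It launches real std::async workers and records the highest number of branches that were ever inside the guarded region at once.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for LlamaGuard: one process-wide mutex.
std::mutex g_engine_mutex;

struct SerializationReport {
    int branches_run;   // how many branches finished
    int max_in_guard;   // peak number of branches inside the guard at once
};

SerializationReport run_branches(int n_branches) {
    assert(n_branches >= 0);
    std::atomic<int> in_guard{0};
    std::atomic<int> max_in_guard{0};
    std::atomic<int> done{0};

    std::vector<std::future<void>> workers;
    for (int i = 0; i < n_branches; ++i) {
        workers.push_back(std::async(std::launch::async, [&] {
            // Everything below runs under the global lock.
            std::lock_guard<std::mutex> guard(g_engine_mutex);
            int now = ++in_guard;
            int seen = max_in_guard.load();
            while (now > seen && !max_in_guard.compare_exchange_weak(seen, now)) {
            }
            // Simulated decode work.
            std::this_thread::sleep_for(std::chrono::milliseconds(2));
            --in_guard;
            ++done;
        }));
    }
    for (auto& w : workers) {
        w.get();
    }
    SerializationReport report;
    report.branches_run = done.load();
    report.max_in_guard = max_in_guard.load();
    return report;
}
```

With the lock in place, `max_in_guard` never exceeds 1 regardless of how many workers are launched: the orchestration is concurrent, the guarded work is not. Total wall time therefore grows roughly linearly with the number of branches.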

That is fine for a two-agent demo. It falls apart the moment you run the workload SwarmKV is actually built for: 50 specialised micro-agents competing for one GPU. At that scale you stop caring about “did we avoid re-prefill” and start caring about questions a hand-rolled mutex cannot answer:

When 50 agents want the GPU at once, who goes first, and how do we make it fair?
What is the p50, p95, and p99 latency each agent sees while sharing one card?
How much jitter does contention add, and where does throughput collapse?
How do we slice GPU compute cycles on purpose instead of by accident?
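For readers who have not computed tail latencies before, the p50/p95/p99 figures in these questions are plain order statistics over per-agent latency samples. The sketch below uses the nearest-rank definition, which is one common convention among several; `percentile_ms` is a hypothetical helper name and the sample values in the usage note are made up for illustration, not taken from any SwarmKV run.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Requires 0 < p <= 100 and a
// non-empty sample set. Takes the vector by value so the caller's order is kept.
double percentile_ms(std::vector<double> samples, double p) {
    assert(!samples.empty());
    assert(p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * static_cast<double>(samples.size())));
    if (rank == 0) {
        rank = 1;
    }
    return samples[rank - 1];
}
```

Given ten illustrative samples of 10, 20, …, 100 ms, `percentile_ms(samples, 50)` returns 50 and `percentile_ms(samples, 95)` returns 100. With so few samples the tail percentiles collapse onto the maximum, which is why a profiler needs many trials per agent before p99 means anything.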

That is Part 2 of this series: Time-Slicing the GPU for Concurrent Agent Swarms. Toy agents run sequentially in Python. Production agents run concurrently on bare metal, and managing VRAM and compute when many micro-agents share one NVIDIA GPU is its own discipline. Part 2 builds a Kubernetes-level time-slice profiler that dynamically allocates compute cycles and measures p50/p95/p99 latency, jitter, and throughput proxies when agentic inference workloads share a GPU via the Kubernetes Device Plugin with CUDA time-slicing. The global mutex in SwarmKV is exactly the thing it replaces with something measurable.

(For the curious: there is a separate, orthogonal V1 limitation worth a future SwarmKV V2 post: the pipeline currently waits for the full prefill to finish before any branch starts, even if a branch only needs the first 500 tokens of context. Letting branches start the moment their required prefix slice is materialized is a real win, but it is its own story and its own benchmark. It is not Part 2. Part 2 is about sharing the GPU across many agents; that prefill-streaming idea is a follow-up.)

See you in Part 2.

Disclaimer: The illustrations in this article (the hero banner, the architecture diagram, the telecom-vs-SwarmKV split panel, and the GPU time-slicing image) were generated using AI (Claude Opus 4.8). They are illustrative, not photographic, and any labels visible inside the images are stylized rather than authoritative; refer to the article body and the code itself for precise function names, metric values, and architecture details.
