{"id":15568,"date":"2026-06-09T14:35:17","date_gmt":"2026-06-09T14:35:17","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=15568"},"modified":"2026-06-09T14:35:18","modified_gmt":"2026-06-09T14:35:18","slug":"prefill-as-soon-as-fan-out-kv-snapshot-sharing-for-multi-agent-llm-pipelines","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=15568","title":{"rendered":"Prefill Once, Fan Out: KV Snapshot Sharing for Multi-Agent LLM Pipelines"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p class=\"wp-block-paragraph\"><em>A humorous-but-real tour of SwarmKV: KV-snapshot fan-out, copy-on-fork host buffers, and how to make a two-agent analytical pipeline ~1.95\u00d7 faster (and the second branch\u2019s activation latency 52\u00d7 faster) by being mildly mean to llama.cpp.<\/em><\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"has-spindle-background-color has-background has-underline-2-font-size wp-block-paragraph\"><strong>Part 1 of the \u201cProduction-Grade Agentic Inference\u201d series<\/strong>. Each part removes one kind of redundant work from an agentic LLM pipeline. Part 1 (this post) kills redundant prefill. Part 2 tackles redundant waiting: how 50 micro-agents share one GPU via time-slicing. Part 3 keeps RAG retrieval on the GPU with a custom CUDA Top-K kernel. Part 4 persists agent state across hand-offs so the next agent never has the cold-start problem.<\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Key takeaways<\/h2>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<p class=\"wp-block-paragraph\"><strong>The problem:<\/strong> when multiple agents read the same document, a default serving stack makes each one rerun the <em>exact same<\/em> prefill. That redundant dense attention pass is pure waste. 
<\/p>\n<p class=\"wp-block-paragraph\"><strong>The fix:<\/strong> run prefill once, serialize the KV cache to a host buffer, <code>memcpy<\/code> it per branch, and restore it before decoding. \u201cCompute once, fan out.\u201d <\/p>\n<p class=\"wp-block-paragraph\"><strong>The receipts:<\/strong> on a seven-year-old GTX 1080, a two-agent pipeline got <em>48.69% faster end to end (~1.95\u00d7)<\/em> and the second agent\u2019s activation latency dropped <em>98.09% (~52\u00d7)<\/em>, eliminating <em>8,685 ms<\/em> of redundant compute. <\/p>\n<p class=\"wp-block-paragraph\"><strong>The kicker:<\/strong> this isn&#8217;t a new algorithm. It&#8217;s systems engineering, and it&#8217;s the same \u201cbroadcast shared state once\u201d decision a 5G cell tower has made every 80 ms since LTE.<\/p>\n<p class=\"wp-block-paragraph\"><strong>TL;DR<\/strong>: Standard LLM serving makes every analytical agent re-prefill the same shared document. Your GPU dutifully re-executes billions of redundant prefix-prefill multiplications. The same bytes. The same weights. The same quantization. All to recalculate a state it already finished calculating four seconds ago. SwarmKV runs the prefill <em>once<\/em>, serializes the resulting KV state to a host buffer via <code>llama_state_get_data<\/code>, <code>memcpy<\/code>s that buffer into a per-branch allocation, and lets each branch restore the snapshot with <code>llama_state_seq_set_data<\/code> before decoding from where the document left off. Yes, it&#8217;s a real round trip (serialize, copy, restore), but because redundant prefill compute scales quadratically while the KV state transfer scales linearly, moving the data across a constrained Pascal memory bus is still vastly cheaper than recalculating the attention matrices from scratch. 
That&#8217;s reflected in the result on a seven-year-old GTX 1080: 48.69% end-to-end speedup on a two-agent pipeline, <em>98.09% reduction in branch-2 activation latency (~52\u00d7)<\/em>, 8,685 ms of redundant dense compute eliminated, zero new transformer tricks. Just the systems-engineering position that \u201ccompute once, fan out\u201d beats \u201ccompute N times, hope nobody notices.\u201d<\/p>\n<figure class=\"wp-block-pullquote\">\n<blockquote>\n<p><strong>GitHub repo<\/strong>: <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/AnubhabBanerjee\/swarmkv\">https:\/\/github.com\/AnubhabBanerjee\/swarmkv<\/a><\/p>\n<\/blockquote>\n<\/figure>\n<p class=\"wp-block-paragraph\"><em>(Quick confession before we start: I came at this from a 5G\/6G RAN engineering background. As it turns out, fanning out a shared computation to many downstream consumers is shockingly close to what a cell tower has been doing every 80 ms since LTE when it broadcasts SIB1. There\u2019s a whole section on that below, section 8, but it\u2019s also why I\u2019m writing this in the first place.)<\/em><\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>Architecture mental model<\/strong>: keep this open while you read.<\/p>\n<p class=\"wp-block-paragraph\"><code>Document \u2192 PrefillNode \u2192 llama_state_get_data \u2192 host KV buffer \u2192 memcpy per branch \u2192 llama_state_seq_set_data \u2192 AnalyticalNode decode (RoPE continues at prefix_seq_len)<\/code><\/p>\n<p class=\"wp-block-paragraph\">Everything below is just commentary on one part of that line.<\/p>\n<\/blockquote>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/architecture_diagram-1024x683.png\" alt=\"\" class=\"wp-image-665042\"\/><figcaption 
class=\"wp-element-caption\">SwarmKV architectural overview<\/figcaption><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">1. A confession: most of your second agent\u2019s \u201cwork\u201d is a rerun<\/h2>\n<p class=\"wp-block-paragraph\">If you have ever pointed two analytical agents at the same document through vanilla llama.cpp, here&#8217;s what really happens (with a little bit of intentional dramatization):<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">You: <em>\u201cPlease provide an overview of this 3,500-token spec, and separately, list its license obligations.\u201d<\/em><\/p>\n<p class=\"wp-block-paragraph\">llama.cpp (Agent 1): <em>\u201cSure. Loading model. Prefilling document. Decoding answer.\u201d<\/em><\/p>\n<p class=\"wp-block-paragraph\">GPU spends 4,346 ms on dense attention<\/p>\n<p class=\"wp-block-paragraph\">llama.cpp (Agent 1): <em>\u201cDone. Here\u2019s a 6-token answer.\u201d<\/em><\/p>\n<p class=\"wp-block-paragraph\">You: <em>\u201cGreat. Now Agent 2.\u201d<\/em><\/p>\n<p class=\"wp-block-paragraph\">llama.cpp (Agent 2): <em>\u201cSure. Loading model. Prefilling document\u2026\u201d<\/em><\/p>\n<p class=\"wp-block-paragraph\">You: <em>\u201cWait, you literally just did that.\u201d<\/em><\/p>\n<p class=\"wp-block-paragraph\">llama.cpp (Agent 2): <em>\u201cI\u2019m an independent <code>llama_context<\/code>. I have no memory of Agent 1. I have no memory of anything. 
I&#8217;m a beautiful, stateless newborn.\u201d \ud83e\udee1<\/em><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">GPU spends another 4,339 ms on bit-for-bit identical attention math.<\/p>\n<p class=\"wp-block-paragraph\">Your GPU thermal sensor: gets a good workout; <br \/>Your AWS bill: <em>develops a sense of humor<\/em>;<br \/>Your second agent\u2019s TTFT: 4.3 seconds before it can answer a 4-token question.<\/p>\n<p class=\"wp-block-paragraph\">That&#8217;s the joke. That&#8217;s the dirty secret of every \u201cagentic\u201d pipeline that fans out from a shared document. Each branch starts from a blank slate and rebuilds the same KV cache the previous branch just finished building. The longer the document, the worse the tax. At 3,500 tokens on a Pascal GPU, the <em>vast<\/em> majority of the second agent\u2019s perceived latency isn&#8217;t the answer; it&#8217;s reading the document again.<\/p>\n<p class=\"wp-block-paragraph\">SwarmKV is what happens when you decide the second reading is optional and you&#8217;d rather write 1,500 lines of C++ than let each agent build the same KV cache over and over.<\/p>\n<p class=\"wp-block-paragraph\">Now, the toy demo in this repo is two agents over a summary\/license check. The real shape of the workload it&#8217;s built for is <em>N specialized evaluators over one dense technical document<\/em>. Picture an AI patent and prior-art pipeline: one 50,000-token technical specification at the root, and fifty concurrent branches evaluating novelty, mapping claims, retrieving prior art, checking freedom-to-operate, assessing ethical compliance, and translating into jurisdiction-specific language. The baseline cost of that pipeline on a default serving stack is <em>fifty full prefills<\/em> of the same spec. The SwarmKV cost is one prefill plus fifty <code>memcpy<\/code>s. 
That asymmetry is intentional, and it is the entire reason the repo exists. I&#8217;ve written separately about <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/medium.com\/@anbdwnroop.banerjee\/has-ai-entered-the-inventors-notebook-4766738e685c\" target=\"_blank\" rel=\"noreferrer noopener\">detecting AI in invention reporting<\/a>; this is the infrastructure half of that story. Inventor\u2019s-notebook problems are exactly what SwarmKV is built around.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">2. Why does prefill exist at all? (a one-minute crash course)<\/h2>\n<p class=\"wp-block-paragraph\">Skip this if you already know. For everyone else, here is the short version.<\/p>\n<p class=\"wp-block-paragraph\">An autoregressive LLM serves a request in two phases. <em>Prefill<\/em> is the dense pass that pushes every prompt token through every transformer layer once and populates the per-layer key\/value (KV) cache. <em>Decode<\/em> then runs token by token, attending to the prefilled KV cache and growing it incrementally.<\/p>\n<p class=\"wp-block-paragraph\">Prefill cost grows at least linearly with prompt length, and its attention term grows quadratically. Decode, by comparison, is cheap per token. On a Pascal-class GTX 1080 running Qwen2.5-7B Q4_K_M, prefilling a ~3,500-token document takes about <em>4.3 seconds<\/em>; decoding a short branch prompt takes hundreds of milliseconds, since it&#8217;s dominated by setup, not arithmetic. That time difference between prefill and decode is exactly the leverage SwarmKV uses.<\/p>\n<p class=\"wp-block-paragraph\">Mainstream serving stacks (vLLM, TGI, SGLang, llama.cpp\u2019s own server) treat every request as an independent context. Some of them have <em>prefix caching<\/em>, but it&#8217;s usually request-scoped or session-scoped, not graph-scoped. 
They&#8217;re built to maximize throughput across many independent user prompts, not to share state within one analytical pipeline that fans out from a single shared document. For that DAG-shaped workload (<em>one root, many leaves, same data<\/em>), every public stack I tried made me pay for the root once per leaf.<\/p>\n<p class=\"wp-block-paragraph\">SwarmKV serves as the explicit orchestration layer, written in C++ to bypass runtime abstractions, guarantee deterministic pointer life cycles, and drive hardware-level <code>memcpy<\/code> efficiency.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">3. The \u201cjust snapshot the KV\u201d lightbulb (and why it\u2019s harder than it sounds)<\/h2>\n<p class=\"wp-block-paragraph\">The pitch is simple:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Run prefill once on the shared document under sequence id <code>kSwarmkvPrefixSeqId<\/code>.<\/li>\n<li class=\"wp-block-list-item\">Serialize the resulting KV state into a host buffer via <code>llama_state_get_data<\/code>.<\/li>\n<li class=\"wp-block-list-item\">For each downstream branch, <code>memcpy<\/code> that buffer into a per-branch allocation.<\/li>\n<li class=\"wp-block-list-item\">Spin up a fresh <code>llama_context<\/code>, call <code>llama_state_seq_set_data<\/code> to install the snapshot, then decode the branch prompt with RoPE positions continuing from <code>prefix_seq_len<\/code>.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">This is the \u2018compute once, fan out\u2019 paradigm. The only reason it takes more than a 30-line <code>llama.cpp<\/code> patch is that three tedious edge cases immediately break the naive approach. 
The concept is beautifully simple and should be an easy weekend project, but low-level hardware and systems realities make it a sizable engineering challenge to actually implement.<\/p>\n<h3 class=\"wp-block-heading\">Problem A: How big is the KV?<\/h3>\n<p class=\"wp-block-paragraph\">An easy answer: <code>n_layers \u00d7 n_head_kv \u00d7 n_ctx \u00d7 head_dim \u00d7 dtype \u00d7 2<\/code>. But that hand-derived number drifts every time the quantization format changes, every time the GQA ratio changes, or every time the engine adds a new state field. The only honest number is the one the engine tells you under the <em>current<\/em> build.<\/p>\n<p class=\"wp-block-paragraph\">So <code>MemoryPool<\/code> spins up a disposable <code>llama_context<\/code> solely to ask:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">size_t MemoryPool::get_required_kv_size(uint32_t n_ctx) {\n    \/\/ Start from library defaults so fields we don't care about stay sane.\n    llama_context_params params = llama_context_default_params();\n    params.n_ctx = n_ctx;\n\n    \/\/ Construct a disposable context only to query the serialized state footprint.\n    llama_context * ctx = llama_init_from_model(model_ref, params);\n    if (!ctx) {\n        throw std::runtime_error(\"MemoryPool::get_required_kv_size: llama_init_from_model failed.\");\n    }\n\n    \/\/ Ask llama.cpp how many bytes a full state blob would occupy for this ctx.\n    const size_t sz = llama_state_get_size(ctx);\n    llama_free(ctx);\n\n    \/\/ If the engine reports zero, fall back to a small non-zero allocation so tests\n    \/\/ still exercise the registry without pretending we know exact tensor layouts.\n    if (sz == 0) {\n        return size_t{1} &lt;&lt; 20;\n    }\n    return sz;\n}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The logic here is simple: ask the engine politely instead of trusting a PDF or a mathematical 
formula. Just spin up a context, ask for the size, and allocate exactly that much: a simple yet very successful recipe.<\/p>\n<h3 class=\"wp-block-heading\">Problem B: llama.cpp is a picky eater about concurrent decode<\/h3>\n<p class=\"wp-block-paragraph\">Under the pinned upstream <code>llama.cpp<\/code> revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe. The exact behaviour depends on the backend, the version, and the graph scheduler (in newer revisions or with isolated streams it may behave better), but in our setup the failure modes were one of: (a) crash, (b) corrupted KV, or (c) a ten-minute hang while you Google whether ggml has a thread-local arena yet. Spoiler: in the pinned upstream, not really.<\/p>\n<p class=\"wp-block-paragraph\">The robust answer is to serialize the llama API surface at the boundary:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">namespace swarmkv {\n\n\/\/ llama.cpp CUDA paths are not safe for concurrent decode from multiple threads\n\/\/ on one GPU without external serialization. All node execute() bodies must\n\/\/ hold this mutex around llama_init \/ llama_decode \/ llama_free \/ state I\/O.\ninline std::mutex &amp; llama_api_mutex() {\n    static std::mutex m;\n    return m;\n}\n\n} \/\/ namespace swarmkv\n\nstruct LlamaGuard {\n    std::lock_guard&lt;std::mutex&gt; lock;\n    LlamaGuard() : lock(swarmkv::llama_api_mutex()) {}\n};<\/code><\/pre>\n<p class=\"wp-block-paragraph\">A simple 20-line header defines the entire concurrency policy. Every node\u2019s <code>execute()<\/code> body holds this around <code>llama_init_from_model<\/code> \/ <code>llama_decode<\/code> \/ <code>llama_state_seq_set_data<\/code> \/ <code>llama_free<\/code>. 
The DAG-level concurrency is real (futures, dependencies, fan-out); the <em>GPU compute<\/em> interleaves under a global lock. Pedants will correctly note this leaves perf on the floor compared to a hypothetical concurrent-decode upstream. Hold that thought: it&#8217;s the exact bottleneck Part 2 of this series goes after.<\/p>\n<h3 class=\"wp-block-heading\">Problem C: There is no stable external KV bind API<\/h3>\n<p class=\"wp-block-paragraph\">The aesthetically perfect implementation would be to allocate one contiguous KV buffer, attach it to the new context directly, and skip the <code>memcpy<\/code> entirely. Upstream <code>llama.cpp<\/code> exposes <code>llama_memory_t<\/code> and graph decode paths, but the public header pinned in this repo doesn&#8217;t ship a stable, exported <code>llama_kv_cache_bind<\/code>-style symbol.<\/p>\n<p class=\"wp-block-paragraph\">So SwarmKV does the next-best thing: it keeps the call site, names it honestly, and builds this path on top of <code>llama_state_set_data<\/code> instead.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">void KVHandoff::bind_contiguous_cache(llama_context * ctx, ggml_backend_buffer_t cache) {\n    \/\/ Validate arguments so misuse fails fast during bring-up and CI smoke runs.\n    \/\/ A null context can't decode; a null cache handle is a configuration bug.\n    if (!ctx || !cache) {\n        throw std::invalid_argument(\"KVHandoff::bind_contiguous_cache: null context or buffer.\");\n    }\n\n    \/\/ Explicitly mark both parameters as intentionally unused in this revision.\n    \/\/ This prevents -Wunused-parameter warnings under strict warning flags.\n    (void) ctx;\n    (void) cache;\n\n    \/\/ No stable bind call is issued here; see file-level comment above.\n    \/\/ When upstream adds a supported attachment API, implement it only in this 
function.\n}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">I know, I know. It&#8217;s a function that does nothing. It has full argument validation, a docstring twice the length of the body, and a stable place in the call graph. It&#8217;s patiently waiting for the day upstream lets it actually do its job. I&#8217;ve written more honest code in my life, I just can&#8217;t remember when!<\/p>\n<p class=\"wp-block-paragraph\">This is also the part where careful readers go <em>\u201cwait, if <code>bind_contiguous_cache<\/code> is a no-op, what&#8217;s the <code>MemoryPool<\/code> buffer even for?\u201d<\/em> Excellent question. It&#8217;s the <em>staging area<\/em>: the canonical buffer where PrefillNode writes its <code>llama_state_get_data<\/code> blob, and the source that each branch <code>memcpy<\/code>s from. Decode itself uses the context\u2019s internally managed KV. Pool buffer = host-side fan-out scratch; context KV = the engine\u2019s own thing. Two memory regions, one snapshot, zero magic.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">4. The five-step pipeline (the actually-cool part)<\/h2>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-git\">Step 0:  Validate doc + max_branch + 128 \u2264 n_ctx      (context_budget.h, fail-fast)\nStep 1:  Build the DAG; DFS-check for cycles          (Orchestrator)\nStep 2:  Spawn std::async workers; gate on futures    (Orchestrator)\nStep 3:  Prefill once, serialize KV to host buffer    (PrefillNode + MemoryPool)\nStep 4:  memcpy snapshot \u2192 branch buffer \u2192 decode     (AnalyticalNode + KVHandoff)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let\u2019s walk through each with the real code. 
The snippets have been kept short deliberately; the full files are tiny and worth reading.<\/p>\n<h3 class=\"wp-block-heading\">Step 0: Fail-fast context budget<\/h3>\n<p class=\"wp-block-paragraph\">Three lines that save you from a 3 AM Slack message from your past self:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">const int32_t required = prefix_tokens + max_branch + generation_headroom;\n    if (required &gt; restrict) {\n        throw std::runtime_error(\n            \"Context budget exceeded: prefix_tokens=\" + std::to_string(prefix_tokens) +\n            \" max_branch_tokens=\" + std::to_string(max_branch) +\n            \" headroom=\" + std::to_string(generation_headroom) +\n            \" required=\" + std::to_string(required) + \" n_ctx=\" + std::to_string(restrict));\n    }<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This runs before any context is built, any pool buffer is allocated, or any GPU memory is touched. If you ask <em>SwarmKV<\/em> to prefill 4,000 tokens into an <code>n_ctx=4096<\/code> context with two branches and 128 tokens of decode headroom, it tells you the math doesn&#8217;t work and goes to sleep. 
The kindest thing you can do for your future self is to reject impossible configurations before allocating the first byte.<\/p>\n<h3 class=\"wp-block-heading\">Step 1: DAG cycle detection<\/h3>\n<p class=\"wp-block-paragraph\">The orchestrator does a standard 3-color DFS on the dependency adjacency list:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">\/\/ dfs lambda walks adjacency lists and throws when a back-edge indicates a cycle.\n    auto dfs = [&amp;](auto self, const std::string &amp; u) -&gt; void {\n        \/\/ Mark node u as currently on the recursion stack (visiting).\n        state[u] = 1;\n        \/\/ Explore all outgoing dependency edges from u to downstream nodes v.\n        for (const auto &amp; v : adj[u]) {\n            \/\/ If v is visiting, we found a cycle u -&gt; v and must abort pipeline setup.\n            if (state[v] == 1) {\n                \/\/ Throw with edge names so graph misconfiguration is easy to diagnose.\n                throw std::runtime_error(\"Dependency cycle detected: \" + u + \" -&gt; \" + v);\n            }\n            \/\/ Recurse only when v has not been fully processed yet.\n            if (state[v] == 0) {\n                \/\/ Continue DFS from child node v.\n                self(self, v);\n            }\n        }<\/code><\/pre>\n<p class=\"wp-block-paragraph\">I know it\u2019s boring, but trust me, it\u2019s necessary. It&#8217;s the algorithmic equivalent of checking your shoelaces before running. Skip it once and your pipeline will spend the rest of its short life waiting on itself. 
The error message includes the offending edge, so you&#8217;ll find the typo without grepping.<\/p>\n<h3 class=\"wp-block-heading\">Step 2: One <code>std::async<\/code> per node, gated on shared futures<\/h3>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">worker_tasks.push_back(std::async(\n            std::launch::async,\n            [this, name, state, dependencies, &amp;completion_promises, &amp;completion_futures]() {\n                \/\/ Read this node's watermark requirement once for dependency gating decisions.\n                const int32_t req = nodes.at(name)-&gt;required_prefix_tokens();\n                \/\/ Wait for each upstream dependency according to V2 watermark rules.\n                for (const auto &amp; dep_name : dependencies) {\n                    \/\/ Resolve upstream node pointer for prefill provider detection.\n                    ExecutionNode * dep = nodes.at(dep_name).get();\n                    \/\/ If upstream is prefiller and this branch uses watermark gating, wait on watermark.\n                    if (dep-&gt;is_prefill_provider() &amp;&amp; req &gt;= 0) {\n                        \/\/ Block until PipelineState watermark &gt;= required_prefix_tokens (speculative start).\n                        state-&gt;wait_for_watermark(req);\n                    } else {\n                        \/\/ Otherwise preserve V1 behavior: wait until upstream node thread completes.\n                        completion_futures.at(dep_name).wait();\n                    }\n                }\n                \/\/ Build llama_context_params with orchestrator default n_ctx budget.\n                llama_context_params params = llama_context_default_params();\n                \/\/ Raise n_ctx to SwarmKV default pipeline context for multi-k token documents.\n                params.n_ctx = kSwarmkvDefaultPipelineCtx;\n                \/\/ Bundle model\/pool\/name into 
OrchestratorContext for node execute().\n                OrchestratorContext ctx = {\n                    this-&gt;memory_pool-&gt;get_model(),\n                    params,\n                    this-&gt;memory_pool,\n                    name.c_str(),\n                };\n                \/\/ Run node logic and fulfill promise so dependents can proceed.\n                try {\n                    \/\/ Dispatch to PrefillNode or AnalyticalNode implementation.\n                    nodes.at(name)-&gt;execute(state, &amp;ctx);\n                    \/\/ Signal successful completion to shared_future waiters.\n                    completion_promises.at(name).set_value();\n                } catch (...) {\n                    if (req &gt; 0) {\n                        state-&gt;signal_milestone_consumed(req);\n                    }\n                    try {\n                        completion_promises.at(name).set_exception(std::current_exception());\n                    } catch (...) {\n                        \/\/ set_exception throws if the promise is already satisfied; nothing more to report.\n                    }\n                    throw;\n                }\n            }));<\/code><\/pre>\n<p class=\"wp-block-paragraph\">One <code>std::promise<\/code> per node, with a <code>std::shared_future<\/code> so multiple downstream branches can wait on the same upstream completion without playing pass-the-future. The failure path always sets the exception, so dependents don&#8217;t wait forever. We have all debugged the alternative, and oh boy did we not enjoy it!<\/p>\n<p class=\"wp-block-paragraph\">Notice what&#8217;s <em>not<\/em> in this loop: any logic about prefill, KV, or branches. The orchestrator doesn&#8217;t know what a <code>PrefillNode<\/code> is. It knows about <em>names<\/em>, <em>edges<\/em>, and <em>promises<\/em>. The node-specific work lives in <code>execute()<\/code> and is fully polymorphic behind the <code>ExecutionNode<\/code> virtual interface. 
Just one responsibility for a kid, not overwhelming at all!<\/p>\n<h3 class=\"wp-block-heading\">Step 3: Prefill once, export KV<\/h3>\n<p class=\"wp-block-paragraph\">PrefillNode does four things in the following sequence:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Read the document text from <code>examples\/base_doc.txt<\/code>.<\/li>\n<li class=\"wp-block-list-item\">Tokenize it (with the resize-on-negative-return llama idiom).<\/li>\n<li class=\"wp-block-list-item\">Decode the tokens in chunks bounded by <code>llama_n_batch(lctx)<\/code>, on sequence lane <code>kSwarmkvPrefixSeqId<\/code>, with absolute RoPE positions matching the absolute token index:<\/li>\n<\/ol>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">\/\/ Absolute RoPE position equals index in the full document token stream.\nbatch.pos[i] = cur + i;\n\/\/ Each token belongs to exactly one sequence id list.\nbatch.n_seq_id[i] = 1;\n\/\/ Bind all document tokens to the shared prefix sequence lane constant.\nbatch.seq_id[i][0] = kSwarmkvPrefixSeqId;\n\/\/ Disable logits during prefill; we keep zeros for all tokens here.\nbatch.logits[i] = 0;<\/code><\/pre>\n<p class=\"wp-block-paragraph\">4. Export the prefix-sequence KV into the canonical host buffer and stamp the watermark for the branches. <\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">KVHandoff::bind_contiguous_cache(lctx, state-&gt;materialized_branch_buffer);\n\/\/ Mark prefill_complete so branches using kSwarmkvWaitForPrefillComplete can proceed.\nstate-&gt;mark_prefill_complete();<\/code><\/pre>\n<p class=\"wp-block-paragraph\">That&#8217;s the whole point of the article in two lines. 
Everything else in this repo (the orchestrator, the LlamaGuard, the budget check, the documented no-op) exists to feed these two lines and to deliver their output to the branches in a single <code>memcpy<\/code> with no extra round trips.<\/p>\n<h3 class=\"wp-block-heading\">Step 4: Branch decode under LlamaGuard<\/h3>\n<p class=\"wp-block-paragraph\">1. Allocate a per-branch buffer sized for the same n_ctx as the prefix:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">\/\/ Allocate a branch buffer sized for n_ctx so later decode has headroom in the same blob policy.\nbranch_buf = ctx-&gt;memory_pool-&gt;allocate_branch_cache(static_cast&lt;uint32_t&gt;(ctx-&gt;ctx_params.n_ctx));<\/code><\/pre>\n<p class=\"wp-block-paragraph\">2. Copy the canonical snapshot into the branch buffer, a.k.a. the famous \u201ccopy-on-fork\u201d:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">\/\/ Full-prefill path copies from canonical staging into the branch allocation.\nKVHandoff::materialize_branch_cache(\n    state-&gt;materialized_branch_buffer,\n    branch_buf,\n    fork_kv_bytes);<\/code><\/pre>\n<p class=\"wp-block-paragraph\">\u2026which, when you click through, becomes<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">std::memcpy(dst_ptr, src_ptr, ncopy);<\/code><\/pre>\n<p class=\"wp-block-paragraph\">That&#8217;s it. That&#8217;s \u201ccopy-on-fork at the storage layer\u201d, which literally is <code>memcpy<\/code>. It&#8217;s the primitive. Everything fancy you&#8217;ve read about prefix sharing (RadixAttention\u2019s reference counting, paged attention\u2019s block table indirection) is sitting on top of the same idea: <em>don\u2019t recompute, copy the bytes<\/em>.<\/p>\n<p class=\"wp-block-paragraph\">3. 
Spin up a fresh <code>llama_context<\/code> and restore the snapshot:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">\/\/ Restore only the prefix sequence lane so branch decode stays isolated on seq 0.\nconst size_t n = llama_state_seq_set_data(\n    lctx,\n    static_cast&lt;const uint8_t *&gt;(base),\n    fork_kv_bytes,\n    kSwarmkvPrefixSeqId);\n\/\/ Verify llama consumed exactly the number of bytes we copied into the branch buffer.\nif (n != fork_kv_bytes) {\n    \/\/ Free the context before throwing to avoid leaking VRAM on failure paths.\n    llama_free(lctx);\n    \/\/ Throw with a clear message so operators can debug size mismatches quickly.\n    throw std::runtime_error(\"AnalyticalNode: llama_state_seq_set_data size mismatch.\");\n}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">4. Build a single <code>llama_batch<\/code> for the short branch prompt with RoPE positions <em>continuing<\/em> from where the prefix ended:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-cpp\">for (int i = 0; i &lt; batch.n_tokens; ++i) {\n    \/\/ Copy the i-th branch token id into the batch slot.\n    batch.token[i] = tokens[static_cast&lt;size_t&gt;(i)];\n    \/\/ Place branch tokens immediately after the forked prefix positions for correct RoPE.\n    batch.pos[i] = static_cast&lt;llama_pos&gt;(fork_prefix_len) + static_cast&lt;llama_pos&gt;(i);\n    \/\/ Each token participates in exactly one sequence id list entry.\n    batch.n_seq_id[i] = 1;\n    \/\/ Bind all branch tokens to the shared prefix sequence lane constant.\n    batch.seq_id[i][0] = kSwarmkvPrefixSeqId;\n    \/\/ Disable logits for all tokens except the last one in this branch step.\n    batch.logits[i] = 0;\n}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This is the bit everyone gets wrong the first time. 
If you forget the offset and start branch positions at zero, rotary embeddings silently go sideways: the branch tokens get positions that collide with the prefix\u2019s, and the model decodes with relative offsets it was never trained for. The symptom is confidently coherent nonsense. Welcome to the worst kind of debugging hell; please leave a tip on your way out.<\/p>\n<p class=\"wp-block-paragraph\">5. Finally, one <code>llama_decode<\/code>. Then write a diagnostic string into <code>PipelineState::node_outputs<\/code>, record <code>timings_ms[name]<\/code>, and free the batch and the context.<\/p>\n<p class=\"wp-block-paragraph\">Three lines of business logic per branch. Two contexts in flight at decode time. One memcpy each. One global lock. Everything else is plumbing.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">5. The receipts (i.e., the numbers)<\/h2>\n<p class=\"wp-block-paragraph\">Now is the time to evaluate it against the baseline and see whether it was worth all the hassle. All numbers come from <code>examples\/example-run-results\/<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">Quick note on methodology before anyone reaches for the rocks: every comparison below runs the <strong>same<\/strong> model (<code>Qwen2.5-7B-Instruct-Q4_K_M.gguf<\/code>), the <strong>same<\/strong> document (a deterministic 3,501-token synthetic document generated by repeating <em>\u201cThe quick brown fox jumps over the lazy dog. \u201d<\/em> until the token target is hit \u2014 <code>examples\/base_doc.txt<\/code>), the <strong>same<\/strong> GPU (GTX 1080, 8 GiB, Pascal sm_61), the <strong>same<\/strong> <code>n_ctx=4096<\/code>, and the <strong>same<\/strong> dtype. Baseline = two sequential <code>llama_context<\/code> instances, each prefilling the full document then decoding its branch prompt. 
SwarmKV = <code>PrefillNode<\/code> once + two <code>AnalyticalNode<\/code> branches over the snapshot. Workload type: prefill-dominated document analysis (RAG-style), not autoregressive chat. Three trials run back-to-back with a GPU-idle wait between them; the best is chosen by <code>2\u00b7TTFT_pct + E2E_pct<\/code>, so the headline figures are best-of-three rather than averages. <\/p>\n<p class=\"wp-block-paragraph\">One metric definition before the table, because it matters: we use <strong>\u201cBranch-2 activation latency (TTFT proxy)\u201d<\/strong> \u2014 not the textbook serving-literature \u201crequest-arrival \u2192 first-output-token\u201d TTFT. We mean <em>the time the second branch spends in branch-specific work<\/em>: its activation latency <em>after<\/em> the shared prefill is amortized across all branches. In a fan-out pipeline the cost the downstream consumer perceives is exactly this number, because the upstream prefill is paid once for the whole pipeline by design. The baseline value for this metric is the redundant document prefill that the second <code>llama_context<\/code> is forced to redo before it can answer; the SwarmKV value is the fork + restore + short-prompt decode.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Headline: GTX 1080, Qwen2.5-7B Q4_K_M, 3,501-token document, two branches<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th>Metric<\/th>\n<th>Baseline (HF-style)<\/th>\n<th>SwarmKV<\/th>\n<th>Delta<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>End-to-end wall clock<\/td>\n<td>10,275 ms<\/td>\n<td>5,272 ms<\/td>\n<td><strong>\u221248.69 %<\/strong> (~<strong>1.95\u00d7<\/strong>)<\/td>\n<\/tr>\n<tr>\n<td>Branch-2 activation latency (TTFT proxy)<\/td>\n<td>4,339 ms<\/td>\n<td>83 ms<\/td>\n<td><strong>\u221298.09 %<\/strong> (~<strong>52.3\u00d7<\/strong>)<\/td>\n<\/tr>\n<tr>\n<td>Baseline Agent-1 prefill<\/td>\n<td>4,346 ms<\/td>\n<td>\u2013<\/td>\n<td>\u2013<\/td>\n<\/tr>\n<tr>\n<td>Baseline Agent-2 
prefill<\/td>\n<td>4,339 ms<\/td>\n<td>\u2013<\/td>\n<td>\u2013<\/td>\n<\/tr>\n<tr>\n<td>SwarmKV per-branch decode (avg)<\/td>\n<td>\u2013<\/td>\n<td>77 ms<\/td>\n<td>\u2013<\/td>\n<\/tr>\n<tr>\n<td>Redundant prefill eliminated<\/td>\n<td>\u2013<\/td>\n<td>\u2013<\/td>\n<td>8,685 ms<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Translation: the baseline spent 4,339 ms of the second agent\u2019s perceived latency re-doing the dense attention pass it had just finished four seconds earlier on the same bytes. SwarmKV looks at that and says \u201cwhat if we didn\u2019t?\u201d and ships an 83-millisecond answer. The cleanest single-number measurement of \u201chow expensive was that prefill?\u201d is just the ratio of those two timings; everything else in the branch is a rounding error.<\/p>\n<h3 class=\"wp-block-heading\">Where the per-branch ~83 ms goes<\/h3>\n<p class=\"wp-block-paragraph\">The thesis of this whole article rests on one inequality: <em>per-branch restore + decode is much, much cheaper than a redundant document prefill.<\/em> The harness measures this directly at the aggregate level \u2014 the per-branch wall clock (allocate + copy + restore + decode, end to end) is <strong>71\u201383 ms<\/strong> depending on which branch we look at, against a redundant prefill cost of <strong>~4,339 ms<\/strong>. A ~52\u00d7 ratio at the aggregate level is what makes everything else in this article work.<\/p>\n<p class=\"wp-block-paragraph\">For more results and numbers, I recommend looking directly at the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/AnubhabBanerjee\/swarmkv\/blob\/main\/examples\/example-run-results\/final_result.docx\">example run report<\/a>.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">6. 
\u201cOK, but how is this different from vLLM \/ prefix caching \/ SGLang RadixAttention?\u201d<\/h2>\n<p class=\"wp-block-paragraph\">A very reasonable question, and worth answering directly, because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>vLLM \/ continuous batching \/ paged attention.<\/strong> Optimized for multi-tenant decode-time serving: many concurrent requests at different decode steps, scheduling the next token across them under streaming load. Headline primitive: paged attention. Unit of work: a streaming firehose of independent user prompts.<\/li>\n<li class=\"wp-block-list-item\"><strong>TGI \/ vLLM prefix caching.<\/strong> Great if your shared prefix is request-scoped or session-scoped. Not designed to expose KV snapshots as first-class objects you can hand to a different <code>llama_context<\/code> running a different downstream task in the same process.<\/li>\n<li class=\"wp-block-list-item\"><strong>SGLang RadixAttention.<\/strong> Tree-shaped prefix sharing inside a serving runtime \u2014 the closest cousin, but it&#8217;s a <em>server<\/em>, not a single-process orchestration primitive.<\/li>\n<li class=\"wp-block-list-item\"><strong>llama.cpp\u2019s own state save\/restore.<\/strong> Exists, per-context. SwarmKV is the pipeline-level glue: a DAG, a host-buffer area sized by the engine itself, a memcpy fan-out, a <code>LlamaGuard<\/code> policy, and a documented no-op patiently awaiting an upstream bind API.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">7. So\u2026 how do I actually try it?<\/h2>\n<p class=\"wp-block-paragraph\">Well, I already posted the GitHub link at the beginning of the article. 
If you have come this far down, please work hard one more time and scroll back up to the top.<\/p>\n<p class=\"wp-block-paragraph\">Artifacts land under <code>examples\/example-run-results\/<\/code>: <code>best_run.json<\/code>, <code>all_trials.csv<\/code>, <code>plots\/*.png<\/code>, and a narrative <code>final_result.docx<\/code> that walks through methodology and limitations. <\/p>\n<p class=\"wp-block-paragraph\">Requirements: Linux, CUDA toolkit, an NVIDIA GPU (Pascal or newer; consumer or datacenter both work), a GGUF model that fits in your VRAM, and the patience to read a CMake file once.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">8. Plot twist \u2014 this is just SIB broadcast in a transformer costume<\/h2>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/telecom_sib_analogy-1024x683.png\" alt=\"\" class=\"wp-image-665567\"\/><\/figure>\n<p class=\"wp-block-paragraph\">I should probably confess at this point: I&#8217;m not a \u201cGPU person\u201d by training. I came up through telecom \u2014 5G NR with a foot creeping firmly into 6G research \u2014 and I started looking at LLM inference infrastructure because every problem in this codebase felt strangely familiar.<\/p>\n<p class=\"wp-block-paragraph\"><strong>One-sentence decoder ring for readers without a 3GPP background:<\/strong> in a 5G network, the cell tower doesn&#8217;t unicast network configuration to every phone individually \u2014 it <em>broadcasts<\/em> a small set of System Information Blocks (SIB1, SIB2, \u2026) once on a shared channel, every phone in range reads the same broadcast, and per-user data rides on top of that shared context on a dedicated channel. 
The acronyms in the table below \u2014 <strong>MIB<\/strong> (Master Information Block, the very first thing every phone reads), <strong>PBCH<\/strong> and <strong>PDSCH<\/strong> (the shared broadcast and downlink data channels), <strong>HARQ<\/strong> (the receiver\u2019s \u201ckeep what we already decoded, only re-send what was missing\u201d retransmission protocol), and <strong>RNTI<\/strong> (the temporary ID that distinguishes one phone\u2019s traffic from another\u2019s) \u2014 are just names for the channels and identifiers that separate <em>shared, computed once<\/em> from <em>unique per consumer<\/em>. That distinction is the whole analogy.<\/p>\n<p class=\"wp-block-paragraph\">Look at this side-by-side and tell me with a straight face these are different problems:<\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th>5G NR cell broadcast (on the gNB)<\/th>\n<th>SwarmKV (on the GPU)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>One MIB on PBCH per SS burst<\/td>\n<td>One shared document tokenized once<\/td>\n<\/tr>\n<tr>\n<td>Repeated SIBs (SIB1, SIB2, \u2026) on PDSCH<\/td>\n<td>Serialized KV snapshot in MemoryPool<\/td>\n<\/tr>\n<tr>\n<td>Every camped UE in the cell reads the same SI<\/td>\n<td>Every analytical branch reads the same snapshot<\/td>\n<\/tr>\n<tr>\n<td>UE-specific dedicated PDSCH for unicast user data<\/td>\n<td>Per-branch <code>llama_context<\/code> decoding the branch prompt<\/td>\n<\/tr>\n<tr>\n<td>RNTI per UE distinguishes unicast streams<\/td>\n<td>Per-branch buffer + sequence id distinguishes branch state<\/td>\n<\/tr>\n<tr>\n<td>HARQ soft-buffer retained across retransmissions<\/td>\n<td>KV snapshot retained across branches<\/td>\n<\/tr>\n<tr>\n<td>Skip broadcast \u2192 every UE forces unicast SI \u2192 air interface melts<\/td>\n<td>Skip snapshot \u2192 every branch re-prefills the document \u2192 GPU 
melts<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h3 class=\"wp-block-heading\">A quick aside to two very different audiences<\/h3>\n<p class=\"wp-block-paragraph\">To my HPC and CUDA-first friends reading this: I know. KV reuse isn&#8217;t a new idea. vLLM has prefix caching, SGLang has RadixAttention, llama.cpp itself exposes state save\/restore. SwarmKV\u2019s contribution isn&#8217;t the <em>primitive<\/em>; it&#8217;s the <em>single-process orchestration shape<\/em> \u2014 a tiny C++ DAG runtime that exposes \u201cprefill once, fan out N branches\u201d as a first-class operation, sized for one 8 GiB consumer GPU, with the safety rails (<code>LlamaGuard<\/code>, <code>swarmkv_validate_context_budget<\/code>, the documented bind no-op) that a researcher actually needs to ship a demo on a Tuesday. Please put the pitchforks down.<\/p>\n<p class=\"wp-block-paragraph\">To my telecom friends: if \u201cKV cache\u201d sounded like a foreign language until ten minutes ago, you aren&#8217;t behind \u2014 you&#8217;re early. For twenty years our world was FPGAs, ASICs, and PRBs. We optimized spectrum, not silicon. Then AI-RAN, NWDAF, NVIDIA Aerial, the AI-RAN Alliance, and the 3GPP Rel-20 study items all happened in roughly the same eighteen months, and the next decade of telecom careers now demands being bilingual between spectrum-world and GPU-world. The intuition translates cleanly. You have been fanning out shared computation to many users since the first CRS pilot. Same animal, just a new zoo.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">9. Honest caveats (because the comments are coming)<\/h2>\n<p class=\"wp-block-paragraph\">If you came here to find what&#8217;s wrong with the project \u2014 congratulations, the project found its first reader. 
From the limitations section of <code>final_result.docx<\/code> and the inline comments in the source:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>KV staging is host-side.<\/strong> <code>MemoryPool<\/code> allocates <code>ggml_backend_buffer_t<\/code> from the <strong>CPU device<\/strong> (<code>ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU)<\/code>). Branch decode still runs on the GPU; only the snapshot transit is host-staged via <code>llama_state_get_data<\/code> \u2192 memcpy \u2192 <code>llama_state_seq_set_data<\/code>. A device-aware materialize lives on the roadmap, blocked on the same upstream KV bind API that <code>bind_contiguous_cache<\/code> is waiting for.<\/li>\n<li class=\"wp-block-list-item\"><strong>Shared decode mutex (under the pinned upstream revision).<\/strong> <code>LlamaGuard<\/code> serializes every <code>llama_*<\/code> call from worker threads. Under the <code>llama.cpp<\/code> revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe \u2014 the exact behaviour depends on backend, version, and graph scheduling, but in our setup the conservative choice was a global lock. The DAG-level concurrency is real, but per-request GPU compute stays sequential. This is the single largest performance limitation in V1, and it&#8217;s exactly where Part 2 of this series picks up.<\/li>\n<li class=\"wp-block-list-item\"><strong><code>SwarmKV_Prefill_Ms<\/code> reports 0.<\/strong> Known instrumentation gap in how <code>OrchestratorContext::node_name<\/code> is consumed inside <code>PrefillNode<\/code>. The prefill <em>ran<\/em> (you see its cost in <code>End_To_End_Ms<\/code> and the derived effective shared-prefill cost), it&#8217;s just not being keyed correctly into <code>timings_ms<\/code>. 
The effective shared prefill is calculated as <code>SwarmKV_End_To_End_Ms \u2212 max(SwarmKV_AgentA_Ms, SwarmKV_AgentB_Ms)<\/code> \u2248 5,189 ms. Reporting bug, not correctness bug. Logged.<\/li>\n<li class=\"wp-block-list-item\"><strong>Synthetic document.<\/strong> The benchmark builds a deterministic 3,501-token document by repeating \u201cThe quick brown fox jumps over the lazy dog. \u201d until the token target is hit. This isolates the performance signal from content effects and keeps trials reproducible bit-for-bit. Real documents will produce noisier per-trial absolute timings; the <em>structural ratios<\/em> should not move.<\/li>\n<li class=\"wp-block-list-item\"><strong>Single GPU class.<\/strong> All numbers in the report come from one Pascal-class GTX 1080. Newer GPUs (Ada, Hopper) prefill much faster \u2014 the absolute ms numbers will shrink, but the structural gap between full-prefill cost and short-decode cost (which is what SwarmKV exploits) should remain; that is untested here.<\/li>\n<li class=\"wp-block-list-item\"><strong><code>bind_contiguous_cache<\/code> is a documented no-op.<\/strong> Yes, still. Until upstream lands a stable external-KV attachment API, the function validates its arguments, casts them to <code>void<\/code>, and goes home.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">But don\u2019t worry, everything on this list is on the roadmap. None of it changes the headline result, though. The point of putting it in writing is that <em>you shouldn&#8217;t have to dig for it<\/em> \u2014 and the moment a benchmark blog post hides its caveats is the moment its numbers stop being trustworthy.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\">10. 
The V1 ceiling (and the setup for Part 2)<\/h2>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/swarmkv_timeslice_agents-1024x683.png\" alt=\"\" class=\"wp-image-665568\"\/><\/figure>\n<p class=\"wp-block-paragraph\">SwarmKV proves you can stop re-prefilling. But if you reread caveat #2, you have already spotted the next ceiling: <strong>the GPU compute itself is still serialized.<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Here&#8217;s what actually happens on the wall clock. The DAG-level concurrency is real \u2014 branches are real <code>std::async<\/code> workers with real dependency gating. But every branch\u2019s <code>llama_decode<\/code> runs inside <code>LlamaGuard<\/code>, a single global mutex. So while the <em>orchestration<\/em> fans out, the <em>GPU work<\/em> lines up single file. Two branches take turns. Fifty branches take fifty turns. The GPU is never truly shared; it&#8217;s time-multiplexed by hand, one lock at a time, with no fairness guarantee and no way to measure who&#8217;s starving whom.<\/p>\n<p class=\"wp-block-paragraph\">That&#8217;s fine for a two-agent demo. 
It falls apart the moment you run the workload SwarmKV is actually built for: <strong>50 specialized micro-agents competing for one GPU.<\/strong> At that scale you stop caring about \u201cdid we avoid re-prefill\u201d and start caring about questions a hand-rolled mutex can&#8217;t answer:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">When 50 agents want the GPU at once, who goes first, and how do we make it fair?<\/li>\n<li class=\"wp-block-list-item\">What&#8217;s the p50, p95, and p99 latency each agent sees while sharing one card?<\/li>\n<li class=\"wp-block-list-item\">How much jitter does contention add, and where does throughput collapse?<\/li>\n<li class=\"wp-block-list-item\">How do we slice GPU compute cycles on purpose instead of by accident?<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">That&#8217;s <strong>Part 2 of this series: Time-Slicing the GPU for Concurrent Agent Swarms.<\/strong> Toy agents run sequentially in Python. Production agents run concurrently on bare metal, and managing VRAM and compute when many micro-agents share one NVIDIA GPU is its own discipline. Part 2 builds a Kubernetes-level time-slice profiler that dynamically allocates compute cycles and measures p50\/p95\/p99 latency, jitter, and throughput proxies when agentic inference workloads share a GPU via the Kubernetes Device Plugin with CUDA time-slicing. The global mutex in SwarmKV is exactly the thing it replaces with something measurable.<\/p>\n<p class=\"wp-block-paragraph\"><em>(For the curious: there&#8217;s a separate, orthogonal V1 limitation worth a future SwarmKV V2 post \u2014 the pipeline currently waits for the <strong>entire<\/strong> prefill to finish before any branch starts, even if a branch only needs the first 500 tokens of context. 
Letting branches start the moment their required prefix slice is materialized is a real win, but it&#8217;s its own story and its own benchmark. It isn&#8217;t Part 2. Part 2 is about sharing the GPU across many agents; that prefill-streaming idea is a follow-up.)<\/em><\/p>\n<p class=\"wp-block-paragraph\">See you in Part 2.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<p class=\"has-caption-1-font-size wp-block-paragraph\"><em>Disclaimer: The illustrations in this article (the hero banner, the architecture diagram, the telecom-vs-SwarmKV split panel, and the GPU time-slicing image) were generated using AI (Claude Opus 4.8). They&#8217;re illustrative, not photographic, and any labels visible inside the images are stylized rather than authoritative \u2014 refer to the article body and the code itself for precise function names, metric values, and architecture details. <\/em><\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>A funny-but-real tour of SwarmKV \u2014 KV-snapshot fan-out, copy-on-fork host buffers, and how to make a two-agent analytical pipeline ~1.95\u00d7 faster (and the second branch\u2019s activation latency 52\u00d7 faster) by being mildly mean to llama.cpp. of the \u201cProduction-Grade Agentic Inference\u201d series. 
Each part removes one kind of redundant work from an agentic LLM [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":15570,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[1266,74,1287,477,9367,2616,3221],"class_list":["post-15568","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-fan","tag-llm","tag-multiagent","tag-pipelines","tag-prefill","tag-sharing","tag-snapshot"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15568","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15568"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15568\/revisions"}],"predecessor-version":[{"id":15569,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15568\/revisions\/15569"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/15570"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15568"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15568"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}