<p><em>A humorous-but-real tour of SwarmKV: KV-snapshot fan-out, copy-on-fork host buffers, and how to make a two-agent analytical pipeline ~1.95× faster (and the second branch's activation latency ~52× faster) by being mildly mean to llama.cpp.</em></p>

<blockquote>
<p><strong>Part 1 of the "Production-Grade Agentic Inference" series</strong>. Each part removes one form of redundant work from an agentic LLM pipeline. Part 1 (this post) kills redundant prefill. Part 2 tackles redundant waiting: how 50 micro-agents share one GPU via time-slicing. Part 3 keeps RAG retrieval on the GPU with a custom CUDA Top-K kernel. Part 4 persists agent state across hand-offs so the next agent never has the cold-start problem.</p>
</blockquote>
<h2>Key takeaways</h2>
<p><strong>The problem:</strong> when several agents read the same document, a default serving stack makes each one rerun the <em>exact same</em> prefill. That redundant dense attention pass is pure waste.</p>
<p><strong>The fix:</strong> run prefill once, serialize the KV cache to a host buffer, <code>memcpy</code> it per branch, and restore it before decoding. "Compute once, fan out."</p>
<p><strong>The receipts:</strong> on a seven-year-old GTX 1080, a two-agent pipeline got <em>48.69% faster end to end (~1.95×)</em> and the second agent's activation latency dropped <em>98.09% (~52×)</em>, eliminating <em>8,685 ms</em> of redundant compute.</p>
<p><strong>The kicker:</strong> this is not a new algorithm. It is systems engineering, and it is the same "broadcast shared state once" decision a 5G cell tower has made every 80 ms since LTE.</p>
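<p>The percentages and ratios in the receipts above follow directly from the raw timings reported in the results table later in the article. A minimal sketch of that arithmetic is below; the struct and function names (<code>Timings</code>, <code>article_timings</code>, <code>pct_reduction</code>, <code>speedup</code>) are illustrative and not part of SwarmKV.</p>

```cpp
#include <cassert>
#include <cmath>

// Raw timings in milliseconds, copied from the article's results table.
// The type and helper names here are hypothetical, for illustration only.
struct Timings {
    double baseline_e2e_ms;             // baseline end-to-end wall clock
    double swarmkv_e2e_ms;              // SwarmKV end-to-end wall clock
    double baseline_branch2_ms;         // second agent's redundant prefill
    double swarmkv_branch2_ms;          // fork + restore + short-prompt decode
    double baseline_agent1_prefill_ms;  // first agent's prefill
};

inline Timings article_timings() {
    return {10275.0, 5272.0, 4339.0, 83.0, 4346.0};
}

// Percentage reduction when going from `before` to `after`.
inline double pct_reduction(double before, double after) {
    return 100.0 * (before - after) / before;
}

// Speedup factor: how many times smaller `after` is than `before`.
inline double speedup(double before, double after) {
    return before / after;
}

// The "8,685 ms" figure is the sum of the two baseline prefill timings.
inline double baseline_prefill_total_ms(const Timings & t) {
    return t.baseline_agent1_prefill_ms + t.baseline_branch2_ms;
}
```

<p>Feeding the table's numbers through these helpers gives roughly 48.69% and 1.95× end to end, and roughly 98.09% and 52.3× for branch-2 activation latency.</p>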
<p><strong>TL;DR</strong>: Standard LLM serving makes every analytical agent re-prefill the same shared document. Your GPU dutifully re-executes billions of redundant prefix-prefill multiplications. The same bytes. The same weights. The same quantization. All to recalculate a state it already finished calculating four seconds ago. SwarmKV runs the prefill <em>once</em>, serializes the resulting KV state to a host buffer via <code>llama_state_get_data</code>, <code>memcpy</code>s that buffer into a per-branch allocation, and lets each branch restore the snapshot with <code>llama_state_seq_set_data</code> before decoding from where the document left off. Yes, it is a real round-trip (serialize, copy, restore), but because the attention part of a redundant prefill scales quadratically with document length while the KV state transfer scales linearly, moving the data across a constrained Pascal memory bus is still vastly cheaper than recalculating the attention matrices from scratch. The result on a seven-year-old GTX 1080 reflects that: 48.69% end-to-end speedup on a two-agent pipeline, <em>98.09% reduction in branch-2 activation latency (~52×)</em>, 8,685 ms of redundant dense compute eliminated, zero new transformer tricks. Just the systems-engineering position that "compute once, fan out" beats "compute N times, hope nobody notices."</p>
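<p>The "compute once, fan out" shape described above can be sketched without a GPU. The block below is a mock, not SwarmKV code: <code>Snapshot</code>, <code>MockEngine</code>, and <code>fan_out</code> are hypothetical names, and the opaque byte vector stands in for the blob that the real pipeline obtains from <code>llama_state_get_data</code>. It only illustrates that the expensive step runs once and every branch gets a private copy via <code>memcpy</code>.</p>

```cpp
#include <cassert>   // used by the usage check, not by the sketch itself
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Stand-in for a serialized KV snapshot: opaque bytes (hypothetical type).
using Snapshot = std::vector<std::uint8_t>;

// Mock of the expensive step. It counts how often "prefill" runs and
// returns a deterministic blob instead of touching any real model.
struct MockEngine {
    int prefill_calls = 0;

    Snapshot prefill(std::size_t n_bytes) {
        ++prefill_calls;
        Snapshot blob(n_bytes);
        for (std::size_t i = 0; i < n_bytes; ++i) {
            blob[i] = static_cast<std::uint8_t>((i * 31u + 7u) & 0xFFu);
        }
        return blob;
    }
};

// Copy-on-fork: each branch receives its own copy of the canonical bytes,
// so a branch can mutate its buffer without affecting the others.
inline std::vector<Snapshot> fan_out(const Snapshot & canonical, std::size_t n_branches) {
    std::vector<Snapshot> branches;
    branches.reserve(n_branches);
    for (std::size_t b = 0; b < n_branches; ++b) {
        Snapshot copy(canonical.size());
        if (!canonical.empty()) {
            std::memcpy(copy.data(), canonical.data(), canonical.size());
        }
        branches.push_back(std::move(copy));
    }
    return branches;
}
```

<p>Calling <code>prefill</code> once and then <code>fan_out(canonical, N)</code> leaves <code>prefill_calls</code> at 1 regardless of N, which is the whole trade: one expensive pass plus N linear copies.</p>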
<blockquote>
<p><strong>GitHub repo</strong>: <a href="https://github.com/AnubhabBanerjee/swarmkv">https://github.com/AnubhabBanerjee/swarmkv</a></p>
</blockquote>
<p><em>(Quick confession before we start: I came at this from a 5G/6G RAN engineering background. As it turns out, fanning out a shared computation to many downstream consumers is shockingly close to what a cell tower has been doing every 80 ms since LTE when it broadcasts SIB1. There is a whole section on that below, section 8, but it is also why I am writing this in the first place.)</em></p>
<blockquote>
<p><strong>Architecture mental model</strong>: keep this open while you read.</p>
<p><code>Document → PrefillNode → llama_state_get_data → host KV buffer → memcpy per branch → llama_state_seq_set_data → AnalyticalNode decode (RoPE continues at prefix_seq_len)</code></p>
<p>Everything below is just commentary on one part of that line.</p>
</blockquote>
<figure><figcaption>SwarmKV architectural overview</figcaption></figure>
<h2>1. A confession: most of your second agent's "work" is a rerun</h2>
<p>If you have ever pointed two analytical agents at the same document via vanilla llama.cpp, here is what actually happens (with a little intentional dramatization):</p>
<blockquote>
<p><em>You: "Please provide an overview of this 3,500-token spec and, separately, list its license obligations."</em></p>
<p><em>llama.cpp (Agent 1): "Sure. Loading model. Prefilling document. Decoding answer."</em></p>
<p>GPU spends 4,346 ms on dense attention</p>
<p><em>llama.cpp (Agent 1): "Done. Here's a 6-token answer."</em></p>
<p><em>You: "Great. Now Agent 2."</em></p>
<p><em>llama.cpp (Agent 2): "Sure. Loading model. Prefilling document..."</em></p>
<p><em>You: "Wait, you literally just did that."</em></p>
<p><em>llama.cpp (Agent 2): "I'm an independent <code>llama_context</code>. I have no memory of Agent 1. 
I have no memory of anything. I'm a beautiful, stateless newborn." 🫡</em></p>
</blockquote>
<p>GPU spends another 4,339 ms on bit-for-bit identical attention math.</p>
<p>Your GPU thermal sensor gets a good workout; <em>your AWS bill develops a sense of humor</em>; your second agent's TTFT is 4.3 seconds before it can answer a 4-token question.</p>
<p>That is the joke. That is the dirty secret of every "agentic" pipeline that fans out from a shared document. Each branch starts from a blank slate and rebuilds the same KV cache the previous branch just finished building. The longer the document, the worse the tax. At 3,500 tokens on a Pascal GPU, the <em>vast</em> majority of the second agent's perceived latency is not the answer; it is reading the document again.</p>
<p>SwarmKV is what happens when you decide the second reading is optional and you would rather write 1,500 lines of C++ than let each agent build the same KV cache over and over.</p>
<p>Now consider: the toy demo in this repo is about two agents running a summary and a license check. The real shape of the workload it is built for is <em>N specialized evaluators over one dense technical document</em>. Picture an AI patent and prior-art pipeline: one 50,000-token technical specification at the root, and fifty concurrent branches evaluating novelty, mapping claims, retrieving prior art, checking freedom-to-operate, assessing ethical compliance, and translating into jurisdiction-specific language. The baseline cost of that pipeline on a default serving stack is <em>fifty full prefills</em> of the same spec. The SwarmKV cost is one prefill plus fifty <code>memcpy</code>s. That asymmetry is intentional, and it is the entire reason the repo exists. I have written separately about <a>detecting AI in invention reporting</a>; this is the infrastructure half of that work. 
Inventor's-notebook problems are exactly what SwarmKV is built around.</p>
<h2>2. Why does prefill exist at all? (a one-minute crash course)</h2>
<p>Skip this if you already know. For everyone else, here is the short version.</p>
<p>An autoregressive LLM serves a request in two phases. <em>Prefill</em> is the dense pass that pushes every prompt token through every transformer layer once and populates the per-layer key/value (KV) cache. <em>Decode</em> then runs token by token, attending to the prefilled KV cache and growing it incrementally.</p>
<p>Prefill cost grows at least linearly with prompt length (the attention term grows quadratically). Decode, by comparison, is cheap per token. On a Pascal-class GTX 1080 running Qwen2.5-7B Q4_K_M, prefilling a ~3,500-token document takes <em>about 4.3 seconds</em>; decoding a short branch prompt takes tens of milliseconds (the measured per-branch wall clock reported below is 71 to 83 ms), since it is dominated by setup, not arithmetic. That time difference between prefill and decode is exactly the leverage SwarmKV uses.</p>
<p>Mainstream serving stacks (vLLM, TGI, SGLang, llama.cpp's own server) treat every request as an independent context. Some of them have <em>prefix caching</em>, but it is usually request-scoped or session-scoped, not graph-scoped. They are built to maximize throughput across many independent user prompts, not to share state within one analytical pipeline that fans out from a single shared document. For that DAG-shaped workload (<em>one root, many leaves, same data</em>) every public stack I tried made me pay for the root once per leaf.</p>
<p>SwarmKV is the explicit orchestration layer for that workload. It is written in C++ to avoid runtime abstractions, keep pointer lifetimes deterministic, and make the fan-out a plain <code>memcpy</code>.</p>
<h2>3. 
The "just snapshot the KV" lightbulb (and why it's harder than it sounds)</h2>
<p>The pitch is simple:</p>
<ol>
<li>Run prefill once on the shared document under sequence id <code>kSwarmkvPrefixSeqId</code>.</li>
<li>Serialize the resulting KV state into a host buffer via <code>llama_state_get_data</code>.</li>
<li>For each downstream branch, <code>memcpy</code> that buffer into a per-branch allocation.</li>
<li>Spin up a fresh <code>llama_context</code>, call <code>llama_state_seq_set_data</code> to install the snapshot, then decode the branch prompt with RoPE positions continuing from <code>prefix_seq_len</code>.</li>
</ol>
<p>This is the "compute once, fan out" paradigm. The only reason it takes more than a 30-line <code>llama.cpp</code> patch is that three tedious edge cases immediately break the naive approach. The concept is beautifully simple and should be an easy weekend project, but low-level hardware and systems realities make it a real engineering challenge to implement.</p>
<h3>Problem A: How big is the KV?</h3>
<p>The easy answer: <code>n_layers × n_head_kv × n_ctx × head_dim × dtype × 2</code>. But that hand-derived number drifts every time the quantization format changes, every time the GQA ratio changes, or every time the engine adds a new state field. 
The only honest number is the one the engine tells you under the <em>current</em> build.</p>
<p>So <code>MemoryPool</code> spins up a disposable <code>llama_context</code> only to ask:</p>
<pre><code>size_t MemoryPool::get_required_kv_size(uint32_t n_ctx) {
    // Start from library defaults so fields we don't care about stay sane.
    llama_context_params params = llama_context_default_params();
    params.n_ctx = n_ctx;

    // Construct a disposable context only to query the serialized state footprint.
    llama_context * ctx = llama_init_from_model(model_ref, params);
    if (!ctx) {
        throw std::runtime_error("MemoryPool::get_required_kv_size: llama_init_from_model failed.");
    }

    // Ask llama.cpp how many bytes a full state blob would occupy for this ctx.
    const size_t sz = llama_state_get_size(ctx);
    llama_free(ctx);

    // If the engine reports zero, fall back to a small non-zero allocation so tests
    // still exercise the registry without pretending we know exact tensor layouts.
    if (sz == 0) {
        return size_t{1} &lt;&lt; 20;
    }
    return sz;
}</code></pre>
<p>The logic here is simple: ask the engine politely instead of trusting a PDF or a formula. Spin up a context, ask for the size, and allocate exactly that much. It is a simple but very effective recipe.</p>
<h3>Problem B: llama.cpp is a picky eater about concurrent decode</h3>
<p>Under the pinned upstream <code>llama.cpp</code> revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe. The exact behaviour depends on the backend, the version, and the graph scheduler (in newer revisions or with isolated streams it may behave better), but in our setup the failure modes were one of: (a) crash, (b) corrupted KV, or (c) a ten-minute hang while you Google whether ggml has a thread-local arena yet. 
Spoiler: in the pinned upstream, not really.</p>
<p>The robust answer is to serialize the llama API surface at the boundary:</p>
<pre><code>#include &lt;mutex&gt;

namespace swarmkv {

// llama.cpp CUDA paths are not safe for concurrent decode from multiple threads
// on one GPU without external serialization. All node execute() bodies must
// hold this mutex around llama_init / llama_decode / llama_free / state I/O.
inline std::mutex & llama_api_mutex() {
    static std::mutex m;
    return m;
}

} // namespace swarmkv

// RAII guard: holds the global llama API mutex for the guard's lifetime.
struct LlamaGuard {
    std::lock_guard&lt;std::mutex&gt; lock;
    LlamaGuard() : lock(swarmkv::llama_api_mutex()) {}
};</code></pre>
<p>A simple 20-line header defines the entire concurrency policy. Every node's <code>execute()</code> body holds this around <code>llama_init_from_model</code> / <code>llama_decode</code> / <code>llama_state_seq_set_data</code> / <code>llama_free</code>. The DAG-level concurrency is real (futures, dependencies, fan-out); the GPU <em>compute</em> interleaves under a global lock. Pedants will correctly note this leaves performance on the floor compared to a hypothetical concurrent-decode upstream. Hold that thought: it is the exact bottleneck Part 2 of this series goes after.</p>
<h3>Problem C: There is no stable external KV bind API</h3>
<p>The aesthetically perfect implementation would allocate one contiguous KV buffer, attach it to the new context directly, and skip the <code>memcpy</code> entirely. 
Upstream <code>llama.cpp</code> exposes <code>llama_memory_t</code> and graph decode paths, but the public header pinned in this repo does not ship a stable, exported <code>llama_kv_cache_bind</code>-style symbol.</p>
<p>So SwarmKV does the next-best thing: it keeps the call site, names it honestly, and implements this path on top of <code>llama_state_set_data</code> instead.</p>
<pre><code>void KVHandoff::bind_contiguous_cache(llama_context * ctx, ggml_backend_buffer_t cache) {
    // Validate arguments so misuse fails fast during bring-up and CI smoke runs.
    // A null context cannot decode; a null cache handle is a configuration bug.
    if (!ctx || !cache) {
        throw std::invalid_argument("KVHandoff::bind_contiguous_cache: null context or buffer.");
    }

    // Beyond validation, both parameters are intentionally unused in this revision.
    // The casts document that and keep strict warning flags quiet.
    (void) ctx;
    (void) cache;

    // No stable bind call is issued here; see file-level comment above.
    // When upstream adds a supported attachment API, implement it only in this function.
}</code></pre>
<p>I know, I know. It is a function that does nothing. It has full argument validation, a docstring twice the length of the body, and a stable place in the call graph. It is patiently waiting for the day upstream lets it actually do its job. I have written more honest code in my life, I just can't remember when!</p>
<p>This is also the part where careful readers go <em>"wait, if <code>bind_contiguous_cache</code> is a no-op, what is the <code>MemoryPool</code> buffer even for?"</em> Excellent question. It is the <em>staging area</em>: the canonical buffer where PrefillNode writes its <code>llama_state_get_data</code> blob, and the source that each branch <code>memcpy</code>s from. Decode itself uses the context's internally managed KV. 
Pool buffer = host-side fan-out scratch; context KV = the engine's own thing. Two memory regions, one snapshot, zero magic.</p>
<h2>4. The five-step pipeline (the actually-cool part)</h2>
<pre><code>Step 0: Validate document + max_branch + 128 ≤ n_ctx     (context_budget.h, fail-fast)
Step 1: Build the DAG; DFS-check for cycles               (Orchestrator)
Step 2: Spawn std::async workers; gate on futures         (Orchestrator)
Step 3: Prefill once, serialize KV to host buffer         (PrefillNode + MemoryPool)
Step 4: memcpy snapshot → branch buffer → decode          (AnalyticalNode + KVHandoff)</code></pre>
<p>Let's walk through each with the real code. The snippets are kept short deliberately; the full files are tiny and worth reading.</p>
<h3>Step 0: Fail-fast context budget</h3>
<p>Three lines that save you from a 3 AM Slack message from your past self:</p>
<pre><code>    // Total tokens the context must hold: shared prefix, longest branch prompt, decode headroom.
    const int32_t required = prefix_tokens + max_branch + generation_headroom;
    if (required > limit) {
        throw std::runtime_error(
            "Context budget exceeded: prefix_tokens=" + std::to_string(prefix_tokens) +
            " max_branch_tokens=" + std::to_string(max_branch) +
            " headroom=" + std::to_string(generation_headroom) +
            " required=" + std::to_string(required) + " n_ctx=" + std::to_string(limit));
    }</code></pre>
<p>This runs before any context is built, any pool buffer is allocated, or any GPU memory is touched. If you ask <em>SwarmKV</em> to prefill 4,000 tokens into an <code>n_ctx=4096</code> context with two branches and 128 tokens of decode headroom, it tells you the math does not work and goes to sleep. 
The kindest thing you can do for your future self is to reject impossible configurations before allocating the first byte.</p>
<h3>Step 1: DAG cycle detection</h3>
<p>The orchestrator does a standard 3-color DFS on the dependency adjacency list:</p>
<pre><code>    // dfs lambda walks adjacency lists and throws when a back-edge indicates a cycle.
    auto dfs = [&](auto self, const std::string & u) -> void {
        // Mark node u as currently on the recursion stack (visiting).
        state[u] = 1;
        // Explore all outgoing dependency edges from u to downstream nodes v.
        for (const auto & v : adj[u]) {
            // If v is visiting, we found a cycle u -> v and must abort pipeline setup.
            if (state[v] == 1) {
                // Throw with edge names so graph misconfiguration is easy to diagnose.
                throw std::runtime_error("Dependency cycle detected: " + u + " -> " + v);
            }
            // Recurse only when v has not been fully processed yet.
            if (state[v] == 0) {
                // Continue DFS from child node v.
                self(self, v);
            }
        }
        // Assumed completion of the excerpt (the article's snippet stops at the loop):
        // mark u as fully processed so later visits skip it.
        state[u] = 2;
    };</code></pre>
<p>I know it's boring, but trust me, it's necessary. It is the algorithmic equivalent of checking your shoelaces before running. Skip it once and your pipeline will spend the rest of its short life waiting on itself. 
The error message includes the offending edge, so you can find the typo without grepping.</p>
<h3>Step 2: One <code>std::async</code> per node, gated on shared futures</h3>
<pre><code>worker_tasks.push_back(std::async(
    std::launch::async,
    [this, name, state, dependencies, &completion_promises, &completion_futures]() {
        // Read this node's watermark requirement once for dependency gating decisions.
        const int32_t req = nodes.at(name)->required_prefix_tokens();
        // Wait for each upstream dependency according to V2 watermark rules.
        for (const auto & dep_name : dependencies) {
            // Resolve upstream node pointer for prefill provider detection.
            ExecutionNode * dep = nodes.at(dep_name).get();
            // If upstream is the prefiller and this branch uses watermark gating, wait on the watermark.
            if (dep->is_prefill_provider() && req >= 0) {
                // Block until PipelineState watermark >= required_prefix_tokens (speculative start).
                state->wait_for_watermark(req);
            } else {
                // Otherwise preserve V1 behavior: wait until the upstream node thread completes.
                completion_futures.at(dep_name).wait();
            }
        }
        // Build llama_context_params with the orchestrator default n_ctx budget.
        llama_context_params params = llama_context_default_params();
        // Raise n_ctx to the SwarmKV default pipeline context for multi-k token documents.
        params.n_ctx = kSwarmkvDefaultPipelineCtx;
        // Bundle model/pool/name into OrchestratorContext for node execute().
        OrchestratorContext ctx = {
            this->memory_pool->get_model(),
            params,
            this->memory_pool,
            name.c_str(),
        };
        // Run node logic and fulfill the promise so dependents can proceed.
        try {
            // Dispatch to the PrefillNode or AnalyticalNode implementation.
            nodes.at(name)->execute(state, &ctx);
            // Signal successful completion to shared_future waiters.
            completion_promises.at(name).set_value();
        } catch (...) {
            if (req > 0) {
                state->signal_milestone_consumed(req);
            }
            // Propagate the failure to dependents; ignore a promise that was already satisfied.
            try {
                completion_promises.at(name).set_exception(std::current_exception());
            } catch (...) {
            }
            throw;
        }
    }));</code></pre>
<p>One <code>std::promise</code> per node, with a <code>std::shared_future</code> so multiple downstream branches can wait on the same upstream completion without playing pass-the-future. The failure path always sets the exception, so dependents do not wait forever. We have all debugged the alternative, and oh boy did we not enjoy it!</p>
<p>Notice what is <em>not</em> in this loop: any logic about prefill, KV, or branches. The orchestrator does not know what a <code>PrefillNode</code> is. It knows about <em>names</em>, <em>edges</em>, and <em>promises</em>. The node-specific work lives in <code>execute()</code> and is fully polymorphic behind the <code>ExecutionNode</code> virtual interface. Only one responsibility per child, not overwhelming at all!</p>
<h3>Step 3: Prefill once, export KV</h3>
<p>PrefillNode does four things in the following sequence:</p>
<ol>
<li>Read the document text from <code>examples/base_doc.txt</code>.</li>
<li>Tokenize it (with the resize-on-negative-return llama idiom).</li>
<li>Decode the tokens in chunks bounded by <code>llama_n_batch(lctx)</code>, on sequence lane <code>kSwarmkvPrefixSeqId</code>, with absolute RoPE positions matching the absolute token index:</li>
</ol>
<pre><code>// Absolute RoPE position equals index in the full document token stream.
batch.pos[i] = cur + i;
// Each token belongs to exactly one sequence id list.
batch.n_seq_id[i] = 1;
// Bind all document tokens to the shared prefix sequence lane constant.
batch.seq_id[i][0] = kSwarmkvPrefixSeqId;
// No logits are needed during prefill, so every token keeps a zero here.
batch.logits[i] = 0;</code></pre>
<p>4. Export the prefix-sequence KV into the canonical host buffer and stamp the watermark for the branches. 
</p>
<pre><code>KVHandoff::bind_contiguous_cache(lctx, state->materialized_branch_buffer);
// Mark prefill_complete so branches using kSwarmkvWaitForPrefillComplete can proceed.
state->mark_prefill_complete();</code></pre>
<p>That is the whole point of the article in two lines. Everything else in this repo (the orchestrator, the LlamaGuard, the budget check, the documented no-op) exists to feed those two lines and to ship their output to the branches in a single <code>memcpy</code> with no extra round trips.</p>
<h3>Step 4: Branch decode under LlamaGuard</h3>
<p>1. Allocate a per-branch buffer sized for the same n_ctx as the prefix:</p>
<pre><code>// Allocate a branch buffer sized for n_ctx so later decode has headroom in the same blob policy.
branch_buf = ctx->memory_pool->allocate_branch_cache(static_cast&lt;uint32_t&gt;(ctx->ctx_params.n_ctx));</code></pre>
<p>2. Copy the canonical snapshot into the branch buffer, a.k.a. the famous "copy-on-fork":</p>
<pre><code>// Full-prefill path copies from canonical staging into the branch allocation.
KVHandoff::materialize_branch_cache(
    state->materialized_branch_buffer,
    branch_buf,
    fork_kv_bytes);</code></pre>
<p>...which, when you click through, becomes</p>
<pre><code>std::memcpy(dst_ptr, src_ptr, ncopy);</code></pre>
<p>That's it. That is "copy-on-fork at the storage layer", which literally is <code>memcpy</code>. It is the primitive. Everything fancy you have read about prefix sharing (RadixAttention's reference counting, paged attention's block table indirection) is sitting on top of the same idea: <em>don't recompute, reuse the bytes</em>. Those systems share blocks by reference; SwarmKV simply copies them.</p>
<p>3. 
Spin up a fresh <code>llama_context</code> and restore the snapshot:</p>
<pre><code>// Restore only the prefix sequence lane so branch decode stays isolated on seq 0.
const size_t n = llama_state_seq_set_data(
    lctx,
    static_cast&lt;const uint8_t *&gt;(base),
    fork_kv_bytes,
    kSwarmkvPrefixSeqId);
// Verify llama consumed exactly the number of bytes we copied into the branch buffer.
if (n != fork_kv_bytes) {
    // Free the context before throwing to avoid leaking VRAM on failure paths.
    llama_free(lctx);
    // Throw with a clear message so operators can debug size mismatches quickly.
    throw std::runtime_error("AnalyticalNode: llama_state_seq_set_data size mismatch.");
}</code></pre>
<p>4. Build a single <code>llama_batch</code> for the short branch prompt with RoPE positions <em>continuing</em> from where the prefix ended:</p>
<pre><code>for (int i = 0; i &lt; batch.n_tokens; ++i) {
    // Copy the i-th branch token id into the batch slot.
    batch.token[i] = tokens[static_cast&lt;size_t&gt;(i)];
    // Place branch tokens immediately after the forked prefix positions for correct RoPE.
    batch.pos[i] = static_cast&lt;llama_pos&gt;(fork_prefix_len) + static_cast&lt;llama_pos&gt;(i);
    // Each token participates in exactly one sequence id list entry.
    batch.n_seq_id[i] = 1;
    // Bind all branch tokens to the shared prefix sequence lane constant.
    batch.seq_id[i][0] = kSwarmkvPrefixSeqId;
    // Disable logits for every token in this loop (the last token is presumably
    // re-enabled after it; that line is not shown in this excerpt).
    batch.logits[i] = 0;
}</code></pre>
<p>This is the bit everyone gets wrong the first time. If you forget the offset and start branch positions at zero, rotary embeddings silently go sideways: the branch tokens claim positions the prefix already occupies, an ordering the model was never trained on. The symptom is confidently coherent nonsense. Welcome to the worst kind of debugging hell; please leave a tip on your way out.</p>
<p>5. 
Finally, one <code>llama_decode</code>. Then write a diagnostic string into <code>PipelineState::node_outputs</code>, record <code>timings_ms[name]</code>, and free the batch and the context.</p>
<p>Three lines of business logic per branch. Two contexts in flight at decode time. One memcpy each. One global lock. Everything else is plumbing.</p>
<h2>5. The receipts (i.e., the numbers)</h2>
<p>Now is the time to evaluate it against the baseline and see whether it was worth all the hassle. All numbers come from <code>examples/example-run-results/</code>.</p>
<p>Quick note on methodology before anyone reaches for the rocks: every comparison below runs the <strong>same</strong> model (<code>Qwen2.5-7B-Instruct-Q4_K_M.gguf</code>), the <strong>same</strong> document (a deterministic 3,501-token synthetic document generated by repeating <em>"The quick brown fox jumps over the lazy dog. "</em> until the token target is hit, stored as <code>examples/base_doc.txt</code>), the <strong>same</strong> GPU (GTX 1080, 8 GiB, Pascal sm_61), the <strong>same</strong> <code>n_ctx=4096</code>, and the <strong>same</strong> dtype. Baseline = two sequential <code>llama_context</code> instances, each prefilling the full document then decoding its branch prompt. SwarmKV = <code>PrefillNode</code> once + two <code>AnalyticalNode</code> branches over the snapshot. Workload type: prefill-dominated document analysis (RAG-style), not autoregressive chat. Three trials run back-to-back with a GPU-idle wait between them; the best is selected by <code>2·TTFT_pct + E2E_pct</code>.</p>
<p>One metric definition before the table, because it matters: we use <strong>"Branch-2 activation latency (TTFT proxy)"</strong>, not the textbook serving-literature "request-arrival → first-output-token" TTFT. We mean the time the second branch spends in <em>branch-specific work</em>: its activation latency <em>after</em> the shared prefill is amortized across all branches. 
In a fan-out pipeline the cost the downstream consumer perceives is exactly this number, because the upstream prefill is paid once for the whole pipeline by design. The baseline value for this metric is the redundant document prefill that the second <code>llama_context</code> is forced to redo before it can answer; the SwarmKV value is the fork + restore + short-prompt decode.</p>
<p><strong>Headline: GTX 1080, Qwen2.5-7B Q4_K_M, 3,501-token document, two branches</strong></p>
<figure>
<table>
<thead>
<tr><th>Metric</th><th>Baseline (HF-style)</th><th>SwarmKV</th><th>Delta</th></tr>
</thead>
<tbody>
<tr><td>End-to-end wall clock</td><td>10,275 ms</td><td>5,272 ms</td><td><strong>−48.69 %</strong> (<strong>~1.95×</strong>)</td></tr>
<tr><td>Branch-2 activation latency (TTFT proxy)</td><td>4,339 ms</td><td>83 ms</td><td><strong>−98.09 %</strong> (<strong>~52.3×</strong>)</td></tr>
<tr><td>Baseline Agent-1 prefill</td><td>4,346 ms</td><td>-</td><td>-</td></tr>
<tr><td>Baseline Agent-2 prefill</td><td>4,339 ms</td><td>-</td><td>-</td></tr>
<tr><td>SwarmKV per-branch decode (avg)</td><td>-</td><td>77 ms</td><td>-</td></tr>
<tr><td>Redundant prefill eliminated</td><td>-</td><td>-</td><td>8,685 ms</td></tr>
</tbody>
</table>
</figure>
<p>Translation: the baseline spent 4,339 ms of the second agent's perceived latency redoing the dense attention pass it had just finished four seconds earlier on the same bytes. SwarmKV looks at that and says "what if we didn't?" and ships an 83-millisecond answer. 
The cleanest single-number measurement of "how expensive was that prefill?" is just the ratio of those two timings; everything else in the branch is a rounding error.</p>
<h3>Where the per-branch ~83 ms goes</h3>
<p>The thesis of this whole article rests on one inequality: <em>per-branch restore + decode is far, far cheaper than a redundant document prefill.</em> The harness measures this directly at the aggregate level: the per-branch wall clock (allocate + copy + restore + decode, end to end) is <strong>71 to 83 ms</strong> depending on which branch we look at, against a redundant prefill cost of <strong>~4,339 ms</strong>. A ~52× ratio at the aggregate level is what makes everything else in this article work.</p>
<p>For more results and numbers, I recommend looking directly at the <a>example run report</a>.</p>
<h2>6. "OK, but how is this different from vLLM / prefix caching / SGLang RadixAttention?"</h2>
<p>A very reasonable question, and worth answering directly, because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.</p>
<ul>
<li><strong>vLLM / continuous batching / paged attention.</strong> Optimized for multi-tenant decode-time serving: many concurrent requests at different decode steps, scheduling the next token across them under streaming load. Headline primitive: paged attention. Unit of work: a streaming firehose of independent user prompts.</li>
<li><strong>TGI / vLLM prefix caching.</strong> Excellent if your shared prefix is request-scoped or session-scoped. 
Not designed to expose KV snapshots as first-class objects you can hand to a different <code>llama_context</code> running a different downstream task in the same process.</li>
<li><strong>SGLang RadixAttention.</strong> Tree-shaped prefix sharing within a serving runtime. It is the closest cousin, but it is a <em>server</em>, not a single-process orchestration primitive.</li>
<li><strong>llama.cpp's own state save/restore.</strong> Exists, per-context. SwarmKV is the pipeline-level glue: a DAG, a host-buffer arena sized by the engine itself, a memcpy fan-out, a <code>LlamaGuard</code> policy, and a documented no-op patiently awaiting an upstream bind API.</li>
</ul>
<h2>7. So... how do I actually try it?</h2>
<p>Well, I already posted the GitHub link at the beginning of the article. If you have come this far down, please work hard one more time and scroll back up to the top.</p>
<p>Artifacts land under <code>examples/example-run-results/</code>: <code>best_run.json</code>, <code>all_trials.csv</code>, <code>plots/*.png</code>, and a narrative <code>final_result.docx</code> that walks through methodology and limitations.</p>
<p>Requirements: Linux, CUDA toolkit, an NVIDIA GPU (Pascal or newer; consumer or datacenter both work), a GGUF model that fits in your VRAM, and the patience to read a CMake file once.</p>
<h2>8. Plot twist: this is just SIB broadcast in a transformer costume</h2>
<p>I should probably confess at this point: I am not a "GPU person" by training. 
I came up through telecom \u2014 5G NR with a foot creeping firmly into 6G research \u2014 and I started looking at LLM inference infrastructure because every problem in this codebase felt strangely familiar.<\/p>\n One-sentence decoder ring for readers without a 3GPP background:<\/strong> in a 5G network, the cell tower does not unicast network configuration to every phone individually \u2014 it broadcasts<\/em> a small set of System Information Blocks (SIB1, SIB2, \u2026) once on a shared channel, every phone in range reads the same broadcast, and per-user data rides on top of that shared context on a dedicated channel. The acronyms in the table below \u2014 MIB<\/strong> (Master Information Block, the very first thing every phone reads), PBCH<\/strong> and PDSCH<\/strong> (the shared broadcast and downlink data channels), HARQ<\/strong> (the receiver\u2019s \u201ckeep what we already decoded, only re-send what was missing\u201d retransmission protocol), and RNTI<\/strong> (the temporary ID that distinguishes one phone\u2019s traffic from another\u2019s) \u2014 are just names for the channels and identifiers that separate shared, computed once<\/em> from unique per client<\/em>. 
That distinction is the whole analogy.<\/p>\n Look at this side-by-side and tell me with a straight face these are different problems:<\/p>\n \n\n\n\n\n\n\n\n\n\n\n\n5G NR cell broadcast (on the gNB)<\/th>\n SwarmKV (on the GPU)<\/th>\n<\/tr>\n<\/thead>\n One MIB on PBCH per SS burst<\/td>\n One shared document tokenized once<\/td>\n<\/tr>\n Repeated SIBs (SIB1, SIB2, \u2026) on PDSCH<\/td>\n Serialized KV snapshot in MemoryPool<\/td>\n<\/tr>\n Every camped UE in the cell reads the same SI<\/td>\n Every analytical branch reads the same snapshot<\/td>\n<\/tr>\n UE-specific dedicated PDSCH for unicast user data<\/td>\n Per-branch llama_context<\/code> decoding the branch prompt<\/td>\n<\/tr>\n RNTI per UE distinguishes unicast streams<\/td>\n Per-branch buffer + sequence id distinguishes branch state<\/td>\n<\/tr>\n HARQ soft-buffer retained across retransmissions<\/td>\n KV snapshot retained across branches<\/td>\n<\/tr>\n Skip broadcast \u2192 every UE forces unicast SI \u2192 air interface melts<\/td>\n Skip snapshot \u2192 every branch re-prefills the document \u2192 GPU melts<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\nA quick aside to two very different audiences<\/h3>\nTo my HPC and CUDA-first friends reading this: I know. KV reuse isn’t a new idea. vLLM has prefix caching, SGLang has RadixAttention, llama.cpp itself exposes state save\/restore. SwarmKV\u2019s contribution isn’t the primitive<\/em>; it’s the single-process orchestration shape<\/em> \u2014 a tiny C++ DAG runtime that exposes \u201cprefill once, fan out N branches\u201d as a first-class operation, sized for one 8 GiB consumer GPU, with the safety rails (LlamaGuard<\/code>, swarmkv_validate_context_budget<\/code>, the documented bind no-op) that a researcher actually needs to ship a demo on a Tuesday. 
Please put the pitchforks down.<\/p>\n To my telecom friends: if \u201cKV cache\u201d sounded like a foreign language until ten minutes ago, you aren’t behind \u2014 you’re early. For twenty years our world was FPGAs, ASICs, and PRBs. We optimized spectrum, not silicon. Then AI-RAN, NWDAF, NVIDIA Aerial, the AI-RAN Alliance, and the 3GPP Rel-20 study items all happened in roughly the same eighteen months, and the next decade of telecom careers now demands being bilingual between spectrum-world and GPU-world. The intuition translates cleanly. You’ve been fanning out shared computation to many users since the first CRS pilot. Same animal, just a new zoo.<\/p>\n \n9. Honest caveats (because the comments are coming)<\/h2>\nIf you came here to find what’s wrong with the project \u2014 congratulations, the project found its first reader. From the limitations section of final_result.docx<\/code> and the inline comments in the source:<\/p>\n \nKV staging is host-side.<\/strong> MemoryPool<\/code> allocates ggml_backend_buffer_t<\/code> from the CPU device<\/strong> (ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU)<\/code>). Branch decode still runs on the GPU; only the snapshot transit is host-staged via llama_state_get_data<\/code> \u2192 memcpy \u2192 llama_state_seq_set_data<\/code>. A device-aware materialize lives on the roadmap, blocked on the same upstream KV bind API that bind_contiguous_cache<\/code> is waiting for.<\/li>\n Shared decode mutex (under the pinned upstream revision).<\/strong> LlamaGuard<\/code> serializes every llama_*<\/code> call from worker threads. 
Under the llama.cpp<\/code> revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe \u2014 the exact behaviour depends on backend, version, and graph scheduling, but in our setup the conservative choice was a global lock. The DAG-level concurrency is real, but per-request GPU compute stays sequential. This is the single biggest performance limitation in V1, and it’s exactly where Part 2 of this series picks up.<\/li>\n SwarmKV_Prefill_Ms<\/code> reports 0.<\/strong> Known instrumentation gap in how OrchestratorContext::node_name<\/code> is consumed inside PrefillNode<\/code>. The prefill ran<\/em> (you see its cost in End_To_End_Ms<\/code> and the derived effective shared-prefill cost); it’s just not being keyed correctly into timings_ms<\/code>. The effective shared prefill is calculated as SwarmKV_End_To_End_Ms \u2212 max(SwarmKV_AgentA_Ms, SwarmKV_AgentB_Ms)<\/code> \u2248 5,189 ms. Reporting bug, not correctness bug. Logged.<\/li>\n Synthetic document.<\/strong> The benchmark builds a deterministic 3,501-token document by repeating \u201cThe quick brown fox jumps over the lazy dog. \u201d until the token target is hit. This isolates the performance signal from content effects and keeps trials reproducible bit-for-bit. Real documents will produce noisier per-trial absolute timings; the structural ratios<\/em> won’t move.<\/li>\n Single GPU class.<\/strong> All numbers in the report come from one Pascal-class GTX 1080. Newer GPUs (Ada, Hopper) prefill much faster \u2014 the absolute ms numbers will shrink, but the structural ratio between full-prefill cost and short-decode cost (which is what SwarmKV exploits) doesn’t.<\/li>\n bind_contiguous_cache<\/code> is a documented no-op.<\/strong> Yes, still. 
Until upstream lands a stable external-KV attachment API, the function validates its arguments, casts them to void<\/code>, and goes home.<\/li>\n<\/ol>\nBut don\u2019t worry, everything on this list is on the roadmap. None of it changes the headline result, though. The point of putting it in writing is that you shouldn’t have to dig for it<\/em> \u2014 and the moment a benchmark blog post hides its caveats is the moment its numbers stop being trustworthy.<\/p>\n \n10. The V1 ceiling (and the setup for Part 2)<\/h2>\n<\/figure>\nSwarmKV proves that you can stop re-prefilling. But if you reread caveat #2, you’ve already spotted the next ceiling: the GPU compute itself is still serialized.<\/strong><\/p>\n Here’s what actually happens on the wall clock. The DAG-level concurrency is real \u2014 branches are real std::async<\/code> workers with real dependency gating. But every branch\u2019s llama_decode<\/code> runs inside LlamaGuard<\/code>, a single global mutex. So while the orchestration<\/em> fans out, the GPU work<\/em> lines up single file. Two branches take turns. Fifty branches take fifty turns. The GPU is never truly shared; it’s time-multiplexed by hand, one lock at a time, with no fairness guarantee and no way to measure who’s starving whom.<\/p>\n That’s fine for a two-agent demo. 
It falls apart the moment you run the workload SwarmKV is actually built for: 50 specialized micro-agents competing for one GPU.<\/strong> At that scale you stop caring about \u201cdid we avoid re-prefill\u201d and start caring about questions a hand-rolled mutex can’t answer:<\/p>\n \nWhen 50 agents want the GPU at once, who goes first, and how do we make it fair?<\/li>\n What’s the p50, p95, and p99 latency each agent sees while sharing one card?<\/li>\n How much jitter does contention add, and where does throughput collapse?<\/li>\n How do we slice GPU compute cycles on purpose instead of by accident?<\/li>\n<\/ul>\nThat’s Part 2 of this series: Time-Slicing the GPU for Concurrent Agent Swarms.<\/strong> Toy agents run sequentially in Python. Production agents run concurrently on bare metal, and managing VRAM and compute when many micro-agents share one NVIDIA GPU is its own discipline. Part 2 builds a Kubernetes-level time-slice profiler that dynamically allocates compute cycles and measures p50\/p95\/p99 latency, jitter, and throughput proxies when agentic inference workloads share a GPU via the Kubernetes Device Plugin with CUDA time-slicing. The global mutex in SwarmKV is exactly the thing it replaces with something measurable.<\/p>\n (For the curious: there’s a separate, orthogonal V1 limitation worth a future SwarmKV V2 post \u2014 the pipeline currently waits for the full<\/strong> prefill to finish before any branch starts, even if a branch only needs the first 500 tokens of context. Letting branches start the instant their required prefix slice is materialized is a real win, but it’s its own story and its own benchmark. It isn’t Part 2. 
Part 2 is about sharing the GPU across many agents; that prefill-streaming idea is a follow-up.)<\/em><\/p>\n See you in Part 2.<\/p>\n \nDisclaimer: The illustrations in this article (the hero banner, the architecture diagram, the telecom-vs-SwarmKV split panel, and the GPU time-slicing image) were generated using AI (Claude Opus 4.8). They’re illustrative, not photographic, and any labels visible within the images are stylized rather than authoritative \u2014 refer to the article body and the code itself for exact function names, metric values, and architecture details. <\/em><\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":" A humorous-but-real tour of SwarmKV \u2014 KV-snapshot fan-out, copy-on-fork host buffers, and how to make a two-agent analytical pipeline ~1.95\u00d7 faster (and the second branch\u2019s activation latency 52\u00d7 faster) by being mildly mean to llama.cpp. of the \u201cManufacturing-Grade Agentic Inference\u201d series. 
Each part removes one kind of redundant work from an agentic LLM […]<\/p>\n","protected":false}}