{"id":12771,"date":"2026-03-16T05:04:10","date_gmt":"2026-03-16T05:04:10","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=12771"},"modified":"2026-03-16T05:04:10","modified_gmt":"2026-03-16T05:04:10","slug":"p-eagle-sooner-llm-inference-with-parallel-speculative-decoding-in-vllm","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=12771","title":{"rendered":"P-EAGLE: Sooner LLM inference with Parallel Speculative Decoding in vLLM"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div id=\"\">\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2503.01840\" target=\"_blank\" rel=\"noopener noreferrer\">EAGLE<\/a> is the state-of-the-art technique for speculative decoding in massive language mannequin (LLM) inference, however its autoregressive drafting creates a hidden bottleneck: the extra tokens that you just speculate, the extra sequential ahead passes the drafter wants. Ultimately these overhead eats into your positive factors. P-EAGLE removes this ceiling by producing all Okay draft tokens in a single ahead go, delivering as much as 1.69x speedup over vanilla EAGLE-3 on actual workloads on NVIDIA B200.<\/p>\n<p>You may unlock this efficiency acquire by downloading (or coaching) a parallel-capable drafter head, including <code>\u201cparallel_drafting\u201d: true<\/code> on you vLLM serving pipeline. Pre-trained P-EAGLE heads are already obtainable on HuggingFace for <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/amazon\/gpt-oss-120b-p-eagle\" target=\"_blank\" rel=\"noopener noreferrer\">GPT-OSS 120B<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/amazon\/GPT-OSS-20B-P-EAGLE\" target=\"_blank\" rel=\"noopener noreferrer\">GPT-OSS 20B<\/a>, and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/amazon\/Qwen3-Coder-30B-A3B-Instruct-P-EAGLE\" target=\"_blank\" rel=\"noopener noreferrer\">Qwen3-Coder 30B<\/a>, so you can begin as we speak.<\/p>\n<p>On this put up, we clarify how P-EAGLE works, how we built-in it into vLLM ranging from <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/vllm-project\/vllm\/releases\/tag\/v0.16.0\" target=\"_blank\" rel=\"noopener noreferrer\">v0.16.0<\/a> (PR#32887), and the right way to serve it with our pre-trained checkpoints. 
## EAGLE's Drafting Bottleneck

EAGLE achieves 2–3× speedups over standard autoregressive decoding and is widely deployed in production inference frameworks including vLLM, SGLang, and TensorRT-LLM. EAGLE drafts tokens autoregressively: to produce K draft tokens, it requires K forward passes through the draft model. As drafter models get better at drafting long outputs, this drafting overhead becomes significant. The drafter's latency scales linearly with speculation depth, constraining how aggressively we can speculate.

## Our Approach: Parallel-EAGLE (P-EAGLE)

We present P-EAGLE, which transforms EAGLE from autoregressive to parallel draft generation. On B200 GPUs, P-EAGLE achieves 1.05×–1.69× speedup over vanilla EAGLE-3 on GPT-OSS 20B across MT-Bench, HumanEval, and SpeedBench. It is now integrated into vLLM to unlock parallel speculative decoding, and ready to accelerate real-world deployments.

P-EAGLE generates all K draft tokens in a single forward pass. Figure 2 shows the architecture, which consists of two steps.

**Step 1: Prefilling.** The target model processes the prompt and generates a new token, as it would during normal inference. Along the way, P-EAGLE captures the model's internal hidden states: `h_prompt` for each prompt position, and `h_context` for the newly generated token. These hidden states encode what the target model "knows" at each position and will guide the drafter's predictions. This step is identical to autoregressive EAGLE.

**Step 2: P-EAGLE Drafter.** The drafter constructs inputs for every position in parallel. Each input consists of a token embedding concatenated with a hidden state.

For prompt positions, the input pairs each prompt token embedding `emb(p)` with its corresponding `h_prompt` from the target model. Following the same convention as autoregressive EAGLE, positions are shifted by one: position i receives the token and hidden state from position i-1, enabling it to predict the token at position i.

For position 1, Next-Token-Prediction (NTP), the input pairs the newly generated token embedding `emb(new)` with `h_context`. This position operates identically to standard autoregressive EAGLE. For positions 2 through K, Multi-Token-Prediction (MTP), the required inputs (the token embedding and hidden state) do not yet exist. P-EAGLE fills these with two learnable parameters: a shared mask token embedding `emb(mask)` and a shared hidden state `h_shared`. These are fixed vectors learned during training that serve as neutral placeholders.

All positions pass together through N transformer layers, then through the language model head to predict draft tokens t1, t2, t3, and t4 in a single forward pass.

*Figure 2: P-EAGLE architecture overview.*
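To make Step 2 concrete, here is a small tensor-level sketch of the input construction. All names and dimensions are illustrative stand-ins for the components described above, not the real implementation:

```python
import torch

# Illustrative sizes only; not the real model's dimensions.
P, K, d = 6, 4, 16               # prompt length, draft tokens per pass, hidden size

emb_prompt = torch.randn(P, d)   # emb(p) for each prompt token
h_prompt   = torch.randn(P, d)   # target-model hidden state per prompt position
emb_new    = torch.randn(1, d)   # embedding of the newly generated token
h_context  = torch.randn(1, d)   # hidden state captured for that token

emb_mask   = torch.randn(1, d)   # learned shared mask-token embedding, emb(mask)
h_shared   = torch.randn(1, d)   # learned shared placeholder hidden state

# Prompt positions, shifted by one: position i gets token/state from i-1.
prompt_inputs = torch.cat([emb_prompt[:-1], h_prompt[:-1]], dim=-1)     # (P-1, 2d)

# Position 1 (NTP): the new token paired with h_context, as in vanilla EAGLE.
ntp_input = torch.cat([emb_new, h_context], dim=-1)                     # (1, 2d)

# Positions 2..K (MTP): shared learned placeholders broadcast to K-1 slots.
mtp_inputs = torch.cat([emb_mask, h_shared], dim=-1).expand(K - 1, -1)  # (K-1, 2d)

# One batch through the drafter's transformer layers and LM head yields
# draft tokens t1..tK in a single forward pass.
drafter_inputs = torch.cat([prompt_inputs, ntp_input, mtp_inputs], dim=0)
print(drafter_inputs.shape)  # torch.Size([P - 1 + K, 2 * d])
```

Because the MTP placeholders are shared learned vectors rather than outputs of earlier draft steps, every draft slot can be computed at once; position alone distinguishes which future token each slot predicts.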
### Training P-EAGLE on Long Sequences

Modern reasoning models produce long outputs. As shown in Figure 3, GPT-OSS 120B generates sequences (including prompts) with a median length of 3,891 tokens and a P90 of 10,800 tokens on the UltraChat dataset. Draft models must be trained on matching context lengths to be effective at inference.

*Figure 3: Sequence length (prompt + generation) distribution on UltraChat dataset with GPT-OSS 120B. Reasoning level: Medium.*

A key challenge is that parallel drafting amplifies memory requirements during training. Training K parallel groups on a sequence of length N creates N × K total positions. With N = 8,192 and K = 8, a single training example contains 65,536 positions. Attention requires every position to attend to every valid position: a 65K × 65K attention matrix holds over 4 billion elements, consuming 8 GB in bf16.

Position sampling [[An et al., 2025](https://arxiv.org/pdf/2504.18583)] reduces memory by randomly skipping positions, but skipping too aggressively degrades draft quality. Gradient accumulation is the standard solution for memory-constrained training, but it splits work across different training examples. When a single sequence exceeds memory, there is nothing to split.

P-EAGLE introduces a sequence partition algorithm for intra-sequence splitting. The algorithm divides the N × K position sequence into contiguous chunks, maintains correct attention dependencies across chunk boundaries, and accumulates gradients across chunks of the same sequence. For details, see the [P-EAGLE paper](https://arxiv.org/pdf/2602.01469).
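In spirit, this is gradient accumulation applied within one sequence instead of across examples. A toy sketch of that idea follows; `drafter` and `loss_fn` are placeholders, and the real algorithm additionally carries attention state across chunk boundaries, which is elided here:

```python
def train_on_long_sequence(drafter, loss_fn, positions, optimizer, chunk_size):
    """Toy intra-sequence gradient accumulation.

    `positions` stands in for the N*K drafting positions of one training
    example; `drafter` and `loss_fn` are placeholders for the draft model
    and its training loss.
    """
    optimizer.zero_grad()
    num_chunks = (positions.size(0) + chunk_size - 1) // chunk_size
    for start in range(0, positions.size(0), chunk_size):
        chunk = positions[start : start + chunk_size]
        # Each chunk fits in memory on its own; attention to earlier
        # chunks would be served from their cached KV state (omitted).
        loss = loss_fn(drafter(chunk)) / num_chunks
        loss.backward()   # gradients accumulate across chunks
    optimizer.step()      # a single update for the whole sequence
```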
### Implementation in vLLM

### Parallel drafting challenges

In many speculative decoding setups, drafting and verification share the same per-request token layout. That's largely true for EAGLE: the drafter consumes a window that already matches what the verifier will check, K drafted tokens plus one extra sampled token.

Parallel drafting breaks that consistency. To predict K tokens in a single drafter forward pass, we append MASK placeholders (for example, [token, MASK, MASK, …]). These extra positions exist only for drafting, so the draft batch shape no longer matches the verification batch shape. Because we can't reuse the verification metadata, we must rebuild the batch metadata: we expand the input token IDs, hidden states, and positions to insert slots for mask tokens/embeddings, increment positions per request, then recompute the slot mapping and per-request start indices from the updated positions.

## The Triton Kernel

To offset the overhead of rebuilding the batch metadata, we implement a fused Triton kernel that populates the drafter's input batch on-GPU by copying and expanding the target-model batch. In a single pass, the kernel copies the previous token IDs and positions from the target batch into new destination slots and inserts the per-request bonus token sampled by the target model. It then fills the extra parallel-drafting slots with a special MASK token ID. Finally, it generates lightweight metadata: a rejected-token mask, a masked-token mask for parallel-drafting slots, new-token indices for sampling draft tokens, and a hidden-state mapping.

This logic would otherwise be many GPU ops (copy/scatter + insert + fill + mask + remap). Fusing it into one kernel reduces launch overhead and extra memory traffic, keeping the drafting setup cheap.
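For intuition, here is roughly what the kernel fuses, written as separate PyTorch ops for a single request. The names, shapes, and mask token ID are illustrative assumptions, not the kernel's actual interface:

```python
import torch

# Illustrative constants; the real MASK token ID and K come from the model/config.
MASK_TOKEN_ID = 0
K = 4  # draft tokens per forward pass

# Target-model batch for one request: tokens kept after verification,
# plus the bonus token the target model sampled.
accepted_ids = torch.tensor([101, 57, 9])
bonus_id = torch.tensor([412])

# 1) Copy previous token IDs and insert the bonus token (copy/scatter + insert).
draft_ids = torch.cat([accepted_ids, bonus_id])

# 2) Fill the extra parallel-drafting slots with the MASK token ID (fill).
draft_ids = torch.cat([draft_ids, torch.full((K - 1,), MASK_TOKEN_ID)])

# 3) Recompute positions for the expanded batch (remap).
positions = torch.arange(draft_ids.numel())

# 4) Lightweight metadata, e.g. which slots are mask placeholders (mask).
is_masked = torch.zeros_like(draft_ids, dtype=torch.bool)
is_masked[-(K - 1):] = True
```

Issued separately like this, each step is its own kernel launch and a round trip through global memory; the fused kernel does all of it in one launch.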
## Hidden State Management

For EAGLE-based methods that pass hidden states to the draft model, parallel drafting populates these fields separately. Since hidden states are significantly larger than the rest of the input batch, we split the work: the Triton kernel outputs a mapping, and a dedicated copy kernel broadcasts the learned hidden-state placeholder into the mask token slots.

```python
# Copy target hidden states to their new positions
self.hidden_states[out_hidden_state_mapping] = target_hidden_states

# Fill masked positions with the learned parallel-drafting hidden state
mask = self.is_masked_token_mask[:total_num_output_tokens]
torch.where(
    mask.unsqueeze(1),
    self.parallel_drafting_hidden_state_tensor,
    self.hidden_states[:total_num_output_tokens],
    out=self.hidden_states[:total_num_output_tokens],
)
```

The `parallel_drafting_hidden_state_tensor` is loaded from the model's `mask_hidden` buffer, a learned representation that tells the model these positions should predict future tokens.

For KV cache slot mapping, valid tokens receive normal slot assignment while rejected tokens are mapped to `PADDING_SLOT_ID (-1)` to prevent spurious cache writes. For CUDA graphs, we extend the capture range by K × max_num_seqs to accommodate the larger draft batch introduced by parallel drafting.

## vLLM Benchmarking on P-EAGLE

We train P-EAGLE on GPT-OSS-20B and evaluate across three benchmarks: [MT-Bench](https://arxiv.org/abs/2402.14762) for multi-turn instruction following, [SPEED-Bench](https://huggingface.co/datasets/nvidia/SPEED-Bench) Code for long-form code generation, and [HumanEval](https://github.com/openai/human-eval) for function-level code synthesis. P-EAGLE delivers 55–69% higher throughput at low concurrency (c=1), with gains of 5–25% sustained at high concurrency (c=64), compared to the publicly available vanilla [EAGLE-3 checkpoint](https://huggingface.co/RedHatAI/gpt-oss-20b-speculator.eagle3). Results are shown in Figures 4–6.

The P-EAGLE drafter is a lightweight 4-layer model trained to predict up to 10 tokens in parallel. To evaluate performance, we sweep speculation depths K ∈ {3,5,7} across concurrency levels C ∈ {1,2,4,8,16,32,64}. Our goal is to identify the best deployment configuration for both P-EAGLE and vanilla EAGLE-3; linear drafting is used for both. In this context, "best P-EAGLE" and "best EAGLE-3" refer to the configurations that achieve peak throughput, measured in tokens per second (TPS), for a given speculation depth K. For each method, we select the K that maximizes TPS under the given serving conditions, using a sweep like the one sketched below.
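A minimal sketch of such a sweep driver, not taken from the post itself. It assumes the server is relaunched for each K (since `num_speculative_tokens` is a server-side setting) and reuses the MT-Bench flags from the "Reproducing the Results" section later in this post:

```python
import itertools
import subprocess

MODEL = "openai/gpt-oss-20b"
BASE_URL = "http://localhost:8000"

for k, c in itertools.product([3, 5, 7], [1, 2, 4, 8, 16, 32, 64]):
    # ... relaunch `vllm serve` here with "num_speculative_tokens": k ...
    subprocess.run(
        ["vllm", "bench", "serve",
         "--dataset-name", "hf",
         "--dataset-path", "philschmid/mt-bench",
         "--num-prompts", "80",
         "--max-concurrency", str(c),
         "--model", MODEL,
         "--base-url", BASE_URL,
         "--temperature", "0.0",
         "--hf-output-len", "2048"],
        check=True,
    )
    # Read TPS from the benchmark output and keep the best (K, C) per method.
```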
A consistent pattern emerges. P-EAGLE achieves peak TPS at K=7 across all concurrency levels. In contrast, vanilla EAGLE-3 reaches its highest TPS at K=3, with its optimal depth occasionally shifting toward higher values depending on concurrency. This behavior reflects a fundamental advantage of parallel drafting: P-EAGLE generates all K draft tokens in a single forward pass, allowing it to benefit from deeper speculation without incurring additional sequential overhead. Autoregressive drafters, by contrast, must generate speculative tokens step by step, which limits their ability to scale efficiently to larger K.

All experiments are conducted on one NVIDIA B200 (Blackwell) GPU using vLLM with the following serving configuration:

```bash
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
vllm serve openai/gpt-oss-20b \
    --speculative-config '{
      "method": "eagle3",
      "model": "amazon/GPT-OSS-20B-P-EAGLE",
      "num_speculative_tokens": 7,
      "parallel_drafting": true}' \
    --port 8000 \
    --max-num-seqs 1024 \
    --max-model-len 100000 \
    --max-num-batched-tokens 100000 \
    --max-cudagraph-capture-size 4096 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --kv-cache-dtype fp8 \
    --async-scheduling \
    --stream-interval 20
```

**Note.** Serving GPT-OSS-20B with EAGLE drafters currently requires a one-line vLLM patch ([PR#36684](https://github.com/vllm-project/vllm/pull/36684)). Apply it before launching. This fix is expected to land in an upcoming vLLM release.

*Figure 4: MT-Bench throughput (TPS) for P-EAGLE vs EAGLE-3 on GPT-OSS-20B across concurrency levels. The P/E speedup ratios are: 1.55x (c=1), 1.29x (c=2), 1.35x (c=4), 1.28x (c=8), 1.27x (c=16), 1.09x (c=32), and 1.05x (c=64).*

*Figure 5: HumanEval throughput (TPS) for P-EAGLE vs EAGLE-3 on GPT-OSS-20B across concurrency levels. The P/E speedup ratios are: 1.55x (c=1), 1.53x (c=2), 1.45x (c=4), 1.35x (c=8), 1.31x (c=16), 1.37x (c=32), and 1.23x (c=64).*
*Figure 6: SPEED-Bench throughput (TPS) for P-EAGLE vs EAGLE-3 on GPT-OSS-20B across concurrency levels. The P/E speedup ratios are: 1.69x (c=1), 1.61x (c=2), 1.54x (c=4), 1.45x (c=8), 1.40x (c=16), 1.22x (c=32), and 1.25x (c=64).*

In addition to reducing drafting overhead, P-EAGLE's throughput gains are also driven by higher acceptance length (AL), the average number of draft tokens accepted by the verifier per speculation round. Higher AL means more of the draft work becomes real output, which directly boosts effective throughput (OTPS/TPS).

The following tables compare AL for P-EAGLE and vanilla EAGLE-3 on GPT-OSS-20B across our three benchmarks.

P-EAGLE (AL):

| Config | HumanEval | SPEED-Bench | MT-Bench |
|--------|-----------|-------------|----------|
| K=3    | 3.02      | 2.87        | 2.87     |
| K=7    | 3.94      | 3.38        | 3.70     |

EAGLE-3 (AL):

| Config | HumanEval | SPEED-Bench | MT-Bench |
|--------|-----------|-------------|----------|
| K=3    | 2.65      | 2.24        | 2.70     |
| K=7    | 3.03      | 2.59        | 3.27     |
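The relative gains quoted in the next paragraph follow directly from these tables; a quick check at K=7:

```python
# AL at K=7 from the tables above: (P-EAGLE, EAGLE-3) per benchmark.
pairs = {
    "HumanEval": (3.94, 3.03),
    "SPEED-Bench": (3.38, 2.59),
    "MT-Bench": (3.70, 3.27),
}
for name, (p_eagle, eagle3) in pairs.items():
    print(f"{name}: +{(p_eagle / eagle3 - 1) * 100:.0f}%")
# Prints: HumanEval: +30%, SPEED-Bench: +31%, MT-Bench: +13%
```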
P-EAGLE consistently achieves higher AL than EAGLE-3 at the same speculation depth K. At K=7, P-EAGLE outperforms EAGLE-3 by 30% on HumanEval (3.94 vs 3.03), 31% on SPEED-Bench (3.38 vs 2.59), and 13% on MT-Bench (3.70 vs 3.27). Notably, P-EAGLE benefits more from deeper speculation. From K=3 to K=7, P-EAGLE's AL increases by 0.92 on HumanEval (3.02 to 3.94), while EAGLE-3 gains only 0.38 (2.65 to 3.03). This widening gap at higher K is consistent with P-EAGLE's single-pass parallel drafting, which incurs no additional cost from deeper speculation.

**Reproducing the Results**

After launching the server, run benchmarks with `vllm bench serve`:

```bash
# MT-Bench
export MODEL="openai/gpt-oss-20b"
export BASE_URL="http://localhost:8000"
vllm bench serve \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80 \
    --max-concurrency 1 \
    --model $MODEL \
    --base-url $BASE_URL \
    --temperature 0.0 \
    --hf-output-len 2048

# HumanEval command:
# Download the HumanEval dataset openai/openai_humaneval
vllm bench serve \
    --dataset-name custom \
    --dataset-path <dataset path> \
    --num-prompts 164 \
    --max-concurrency 1 \
    --model $MODEL \
    --base-url $BASE_URL \
    --temperature 0.0 \
    --custom-output-len 2048
```

P-EAGLE removes the sequential bottleneck from speculative decoding, delivering up to 1.69× speedup over vanilla EAGLE-3 on real workloads. By decoupling draft count from forward-pass count, we can now explore larger drafting architectures, which may even enable higher acceptance rates compared to single-layer baselines. The implementation carefully handles the complexities of input preparation, attention metadata management, and KV cache slot mapping through hand-written fused kernels. While it requires specially trained models, the performance benefits make it a valuable addition to vLLM's speculative decoding capabilities.

As more parallel-trained models become available, we expect this approach to become the preferred choice for production LLM deployments. The combination of P-EAGLE's architectural efficiency and vLLM's robust infrastructure provides a clear path for those seeking maximum inference performance and reduced latency.

Try it today: download a pre-trained P-EAGLE head from HuggingFace, set `"parallel_drafting": true` in your vLLM config for any of the supported models, and see the speedup for yourself.

### Acknowledgement

We would like to acknowledge our contributors and collaborators from NVIDIA, Xin Li, Kaihang Jiang, and Omri Almog, and our team members Ashish Khetan and George Karypis.

## About the authors

**Dr. Xin Huang**
Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the areas of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers at ACL, ICDM, and KDD conferences, and in the Royal Statistical Society.

**Dr. Florian Saupe**

Dr. Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research working on model inference optimization, large-scale distributed training, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist, a field in which he holds a PhD.

**Jaime Campos Salas**

Jaime Campos Salas is a Senior Machine Learning Engineer at AWS, where he works on advancing LLM inference efficiency. His current focus is integrating novel speculative decoding techniques into production serving systems. Before joining the research team, he helped build Amazon Bedrock's model serving infrastructure, shipping quantization, vision-language model support, and the platform's first open-source inference container.

**Benjamin Chislett**

Benjamin Chislett is a vLLM maintainer at NVIDIA who focuses on inference runtime optimization, including speculative decoding and asynchronous execution.

**Max Xu**

Max Xu is a Senior Engineer and Tech Lead at NVIDIA, specializing in AI platform software. He brings full-stack GPU expertise spanning chip design, kernel-level optimization, and data center-scale training, inference, and post-training, translating cutting-edge innovations into real-world impact.
Previously, Max held engineering roles at Amazon, Marvell, and AMD. He earned an M.Sc. in Electrical Engineering from the University of Southern California (VLSI focus) and a B.Sc. in Automation from Beijing Institute of Technology.

**Zeyuan (Faradawn) Yang**

Zeyuan (Faradawn) Yang is a Technical Marketing Engineer at NVIDIA. He earned his M.S. in Computer Science from the University of Chicago, where his thesis research focused on GNN-based cache system acceleration, with related work presented at the PyTorch Conference 2025. His expertise spans large-scale LLM inference, AI systems performance, and graph-based optimization. Since joining the NVIDIA AI Platform Software team, he has driven the optimization of LLM inference systems and developer-facing AI infrastructure.
"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69d9690a190636c2e0989534. Config Timestamp: 2026-04-10 21:18:02 UTC, Cached Timestamp: 2026-05-06 16:48:08 UTC -->