Supercharging LLM inference on Google TPUs: Reaching 3X speedups with diffusion-style speculative decoding
The current landscape of Large Language Model (LLM) acceleration is dominated by autoregressive speculative decoding, where a lightweight drafter ...
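To make the draft-and-verify idea behind speculative decoding concrete, here is a minimal toy sketch. It is illustrative only: `draft_model` and `target_model_next` are stand-ins following a fixed arithmetic pattern rather than real language models, and real systems use probabilistic acceptance over token distributions, not this exact-match check.

```python
def draft_model(prefix, k):
    # Hypothetical cheap drafter: proposes the next k tokens by following
    # a simple +1 pattern (a stand-in for a small, fast LM).
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def target_model_next(prefix):
    # Hypothetical expensive target model: the "ground truth" next token.
    # It follows the same +1 pattern except at every 5th position, where
    # it deviates -- forcing the verifier to reject the drafter there.
    nxt = (prefix[-1] + 1) % 100
    return nxt if len(prefix) % 5 else (nxt + 7) % 100

def speculative_decode(prompt, max_new, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        draft = draft_model(tokens, k)
        # Verify the draft left to right against the target model and keep
        # the longest agreeing prefix.
        accepted = []
        for t in draft:
            if target_model_next(tokens + accepted) == t:
                accepted.append(t)
            else:
                break
        # Append one token from the target model (the correction, or a
        # "bonus" token when the whole draft was accepted).
        accepted.append(target_model_next(tokens + accepted))
        tokens.extend(accepted[: len(prompt) + max_new - len(tokens)])
    return tokens
```

Each loop iteration can accept up to `k + 1` tokens at the cost of one target-model verification pass, which is the source of the speedup over plain one-token-at-a-time decoding.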







