Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of a pseudo-random generator to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama3 70B, which is particularly challenging, show that zero-shot accuracy retention at 4- and 3-bit compression is on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
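To make the encode/reconstruct idea concrete, the sketch below illustrates it under stated assumptions; it is not the paper's implementation. The LFSR width and taps, the block size, the brute-force seed search, and the use of unquantized least-squares coefficients are all illustrative choices, and every function name and parameter here is hypothetical.

```python
# Minimal sketch (not the authors' code) of SeedLM-style per-block compression, assuming:
#  - a 16-bit Fibonacci LFSR as the pseudo-random generator (taps chosen for illustration),
#  - each weight block w (length C) is approximated as U @ t, where U is an LFSR-generated
#    C x P matrix and t holds a few coefficients (the real method would also quantize t).
import numpy as np

def lfsr_bits(seed: int, n_bits: int, width: int = 16, taps=(16, 15, 13, 4)):
    """Generate n_bits pseudo-random bits from a Fibonacci LFSR."""
    state = seed & ((1 << width) - 1)
    assert state != 0, "LFSR seed must be non-zero"
    out = np.empty(n_bits, dtype=np.uint8)
    for i in range(n_bits):
        out[i] = state & 1
        fb = 0
        for t in taps:                      # XOR the tapped bits to form the feedback bit
            fb ^= (state >> (t - 1)) & 1
        state = (state >> 1) | (fb << (width - 1))
    return out

def lfsr_matrix(seed: int, rows: int, cols: int, bits_per_entry: int = 3):
    """Map LFSR bits to a rows x cols matrix with small, roughly zero-mean entries."""
    raw = lfsr_bits(seed, rows * cols * bits_per_entry).reshape(rows * cols, bits_per_entry)
    vals = raw @ (1 << np.arange(bits_per_entry))               # pack bits into integers
    centered = vals.astype(np.float32) - (2 ** (bits_per_entry - 1) - 0.5)
    return centered.reshape(rows, cols)

def reconstruct_block(seed: int, coeffs: np.ndarray, block_len: int) -> np.ndarray:
    """Decode: regenerate U from the stored seed and linearly combine with the coefficients."""
    U = lfsr_matrix(seed, block_len, coeffs.shape[0])
    return U @ coeffs

def compress_block(w: np.ndarray, num_coeffs: int = 4, num_seeds: int = 256):
    """Encode: brute-force the best seed, then least-squares fit the coefficients."""
    best = None
    for seed in range(1, num_seeds + 1):
        U = lfsr_matrix(seed, w.shape[0], num_coeffs)
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = np.linalg.norm(U @ t - w)
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best[1], best[2]                 # store only a seed and a few coefficients per block

# Usage: compress and reconstruct one 8-weight block
w = np.random.randn(8).astype(np.float32)
seed, t = compress_block(w)
w_hat = reconstruct_block(seed, t, block_len=8)
print("relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

At inference time only `reconstruct_block` is needed, which is why the scheme trades cheap on-the-fly pseudo-random generation for fewer memory accesses: the stored state per block is just a seed plus a handful of coefficients.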