This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models at ICLR 2026.
Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g., a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
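To make the selection idea concrete, below is a minimal Python sketch of loss-only data selection. It is illustrative rather than the paper's algorithm: the abstract only states that selection is based on the training loss alone, so the function name `loss_based_selection`, the threshold and keep-fraction parameters, and the heuristic that frequently repeated facts show up as low-loss examples are all assumptions made for this example.

```python
import numpy as np


def loss_based_selection(losses, loss_threshold=None, keep_fraction=0.5):
    """Hypothetical loss-only data selection (illustrative sketch).

    Assumed intuition: frequent facts are memorized early in training, so
    their examples reach low loss, while rare facts retain higher loss.
    Discarding low-loss examples therefore caps the number of repeated
    facts and flattens the effective fact-frequency distribution, using
    nothing but the per-example training loss.

    losses         : 1-D array of per-example losses from an early checkpoint
    loss_threshold : if given, keep only examples with loss above this value
    keep_fraction  : otherwise keep the top `keep_fraction` highest-loss examples
    """
    losses = np.asarray(losses, dtype=float)
    if loss_threshold is not None:
        keep = np.flatnonzero(losses > loss_threshold)
    else:
        n_keep = max(1, int(len(losses) * keep_fraction))
        keep = np.argsort(-losses)[:n_keep]  # highest-loss examples first
    return np.sort(keep)  # indices into the training set


# Toy usage: a skewed (power-law-like) fact distribution shows up as a large
# block of already-memorized, low-loss duplicates plus a high-loss tail.
rng = np.random.default_rng(0)
losses = np.concatenate([
    rng.uniform(0.0, 0.5, 700),  # frequent, already-memorized facts
    rng.uniform(1.0, 3.0, 300),  # rare, high-entropy facts
])
kept = loss_based_selection(losses, keep_fraction=0.3)
print(len(kept), "examples kept out of", len(losses))
```

Under these assumptions, the retained subset over-represents rare facts and prunes redundant repetitions of common ones, which is the qualitative effect the abstract attributes to the proposed selection schemes.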







