The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score, defined as the classifier's score, and retains only the top-scoring documents. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily improve language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to that of models trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
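The CQF procedure described above (train a binary classifier, score each pretraining document, keep the top-scoring ones) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the bag-of-words features with whitespace tokenization, the logistic-regression classifier trained by plain SGD, training on a small sample of the pretraining data, and the hypothetical names `train_classifier`, `quality_score`, `cqf_filter`, and `keep_fraction`.

```python
import math
from collections import Counter


def featurize(doc):
    """Bag-of-words token counts (whitespace tokenization is an assumption)."""
    return Counter(doc.lower().split())


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(feats, weights, bias):
    """Linear score of a featurized document; unseen tokens contribute 0."""
    return bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())


def train_classifier(pretrain_sample, hq_docs, epochs=20, lr=0.1):
    """Fit logistic regression with SGD.

    Label 1 = document from the high-quality set,
    label 0 = document from the pretraining data.
    """
    data = [(featurize(d), 0.0) for d in pretrain_sample]
    data += [(featurize(d), 1.0) for d in hq_docs]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in data:
            # Gradient of the log loss with respect to the logit.
            grad = sigmoid(logit(feats, weights, bias)) - label
            bias -= lr * grad
            for tok, cnt in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * grad * cnt
    return weights, bias


def quality_score(doc, weights, bias):
    """Quality score = classifier probability that doc is 'high quality'."""
    return sigmoid(logit(featurize(doc), weights, bias))


def cqf_filter(pretrain_docs, weights, bias, keep_fraction=0.5):
    """Rank documents by quality score and keep the top fraction."""
    ranked = sorted(
        pretrain_docs,
        key=lambda d: quality_score(d, weights, bias),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

In this sketch the score only measures how much a document resembles the high-quality set relative to the pretraining sample. That is the quantity whose interpretation as "quality" the abstract calls into question.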
