{"id":10847,"date":"2026-01-16T19:30:40","date_gmt":"2026-01-16T19:30:40","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=10847"},"modified":"2026-01-16T19:30:40","modified_gmt":"2026-01-16T19:30:40","slug":"the-knowledge-high-quality-phantasm-rethinking-classifier-primarily-based-high-quality-filtering-for-llm-pretraining","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=10847","title":{"rendered":"The Knowledge-High quality Phantasm: Rethinking Classifier-Primarily based High quality Filtering for LLM Pretraining"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p>Massive-scale fashions are pretrained on large web-crawled datasets containing paperwork of combined high quality, making information filtering important. A well-liked technique is Classifier-based High quality Filtering (CQF), which trains a binary classifier to differentiate between pretraining information and a small, high-quality set. It assigns every pretraining doc a top quality rating outlined because the classifier\u2019s rating and retains solely the top-scoring ones. We offer an in-depth evaluation of CQF. We present that whereas CQF improves downstream process efficiency, it doesn&#8217;t essentially improve language modeling on the high-quality dataset. We clarify this paradox by the truth that CQF implicitly filters the high-quality dataset as properly. We additional evaluate the conduct of fashions skilled with CQF to these skilled on artificial information of accelerating high quality, obtained through random token permutations, and discover starkly totally different developments. Our outcomes problem the view that CQF captures a significant notion of knowledge high quality.<\/p>\n<ul class=\"links-stacked\">\n<li>\u2021 Work finished whereas at Apple<\/li>\n<li>\u00a7 Oxford College<\/li>\n<\/ul>\n<figure id=\"figure1\" class=\"\" aria-label=\"Figure 1\">\n<div class=\"bg-gray-light text-base rounded\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mlr.cdn-apple.com\/media\/fig_cds_iclr_converted_41565e3a83.png\" aria-label=\"Diagram showing the Classifier-based Quality Filtering pipeline, including document embeddings from sBert, Artic-Embed, or FastText, classifier training, and ranking to form the filtered CQF dataset.\" tabindex=\"-1\" target=\"_blank\" class=\"mt-0\"><img decoding=\"async\" src=\"https:\/\/mlr.cdn-apple.com\/media\/fig_cds_iclr_converted_41565e3a83.png\" alt=\"Diagram showing the Classifier-based Quality Filtering pipeline, including document embeddings from sBert, Artic-Embed, or FastText, classifier training, and ranking to form the filtered CQF dataset.\" loading=\"lazy\" class=\"bg-gray-light\"\/><\/a><\/div><figcaption class=\"muted\" aria-hidden=\"true\">Determine 1: Classifier-based High quality Filtering (CQF) pipeline. A doc embedding mannequin (e.g. sBert, Artic-Embed, or FastText) embeds paperwork from a high-quality dataset and the pretraining set. A binary classifier is skilled on these embeddings to differentiate the HQ set from the pretraining set. Scores assigned by the classifier are used to rank paperwork from the pretraining set. The highest ok fraction of these paperwork constitutes the brand new filtered CQF dataset.<\/figcaption><\/figure>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Massive-scale fashions are pretrained on large web-crawled datasets containing paperwork of combined high quality, making information filtering important. 