{"id":12923,"date":"2026-03-20T19:29:23","date_gmt":"2026-03-20T19:29:23","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=12923"},"modified":"2026-03-20T19:29:23","modified_gmt":"2026-03-20T19:29:23","slug":"lumberchunker-lengthy-type-narrative-doc-segmentation-machine-studying-weblog-mlcmu","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=12923","title":{"rendered":"LumberChunker: Lengthy-Type Narrative Doc Segmentation \u2013 Machine Studying Weblog | ML@CMU"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p><strong>Hyperlinks:<\/strong><br \/><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2406.17526\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> | <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/joaodsmarques\/LumberChunker\" target=\"_blank\" rel=\"noreferrer noopener\">Code<\/a> | <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/LumberChunker\" target=\"_blank\" rel=\"noreferrer noopener\">Knowledge<\/a><\/p>\n<p>LumberChunker lets an LLM determine the place a protracted story must be cut up, creating extra pure chunks that assist Retrieval Augmented Era (RAG) techniques retrieve the proper info.<\/p>\n<h2>Introduction<\/h2>\n<p>Lengthy-form narrative paperwork often have an specific construction, reminiscent of chapters or sections, however these models are sometimes too broad for retrieval duties. At a decrease stage, vital semantic shifts occur inside these bigger segments with none seen structural break. After we cut up textual content solely by formatting cues, like paragraphs or mounted token home windows, passages that belong to the identical narrative unit could also be separated, whereas unrelated content material could be grouped collectively. This misalignment between construction and which means produces chunks that include incomplete or combined context, which reduces retrieval high quality and impacts downstream RAG efficiency. 
For this reason, segmentation should aim to create chunks that are semantically independent, rather than relying solely on document structure.<\/p>\n<p><strong>So how do we preserve the story\u2019s flow and still keep chunking practical?<\/strong><\/p>\n<p>In many cases, a reader can easily recognize where the narrative begins to shift\u2014for example, when the text moves to a different scene, introduces a new entity, or changes its purpose. The problem is that most automated chunking methods don&#8217;t take this semantic signal into account and instead rely only on surface structure. As a result, they may produce segmentations that look reasonable from a formatting perspective but break the underlying narrative coherence.<\/p>\n<p>To make this concrete, read the short passage below and identify the optimal chunking boundary!<\/p>\n<section id=\"quiz\" style=\"margin-top: 0; margin-bottom: 0; padding-top: 1rem; padding-bottom: 1rem;\">\n<div style=\"max-width: 1344px; margin: 0 auto;\">\n<div class=\"quiz-section\">\n<div class=\"quiz-container\" id=\"quizContainer\">\n<p><span class=\"step-number\">1<\/span> Read the passage<\/p>\n<div class=\"interaction-section\">\n<p>\n                            <button class=\"quiz-btn submit-btn\" id=\"submitBtn\">Submit Answer<\/button><br \/>\n                            <button class=\"quiz-btn back-btn\" id=\"backBtn\" style=\"display: none;\">\u2190 Try Again<\/button>\n                        <\/p>\n<\/div><\/div><\/div><\/div>\n<\/section>\n<h2>The LumberChunker Method<\/h2>\n<p>In the example above, Option C provides the most coherent segmentation. 
The boundary aligns with the point where the narrative becomes semantically independent from the preceding context.<\/p>\n<p>Our goal is to make this kind of segmentation decision practical at scale. The challenge is that human-quality boundary detection requires understanding narrative context, which is costly to apply across thousands of paragraphs in long-form documents.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"406\" src=\"https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-1024x406.png\" alt=\"LumberChunker pipeline overview\" class=\"wp-image-22327\" srcset=\"https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-1024x406.png 1024w, https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-300x119.png 300w, https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-1536x609.png 1536w, https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-2048x812.png 2048w, https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-970x384.png 970w, https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-320x127.png 320w, https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-80x32.png 80w, https:\/\/blog.ml.cmu.edu\/wp-content\/uploads\/2026\/02\/LumberChunker_pipeline-300x119@2x.png 600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"\/><\/figure>\n<p><strong>LumberChunker<\/strong> approaches this by treating segmentation as a boundary-finding problem: given a short sequence of consecutive paragraphs, we ask a language model to identify the earliest point where the content clearly shifts. This formulation allows segments to vary in length while remaining aligned with the underlying narrative structure. 
In practice, LumberChunker consists of these steps:<\/p>\n<h3>1) Document Paragraph Extraction<\/h3>\n<p>Cleanly split the book into paragraphs and assign stable IDs (<code>ID:1, ID:2, \u2026<\/code>). This preserves the document\u2019s natural discourse units and gives us stable candidate boundaries.<\/p>\n<blockquote class=\"wp-block-quote\">\n<p><strong>Example:<\/strong> From a novel, we extract:<\/p>\n<p><code>ID:1<\/code> \u201cThe morning sun filtered through the dusty windows\u2026\u201d<br \/><code>ID:2<\/code> \u201cShe walked slowly to the door, hesitating\u2026\u201d<br \/><code>ID:3<\/code> \u201cMeanwhile, across town, Detective Morrison reviewed the case files\u2026\u201d<br \/><code>ID:4<\/code> \u201cThe previous night\u2019s events had left him puzzled\u2026\u201d<\/p>\n<p>Each paragraph gets a unique ID for tracking boundaries.<\/p>\n<\/blockquote>\n<h3>2) Grouping IDs for the LLM<\/h3>\n<p>Build a group <code>G_i<\/code> by appending paragraphs until the group\u2019s length reaches a token budget <code>\u03b8<\/code>. This provides enough context for the model to judge when a topic\/scene actually shifts.<\/p>\n<blockquote class=\"wp-block-quote\">\n<p><strong>Example:<\/strong> With <code>\u03b8 = 550<\/code> tokens, we build, for example:<\/p>\n<p><code>G_1<\/code> = [<code>ID:1<\/code>, <code>ID:2<\/code>, <code>ID:3<\/code>, <code>ID:4<\/code>, <code>ID:5<\/code>, <code>ID:6<\/code>]<\/p>\n<p>This window, by spanning multiple paragraphs, increases the chance that at least one meaningful narrative shift is present within the context.<\/p>\n<\/blockquote>\n<h3>3) LLM Query<\/h3>\n<p>Prompt the model with the paragraphs in <code>G_i<\/code> and ask it to return the <em>first paragraph where the content clearly changes relative to what came before<\/em>. 
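The extraction, grouping, and query steps can be sketched as a single loop. This is a minimal illustrative sketch, not the authors' implementation: the LLM call is injected as a plain function (`query_llm`, assumed to return the 0-based index of the first shifted paragraph in the group), and a whitespace token count stands in for a real tokenizer.

```python
# Minimal sketch of the LumberChunker loop (illustrative, not the paper's code).

def count_tokens(text):
    # Crude whitespace count; a real tokenizer would be used in practice.
    return len(text.split())

def lumber_chunk(paragraphs, query_llm, theta=550):
    chunks, start = [], 0
    while start < len(paragraphs):
        # Step 2: build group G_i by appending paragraphs up to ~theta tokens.
        end, tokens = start, 0
        while end < len(paragraphs) and tokens < theta:
            tokens += count_tokens(paragraphs[end])
            end += 1
        # Step 3: ask the model for the first boundary inside the group.
        if end - start > 1:
            boundary = query_llm(paragraphs[start:end])
            boundary = max(1, min(boundary, end - start))  # stay in range, keep progress
        else:
            boundary = 1  # a single remaining paragraph becomes its own chunk
        # Everything before the boundary forms a chunk; the next group starts there.
        chunks.append(paragraphs[start:start + boundary])
        start += boundary
    return chunks
```

With a stub model that always flags the third paragraph of each group, six 10-token paragraphs and `theta=25` split into three two-paragraph chunks; swapping the stub for a real LLM call reproduces the behaviour described in the steps above.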
Use that returned ID as the chunk boundary; start the next group at that paragraph and repeat until the end of the book.<\/p>\n<p><strong>Example:<\/strong> Given <code>G_1<\/code> = [<code>p1<\/code>, <code>p2<\/code>, <code>p3<\/code>, <code>p4<\/code>, <code>p5<\/code>, <code>p6<\/code>], the LLM responds: <code>p3<\/code><\/p>\n<p><strong>Answer Extraction:<\/strong><br \/>We extract <code>p3<\/code> as the boundary. This creates:<\/p>\n<ul>\n<li><strong>Chunk 1<\/strong>: [<code>p1<\/code>, <code>p2<\/code>]<\/li>\n<li><strong>Next group (<code>G_2<\/code>) starts at<\/strong> <code>p3<\/code><\/li>\n<\/ul>\n<h2>GutenQA: A Benchmark for Long-Form Narrative Retrieval<\/h2>\n<p>To evaluate our chunking approach, we introduce <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/datasets\/LumberChunker\/GutenQA\"><strong>GutenQA<\/strong><\/a>, a benchmark of <strong>100<\/strong> carefully cleaned public-domain books paired with <strong>3,000<\/strong> needle-in-a-haystack style questions. This allows us to measure retrieval quality directly and then observe how better retrieval leads to more accurate answers in a RAG system.<\/p>\n<h2>Key Findings<\/h2>\n<h3>Retrieval: LumberChunker leads \u2b50<\/h3>\n<p>LumberChunker leads across both DCG@k and Recall@k. 
By <code>k=20<\/code>, it reaches <strong>DCG \u2248 62.1%<\/strong> and <strong>Recall \u2248 77.9%<\/strong>, showing that better segmentation improves not only which passages appear first, but also how reliably the right context is retrieved.<\/p>\n<div style=\"text-align: center; margin-bottom: 1rem;\">\n<p style=\"font-weight: 600; font-size: 1.15rem; color: #464646; margin-bottom: 0.25rem;\">\n            Retrieval Performance Comparison<\/p>\n<\/div>\n<div id=\"table-container-ndcg\" class=\"table-card\">\n<table class=\"table is-bordered is-hoverable\">\n<thead>\n<tr>\n<th>DCG@k (%)<\/th>\n<th>1<\/th>\n<th>2<\/th>\n<th>5<\/th>\n<th>10<\/th>\n<th>20<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>Semantic Chunking<\/th>\n<td>29.50<\/td>\n<td>35.31<\/td>\n<td>40.67<\/td>\n<td>43.14<\/td>\n<td>44.74<\/td>\n<\/tr>\n<tr>\n<th>Paragraph-Level<\/th>\n<td>36.54<\/td>\n<td>42.11<\/td>\n<td>45.87<\/td>\n<td>47.72<\/td>\n<td>49.00<\/td>\n<\/tr>\n<tr>\n<th>Recursive Chunking<\/th>\n<td>39.04<\/td>\n<td>45.37<\/td>\n<td>50.66<\/td>\n<td>53.25<\/td>\n<td>54.72<\/td>\n<\/tr>\n<tr>\n<th>HyDE<sup>\u2020<\/sup><\/th>\n<td>33.47<\/td>\n<td>39.74<\/td>\n<td>45.06<\/td>\n<td>48.14<\/td>\n<td>49.92<\/td>\n<\/tr>\n<tr>\n<th>Proposition-Level<\/th>\n<td>36.91<\/td>\n<td>42.42<\/td>\n<td>44.88<\/td>\n<td>45.65<\/td>\n<td>46.19<\/td>\n<\/tr>\n<tr style=\"background-color: #f0f8ff;\">\n<th>LumberChunker<\/th>\n<td><strong>48.28<\/strong><\/td>\n<td><strong>54.86<\/strong><\/td>\n<td><strong>59.37<\/strong><\/td>\n<td><strong>60.99<\/strong><\/td>\n<td><strong>62.09<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<div id=\"table-container-recall\" class=\"table-card\" style=\"display: none;\">\n<table class=\"table is-bordered is-hoverable\">\n<thead>\n<tr>\n<th>Recall@k (%)<\/th>\n<th>1<\/th>\n<th>2<\/th>\n<th>5<\/th>\n<th>10<\/th>\n<th>20<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>Semantic Chunking<\/th>\n<td>29.50<\/td>\n<td>38.70<\/td>\n<td>50.60<\/td>\n<td>58.21<\/td>\n<td>64.51<\/td>\n<\/tr>\n<tr>\n<th>Paragraph-Level<\/th>\n<td>36.54<\/td>\n<td>45.37<\/td>\n<td>53.67<\/td>\n<td>59.34<\/td>\n<td>64.34<\/td>\n<\/tr>\n<tr>\n<th>Recursive Chunking<\/th>\n<td>39.04<\/td>\n<td>49.07<\/td>\n<td>60.64<\/td>\n<td>68.62<\/td>\n<td>74.35<\/td>\n<\/tr>\n<tr>\n<th>HyDE<sup>\u2020<\/sup><\/th>\n<td>33.47<\/td>\n<td>43.41<\/td>\n<td>55.11<\/td>\n<td>64.61<\/td>\n<td>71.61<\/td>\n<\/tr>\n<tr>\n<th>Proposition-Level<\/th>\n<td>36.91<\/td>\n<td>45.64<\/td>\n<td>51.04<\/td>\n<td>53.41<\/td>\n<td>55.54<\/td>\n<\/tr>\n<tr style=\"background-color: #f0f8ff;\">\n<th>LumberChunker<\/th>\n<td><strong>48.28<\/strong><\/td>\n<td><strong>58.71<\/strong><\/td>\n<td><strong>68.58<\/strong><\/td>\n<td><strong>73.58<\/strong><\/td>\n<td><strong>77.92<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<h3>Downstream QA: Targeted Retrieval Outperforms Large Context Windows<\/h3>\n<p>We find that even with very large context windows, a non-retrieval setup still performs worse than RAG, showing that selecting focused, relevant passages is more effective than simply increasing the amount of raw context. 
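For reference, the two retrieval metrics reported in the tables can be computed as follows. This is a minimal sketch assuming binary relevance (1 for the single chunk containing the answer, 0 otherwise), which matches the needle-in-a-haystack setup; how the reported percentages are averaged and normalized is not shown here.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: a hit at rank r contributes 1 / log2(r + 1),
    # so correct chunks retrieved earlier count more.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def recall_at_k(relevances, k):
    # With a single gold chunk per question, recall@k is 1 if that chunk
    # appears anywhere in the top k, else 0.
    return 1.0 if any(relevances[:k]) else 0.0
```

For a question whose gold chunk is ranked second, `dcg_at_k([0, 1, 0], 2)` gives 1/log2(3) \u2248 0.63, while `recall_at_k([0, 1, 0], 1)` is 0; averaging these per-question scores over the benchmark yields curves like those in the tables.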
Under this setting, when integrated into a standard RAG pipeline on a GutenQA subset, our <strong>RAG-LumberChunker<\/strong> is second only to <strong>RAG-Book<\/strong>, which uses hand-segmented ground-truth chunks.<\/p>\n<div id=\"qa-bar-chart-container\">\n<p class=\"chart-title\">Downstream QA Accuracy (%)<\/p>\n<\/div>\n<h3>A Sweet Spot Around \u03b8 \u2248 550 Tokens<\/h3>\n<p>We sweep <code>\u03b8 \u2208 [450, 1000]<\/code> tokens and find that <strong>\u03b8 \u2248 550<\/strong> consistently maximizes retrieval quality: large enough to provide context, small enough to keep the model focused on the current turn in the story.<\/p>\n<div id=\"dcg-chart-container\">\n<p class=\"chart-title\">DCG@k vs Token Budget (\u03b8)<\/p>\n<\/div>\n<p>This doesn&#8217;t mean the resulting chunks are large. In practice, as the table shows, the average chunk size is about <strong>334 tokens<\/strong>, meaning that LumberChunker typically detects earlier semantic shifts within the window.<\/p>\n<div class=\"table-card\" aria-labelledby=\"table10-caption\">\n<table class=\"table is-bordered is-hoverable\">\n<caption id=\"table10-caption\" style=\"caption-side: top; text-align: center; font-weight: 600; font-size: 0.95em; color: #555; padding-bottom: 0.5rem;\">\n                Average number of tokens per chunk and the total number of<br \/>\n                chunks after segmenting GutenQA<\/caption>\n<thead>\n<tr>\n<th style=\"text-align: left;\">Method<\/th>\n<th>Avg. 
#Tokens \/ Chunk<\/th>\n<th>Total #Chunks<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Semantic Chunking<\/td>\n<td>185 tokens<\/td>\n<td>191,059<\/td>\n<\/tr>\n<tr>\n<td>Paragraph-Level<\/td>\n<td>79 tokens<\/td>\n<td>248,307<\/td>\n<\/tr>\n<tr>\n<td>Recursive Chunking<\/td>\n<td>399 tokens<\/td>\n<td>31,787<\/td>\n<\/tr>\n<tr>\n<td>Proposition-Level<\/td>\n<td>12 tokens<\/td>\n<td>914,493<\/td>\n<\/tr>\n<tr style=\"background-color: #f0f8ff;\">\n<td style=\"text-align: left;\">LumberChunker<\/td>\n<td><strong>334 tokens<\/strong><\/td>\n<td><strong>36,917<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/div>\n<h2>Conclusion<\/h2>\n<p>LumberChunker reframes document chunking as a semantic boundary detection problem. Instead of relying on fixed token limits or surface structure, it uses a rolling context window to identify the earliest point where the meaning of the text becomes independent from what came before, producing segments that better align with the underlying narrative structure.<\/p>\n<p>On the GutenQA benchmark, LumberChunker consistently improves retrieval and downstream QA over traditional fixed-size and recursive methods, approaching the quality of manual, human-curated segmentations.<\/p>\n<p>These results suggest that segmentation is not just a preprocessing step, but a core design choice for retrieval systems. By creating semantically independent chunks, LumberChunker provides a practical way to improve how long-form documents are retrieved and used in RAG pipelines.<\/p>\n<h2>Citation<\/h2>\n<p>If you find LumberChunker useful in your research, please consider citing:<\/p>\n<pre class=\"wp-block-preformatted\">@inproceedings{duarte-etal-2024-lumberchunker,\n    title = \"{L}umber{C}hunker: Long-Form Narrative Document Segmentation\",\n    author = \"Duarte, Andr{\\'e} V.  and Marques, Jo{\\~a}o DS  and Gra{\\c{c}}a, Miguel  and Freire, Miguel  and Li, Lei  and Oliveira, Arlindo L.\",\n    editor = \"Al-Onaizan, Yaser  and Bansal, Mohit  and Chen, Yun-Nung\",\n    booktitle = \"Findings of the Association for Computational Linguistics: EMNLP 2024\",\n    month = nov,\n    year = \"2024\",\n    address = \"Miami, Florida, USA\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\/\/aclanthology.org\/2024.findings-emnlp.377\/\",\n    doi = \"10.18653\/v1\/2024.findings-emnlp.377\",\n    pages = \"6473--6486\",\n    abstract = \"LumberChunker reframes document chunking as a semantic boundary detection problem...\"\n}<\/pre>\n<hr class=\"wp-block-separator\"\/>\n<p>Blog created by <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/raymondjiang0917\/\">Raymond Jiang<\/a> and <a rel=\"nofollow noreferrer noopener\" target=\"_blank\" href=\"https:\/\/avduarte333.github.io\/\" data-type=\"URL\" data-id=\"https:\/\/avduarte333.github.io\/\">Andr\u00e9 Duarte<\/a><\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Links: Paper | Code | Data LumberChunker lets an LLM decide where a long story should be split, creating more natural chunks that help Retrieval Augmented Generation (RAG) systems retrieve the right information. 
Introduction Long-form narrative documents usually have an explicit structure, such as chapters or sections, but these units are often too broad [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":12925,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[110,4534,136,4961,8314,113,442,5766,7199],"class_list":["post-12923","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-blog","tag-document","tag-learning","tag-longform","tag-lumberchunker","tag-machine","tag-mlcmu","tag-narrative","tag-segmentation"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/12923","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=12923"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/12923\/revisions"}],"predecessor-version":[{"id":12924,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/12923\/revisions\/12924"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/12925"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=12923"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=12923"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=12923"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}