{"id":15688,"date":"2026-06-13T09:30:21","date_gmt":"2026-06-13T09:30:21","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=15688"},"modified":"2026-06-13T09:30:21","modified_gmt":"2026-06-13T09:30:21","slug":"when-pymupdf-cant-see-the-desk-parse-pdfs-for-rag-with-azure-structure","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=15688","title":{"rendered":"When PyMuPDF Can\u2019t See the Table: Parse PDFs for RAG with Azure Layout"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">This is a companion in <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/document-intelligence-a-series-on-building-rag-brick-by-brick-from-minimal-to-corpus-scale\/\">Enterprise Document Intelligence<\/a>, the series that builds an enterprise RAG system from four bricks. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/beyond-extract_text-the-two-layers-of-a-pdf-that-drive-rag-quality\/\">Article 5 (document parsing)<\/a> built the parser with PyMuPDF (fitz). This companion keeps the same goal and the same relational tables, and swaps the engine for <strong>Azure Layout<\/strong> (the <code>prebuilt-layout<\/code> model), a richer engine that recovers what fitz cannot. That gap is where we start.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-132-1024x572.png\" alt=\"\" class=\"wp-image-666831\"\/><figcaption class=\"wp-element-caption\"><em>Where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), with a different parsing engine \u2013 Image by author<\/em><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">PyMuPDF (fitz) is fast, free, and exact on clean prose. 
It also goes blind in three places, and each is where enterprise RAG quietly breaks.<\/p>\n<p class=\"wp-block-paragraph\">The table on page 14 of a contract. Fitz reads the cells one by one and concatenates them. The column structure is gone. <em>\u201cRenewal charge 500 Setup charge 200\u201d<\/em> lands in the chunk. Your model is asked to guess which number is which charge.<\/p>\n<p class=\"wp-block-paragraph\">The scanned amendment glued to the end of the document. Fitz reads the native pages and returns empty strings on the scanned ones. The user gets no answer on the amendment because the parser never read it.<\/p>\n<p class=\"wp-block-paragraph\">The figure with text inside. A chart with axis labels. A signed seal stamp. A screenshot of a spreadsheet. Fitz returns the bbox of the image. The text inside is gone.<\/p>\n<p class=\"wp-block-paragraph\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/products\/ai-services\/ai-document-intelligence\/\">Azure Document Intelligence<\/a> reads all three. It is a proprietary Microsoft Azure cloud service governed by <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.microsoft.com\/licensing\/terms\/product\/ForOnlineServices\/all\">Microsoft\u2019s Online Services Terms<\/a>. The <code>prebuilt-layout<\/code> model returns native table cells (rows, columns, headers), OCR text for every page (native or scanned), figures with the text inside them, and paragraph roles (<code>title<\/code>, <code>sectionHeading<\/code>, <code>figureCaption<\/code>, <code>tableCaption<\/code>). One call. The same relational tables as fitz, half of them enriched.<\/p>\n<p class=\"wp-block-paragraph\">The downstream pipeline does not care which engine produced the dict. Retrieval, generation, and annotation read rows. 
They never read the PDF.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-133-1024x617.png\" alt=\"\" class=\"wp-image-666832\"\/><figcaption class=\"wp-element-caption\"><em>The same tables, Azure enriches half \u2013 Image by author<\/em><\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">1. Where fitz is blind<\/h2>\n<p class=\"wp-block-paragraph\">Four cases. In each, fitz misses and Azure works.<\/p>\n<h3 class=\"wp-block-heading\">1.1. Tables: fitz returns flat words, Azure returns cells<\/h3>\n<p class=\"wp-block-paragraph\">A contract table has rows and columns. The label <em>\u201cRenewal charge\u201d<\/em> sits in column 1, the value <em>500<\/em> sits in column 2. Fitz reads the page top to bottom and emits one line per text segment. The four cells of a row come back as four loose words. Sometimes the cells from the row below get mixed in if the y-coordinates are close. The chunker downstream sees a soup of words. The row-and-column structure that makes a table a table is gone.<\/p>\n<p class=\"wp-block-paragraph\">Azure\u2019s <code>prebuilt-layout<\/code> model detects each table as a structured object. <code>result.tables<\/code> is a list of tables, each with <code>cells<\/code> indexed by <code>(row_index, column_index)<\/code>. The header row is flagged (<code>cell.kind == \"columnHeader\"<\/code>). The cell content is the cell text, exactly as the author typed it. 
We flatten the table into markdown rows so it lives inside <code>line_df<\/code> like any other content. A four-cell row <em>\u201cRenewal charge | 500 | Setup charge | 200\u201d<\/em> becomes one <code>line_df<\/code> row with that markdown text. The header row gets a <code>| --- | --- | ... |<\/code> separator so a downstream model reads the structure back.<\/p>\n<h3 class=\"wp-block-heading\">1.2. Images: fitz returns the bbox, Azure returns the text<\/h3>\n<p class=\"wp-block-paragraph\">Many PDFs have figures with text inside them. Architecture diagrams with box labels. Charts with axis ticks and legends. Signed seal stamps. Embedded screenshots of spreadsheets. Fitz returns each image as a bbox and the raw bytes. The text inside is invisible to the parser.<\/p>\n<p class=\"wp-block-paragraph\">Azure\u2019s OCR runs on every page, including the pixels inside figure regions. For each figure, we collect every Azure word whose bbox sits inside the figure region and join them as <code>ocr_text<\/code>. <em>\u201cMulti-Head Attention Concat Linear h\u201d<\/em> now lives in <code>image_df.ocr_text<\/code> for the figure on page 4 of the Attention paper. 
Retrieval can match a question about <em>\u201cmulti-head attention\u201d<\/em> even when the answer is text inside a figure.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-134-1024x428.png\" alt=\"\" class=\"wp-image-666833\"\/><figcaption class=\"wp-element-caption\"><em>fitz returns the bbox and an empty text cell; Azure\u2019s OCR recovers the labels printed inside the figure \u2013 Image by author<\/em><\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">1.3. Scanned pages: fitz returns nothing, Azure returns OCR<\/h3>\n<p class=\"wp-block-paragraph\">A 30-page native contract gets a 10-page scanned amendment glued on the end. Fitz reads the native pages and returns empty strings for the scanned ones. The parser does not flag this. The downstream pipeline silently covers 75% of the document. The user has no idea 25% is missing.<\/p>\n<p class=\"wp-block-paragraph\">Azure runs OCR on every page regardless of source. Native pages and scanned pages come back through the same <code>result.pages[i].lines<\/code> path with the same shape. The <code>parsing_method<\/code> column on <code>line_df<\/code> lets downstream code tell which engine produced which rows. 
The <code>parsing_summary<\/code> dict has an <code>n_pages<\/code> field that matches the document\u2019s actual page count, not just the pages with native text.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-135-1024x559.png\" alt=\"\" class=\"wp-image-666834\"\/><figcaption class=\"wp-element-caption\"><em>a scan is pixels, not characters; fitz has no text layer to read, Azure OCRs the page \u2013 Image by author<\/em><\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">1.4. Captions and headings: fitz uses regex, Azure has explicit roles<\/h3>\n<p class=\"wp-block-paragraph\">Fitz detects figure \/ table captions by regex at the start of each line (<code>^Figure \\d+\\b<\/code>, <code>^Table \\d+\\b<\/code>). It works when captions look like <em>\u201cFigure 2\u201d<\/em> and misses the rest (<em>\u201cFig. 2\u201d<\/em>, multi-line wraps). 
It also has false positives: a body-text sentence that begins with <em>\u201cFigure 2\u201d<\/em> gets picked up as a caption when it is only a mention.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-136-1024x358.png\" alt=\"\" class=\"wp-image-666835\"\/><figcaption class=\"wp-element-caption\"><em>the two failure modes of caption-by-regex (a missed \u201cFig.\u201d caption, a body mention wrongly flagged) that Azure\u2019s paragraph role avoids \u2013 Image by author<\/em><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Azure\u2019s <code>paragraphs<\/code> field has role labels: paragraphs in the result can carry a tag such as <code>\"figureCaption\"<\/code>, <code>\"tableCaption\"<\/code>, <code>\"title\"<\/code>, or <code>\"sectionHeading\"<\/code> that tells us what kind of block it is, without any regex. <code>\"figureCaption\"<\/code> and <code>\"tableCaption\"<\/code> populate <code>object_registry<\/code> directly. <code>\"title\"<\/code> and <code>\"sectionHeading\"<\/code> rebuild the TOC. The tag is Azure\u2019s layout model naming the block\u2019s function; fitz has no equivalent. The <code>(object_type, object_id)<\/code> join key is still extracted by the same regex on the caption text so <code>cross_ref_df<\/code> joins back the same way.<\/p>\n<p class=\"wp-block-paragraph\">The TOC is the more interesting case. Fitz\u2019s <code>build_toc_df<\/code> reads native bookmarks (<code>doc.get_toc()<\/code>). When the PDF has no native bookmarks, fitz returns an empty TOC. This is the common enterprise case: Word exports, scanned documents, PDFs from form generators. Azure reconstructs the TOC from paragraph roles. 
Every <code>\"title\"<\/code> paragraph becomes a level-1 entry, every <code>\"sectionHeading\"<\/code> paragraph becomes level-2. The hierarchy comes from the order in which they appear. This is not perfect, but it produces a usable TOC where fitz would produce nothing.<\/p>\n<h2 class=\"wp-block-heading\">2. Same contract, richer data<\/h2>\n<p class=\"wp-block-paragraph\">One function. The same tables as <code>parse_pdf<\/code>, in the same shape. One Azure call shared by every builder. That call is small: point the SDK at the document with one <code>model_id<\/code>, <code>prebuilt-layout<\/code>. (Another prebuilt model, <code>prebuilt-read<\/code>, is OCR only; the layout model is the one that also returns tables, paragraph roles, and reading order.)<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from azure.ai.documentintelligence import DocumentIntelligenceClient\nfrom azure.ai.documentintelligence.models import AnalyzeDocumentRequest\nfrom azure.core.credentials import AzureKeyCredential\n\n# endpoint and key: the values from your Azure Document Intelligence resource\nclient = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))\n\n# \"Layout\" = the prebuilt-layout model (NOT prebuilt-read, which is OCR only)\nwith open(\"contract.pdf\", \"rb\") as f:\n    poller = client.begin_analyze_document(\n        \"prebuilt-layout\",\n        AnalyzeDocumentRequest(bytes_source=f.read()),\n    )\n\nresult = poller.result()   # tables, paragraph roles, OCR, reading order<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><code>parse_pdf_azure_layout<\/code> is the Azure twin of <code>parse_pdf<\/code>: same call shape, same dict of tables out, so every downstream brick reads it without knowing which engine ran. 
The body is worth a look, because it is the shape every engine in the series follows: make one call, then one small builder per table, and reuse the engine-agnostic builders for the tables that only need <code>line_df<\/code>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\n\ndef parse_pdf_azure_layout(pdf_path):\n    result = analyze_pdf(pdf_path)               # one call, prebuilt-layout\n    line_df  = azure_layout_pdf_to_line_df(pdf_path, result=result)\n    image_df = build_image_df_azure_layout(result)         # + ocr_text\n    toc_df   = build_toc_df_azure_layout(result)           # paragraph roles\n    object_registry = build_object_registry_azure_layout(result)  # role tags\n    page_df      = build_page_df(line_df)        # reused fitz builder (line_df only)\n    cross_ref_df = build_cross_ref_df(line_df)   # reused fitz builder (line_df only)\n    # Assumption: the excerpt never defines parsing_summary; this helper name is hypothetical\n    parsing_summary = build_parsing_summary_azure_layout(result, line_df)\n    return {\"line_df\": line_df, \"image_df\": image_df, \"toc_df\": toc_df,\n            \"object_registry\": object_registry, \"page_df\": page_df,\n            \"cross_ref_df\": cross_ref_df, \"span_df\": pd.DataFrame(),\n            \"parsing_summary\": parsing_summary}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Reading it top to bottom: one <code>analyze_pdf<\/code> makes the Azure call once, then one small builder per table reads that shared <code>result<\/code>, and the two tables that only need <code>line_df<\/code>, <code>page_df<\/code> and <code>cross_ref_df<\/code>, are produced by the very same fitz builders the native parser uses. 
The dict at the end is the contract every engine returns.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-137-1024x660.png\" alt=\"\" class=\"wp-image-666843\"\/><figcaption class=\"wp-element-caption\"><em>The same tables mirror <code>parse_pdf<\/code>, with per-row diffs vs fitz \u2013 Image by author<\/em><\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">3. What each table gains<\/h2>\n<h3 class=\"wp-block-heading\">3.1. line_df gains table-cell rows, image OCR, selection marks<\/h3>\n<p class=\"wp-block-paragraph\">A 4-column \u201cSchedule of Costs\u201d table becomes 6 rows in <code>line_df<\/code>: the header row, the markdown separator, and 4 data rows.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-138-1024x641.png\" alt=\"\" class=\"wp-image-666844\"\/><figcaption class=\"wp-element-caption\"><em>Each source row becomes a <code>line_df<\/code> row; column structure carried inside the markdown text \u2013 Image by author<\/em><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We keep the cells inside <code>line_df<\/code> instead of adding a separate <code>table_cells_df<\/code>. One table for every downstream brick to read; paragraph lines and table rows look the same on the way out. The cost: per-cell queries need a markdown parse step. 
For RAG questions this is fine. The retriever matches keywords on the row text. The LLM reads the markdown directly.<\/p>\n<p class=\"wp-block-paragraph\">OCR text from inside images also lands in <code>line_df<\/code> as extra rows. Azure\u2019s <code>result.pages[i].lines<\/code> already includes lines that fall inside figure regions, so the line-builder picks them up automatically. Selection marks (checkboxes) become short marker lines: <code>[x]<\/code> for selected, <code>[ ]<\/code> for unselected. Forms with check-the-box fields become queryable.<\/p>\n<h3 class=\"wp-block-heading\">3.2. image_df gains an <code>ocr_text<\/code> column<\/h3>\n<p class=\"wp-block-paragraph\">Same row, new column. For each detected figure, we list every Azure word whose bbox overlaps the figure region by at least 50% and join them as <code>ocr_text<\/code>.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-139-1024x277.png\" alt=\"\" class=\"wp-image-666845\"\/><figcaption class=\"wp-element-caption\"><em>Attention paper figures with their labels exposed; text inside figures now retrievable \u2013 Image by author<\/em><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The same column on a fitz-produced <code>image_df<\/code> is empty. The fitz parser does not OCR images. When <code>parsing_method == \"fitz\"<\/code>, the <code>ocr_text<\/code> column is there for shape parity but stays blank. 
Downstream code that checks <code>ocr_text != \"\"<\/code> works the same whether the row came from fitz or Azure.<\/p>\n<h3 class=\"wp-block-heading\">3.3. toc_df gets reconstructed from paragraph roles<\/h3>\n<p class=\"wp-block-paragraph\">When the PDF has native bookmarks, the fitz <code>build_toc_df<\/code> is exact and free: it reads what the author wrote. When it does not (most enterprise documents), fitz returns an empty <code>toc_df<\/code> and downstream stages lose the section structure.<\/p>\n<p class=\"wp-block-paragraph\">The Azure builder walks <code>result.paragraphs<\/code>, filters by role in <code>{\"title\", \"sectionHeading\"}<\/code>, and assembles a TOC. Level 1 = title, level 2 = sectionHeading. The hierarchy comes from the order in which paragraphs appear in the document. The same <code>start_page<\/code>, <code>end_page<\/code>, <code>start_y<\/code>, <code>breadcrumb<\/code> columns as the fitz TOC. The lookback pass that computes <code>end_page<\/code> (the next peer-or-ancestor\u2019s <code>start_page<\/code>, or <code>total_pages<\/code> for the last section) is identical to the fitz one; the only difference is where the rows come from.<\/p>\n<p class=\"wp-block-paragraph\">The reconstruction is not perfect. Azure cannot tell sub-section levels apart beyond <code>sectionHeading<\/code>. The hierarchy you get is two-deep at most. For most enterprise queries this is enough: a chunk stamped <em>\u201cSchedule of Costs\u201d<\/em> lets the LLM ground its answer in the right section even without the full <em>Article 14 &gt; Schedule of Costs<\/em> path.<\/p>\n<h3 class=\"wp-block-heading\">3.4. 
object_registry gets caption-role detection<\/h3>\n<p class=\"wp-block-paragraph\">Fitz detects captions by regex anchored at the start of a line: <code>^Figure \\d+\\b<\/code>, <code>^Table \\d+\\b<\/code>. Two failure modes. False negatives when the caption format differs (<code>Fig. 2.<\/code> instead of <code>Figure 2<\/code>, or a multi-line wrap that pushes the number off the first line). False positives when a body-text sentence happens to start with <em>\u201cFigure 2 shows\u2026\u201d<\/em>.<\/p>\n<p class=\"wp-block-paragraph\">Azure skips the regex problem. Its <code>paragraphs<\/code> field tags <code>\"figureCaption\"<\/code> and <code>\"tableCaption\"<\/code> explicitly. We read the role directly. The <code>(object_type, object_id)<\/code> join key into <code>cross_ref_df<\/code> is still pulled from the caption text by the same regex the fitz builder uses, so the join works the same with either engine. The win is recall: Azure catches captions fitz misses. The cost stays the same (one Azure call, the result is reused across builders).<\/p>\n<h3 class=\"wp-block-heading\">3.5. 
parsing_summary gains Azure-specific stats<\/h3>\n<p class=\"wp-block-paragraph\">Three new fields land in the doc-level summary dict:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><code>n_tables_detected<\/code>: how many tables Azure found (zero on a pure-prose document, non-zero on a contract with tables).<\/li>\n<li class=\"wp-block-list-item\"><code>n_figures<\/code>: how many figures the layout model identified.<\/li>\n<li class=\"wp-block-list-item\"><code>n_selection_marks<\/code>: how many checkboxes (filled or empty) Azure detected across all pages.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">These three counts make routing a document easy. A 30-page document with <code>n_tables_detected = 18<\/code> looks like a contract and the table structure matters. A document with <code>n_selection_marks = 0<\/code> is probably not a form. A document with <code>n_figures = 0<\/code> is text-only; no point running image OCR.<\/p>\n<h3 class=\"wp-block-heading\">3.6. page_df and cross_ref_df: unchanged<\/h3>\n<p class=\"wp-block-paragraph\">Two tables stay the same shape. <code>page_df<\/code> and <code>cross_ref_df<\/code> are built from <code>line_df<\/code> alone, so the engine that produced <code>line_df<\/code> is irrelevant. One implementation, two engines, no drift.<\/p>\n<p class=\"wp-block-paragraph\"><code>span_df<\/code> is empty under Azure. The layout model does not expose sub-line typography (per-word bold or italic). 
If you need spans for heading detection or term emphasis, stay on fitz for that document. The two engines complement each other.<\/p>\n<h2 class=\"wp-block-heading\">4. The parsing_method column: provenance for adaptive parsing<\/h2>\n<p class=\"wp-block-paragraph\">Every per-row table from <code>parse_pdf_azure_layout<\/code> carries <code>parsing_method == \"azure_layout\"<\/code>. Every per-row table from <code>parse_pdf<\/code> (the fitz one) carries <code>parsing_method == \"fitz\"<\/code>. Same column, same name, both engines. The point is downstream.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-140-1024x618.png\" alt=\"\" class=\"wp-image-666846\"\/><figcaption class=\"wp-element-caption\"><em>Contract on fitz, page 14 re-parsed with Azure; both engines coexist via <code>parsing_method<\/code> \u2013 Image by author<\/em><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">This is what adaptive parsing (Article 10) consumes. The default pass uses fitz. Pages that fail a pre-parse check (table region detected with no rows extracted, image-heavy page with sparse text, OCR layer with low quality) get re-parsed by Azure. The re-parsed rows replace or append to the original <code>line_df<\/code> rows. 
The <code>parsing_method<\/code> column keeps the trail.<\/p>\n<p class=\"wp-block-paragraph\">Three downstream patterns the column enables:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>De-duplication<\/strong>: when the same page got both passes, keep Azure rows over fitz rows. <code>df.sort_values(\"parsing_method\").drop_duplicates([\"page_num\", \"line_num\"], keep=\"first\")<\/code> works because <code>\"azure_layout\" &lt; \"fitz\"<\/code> lexicographically; an explicit precedence map is the more robust option.<\/li>\n<li class=\"wp-block-list-item\"><strong>Audit<\/strong>: a question that lands on a row with <code>parsing_method == \"azure_layout\"<\/code> costs more to verify (Azure was needed). The answer\u2019s confidence weighting can use this.<\/li>\n<li class=\"wp-block-list-item\"><strong>Cost accounting<\/strong>: <code>(line_df.parsing_method == \"azure_layout\").any()<\/code> per page tells you which pages went through Azure and how to bill the parsing time.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">5. Cost and latency<\/h2>\n<p class=\"wp-block-paragraph\">Azure is not free. Three numbers matter.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Latency<\/strong>: one page through <code>prebuilt-layout<\/code> returns in 2 to 4 seconds. A 30-page document takes 60 to 120 seconds. Fitz parses the same document in under a second. When the user is waiting for a query, parse with fitz first. Escalate to Azure only on pages fitz handled poorly.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Money<\/strong>: Azure charges per page. The <code>prebuilt-layout<\/code> tier is around US$10 per 1,000 pages today. A 30-page contract costs roughly US$0.30. Parsing 1,000 such contracts a day is US$300\/day if every page goes through Azure. 
Limiting Azure to the pages that need it brings this down by 10x or more.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Limits<\/strong>: the per-call PDF size limit is 500 MB or 2,000 pages, whichever comes first. Larger documents must be split. The free tier (F0) allows 500 pages per month and is fine for development. Production usually needs S0.<\/p>\n<p class=\"wp-block-paragraph\">The order of magnitude is stable: fitz is free, Azure costs roughly a cent per page. The exact tier prices change with region and time: treat the numbers above as a calibration, not a contract. Article 10 picks which engine runs.<\/p>\n<h2 class=\"wp-block-heading\">6. When to call which<\/h2>\n<p class=\"wp-block-paragraph\">Default to fitz. Escalate to Azure when a specific signal says fitz is not enough.<\/p>\n<p class=\"wp-block-paragraph\">Three signals worth wiring:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>The page has a table region but fitz extracted few or no row-like structures.<\/strong> Compute on <code>line_df<\/code>: cluster lines by y-coordinate, look for runs of short, uniformly spaced lines (a sign of cells). If the page metadata says \u201ctable detected\u201d (from fitz\u2019s <code>page.find_tables()<\/code>) but the line pattern does not look table-like, escalate.<\/li>\n<li class=\"wp-block-list-item\"><strong>The page is image-heavy with sparse text.<\/strong> <code>image_df<\/code> for the page covers more than 80% of the page area and <code>line_df<\/code> has fewer than 10 rows on that page. Scanned page with no OCR layer, or a page that is one big diagram with text inside. 
Either case needs Azure.<\/li>\n<li class=\"wp-block-list-item\"><strong>The OCR quality score is low.<\/strong> When fitz\u2019s <code>page.get_text(\"text\")<\/code> returns scrambled OCR (high ratio of Unicode replacement characters, low dictionary-word ratio), re-OCR with Azure. The <code>text_quality_score<\/code> is computed in <code>pre_parse_signals<\/code> and read by the dispatcher.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">A fourth signal is simpler. If the document has no native TOC (<code>fitz.toc_df.empty<\/code>) and generation needs section context, run the document once through Azure to get a reconstructed TOC. One cost per document, not per query.<\/p>\n<p class=\"wp-block-paragraph\">Article 10 builds the full dispatcher. The <code>parsing_method<\/code> column is what lets every downstream stage read which engine ran on which row.<\/p>\n<h2 class=\"wp-block-heading\">7. Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">Two engines, one contract: the same relational tables out, same downstream code regardless of which one ran.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/06\/image-141-1024x923.png\" alt=\"\" class=\"wp-image-666847\"\/><figcaption class=\"wp-element-caption\"><em>Every capability that matters for enterprise RAG, plus speed and cost \u2013 Image by author<\/em><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">A parser does not return text; it returns a model of the document. Azure makes that model richer (cell-level tables, OCR inside figures, captions tagged by role, TOC reconstructed without bookmarks) at 2 to 4 seconds and ~US$0.01 per page. Fitz costs nothing and runs in milliseconds. 
The routing rule is simple: fitz by default, Azure when an upstream signal says fitz is not enough. Article 10 wires the dispatcher.<\/p>\n<h2 class=\"wp-block-heading\">8. Sources and further reading<\/h2>\n<p class=\"wp-block-paragraph\">The <code>prebuilt-layout<\/code> model behind <code>parse_pdf_azure_layout<\/code> is documented by Microsoft and rests on cell-level table extraction research (Smock et al.\u00a02022) plus a paragraph-role layer that converts visual regions into structural roles. Docling (Article 5ter) is the open-source equivalent of the same cascade; it gives the same table contract on local hardware, useful when documents cannot leave the building.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Same direction as the article:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Microsoft, <em><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/learn.microsoft.com\/azure\/ai-services\/document-intelligence\/prebuilt\/layout\">Azure AI Document Intelligence. Layout model<\/a><\/em>. Official documentation for <code>prebuilt-layout<\/code>, the model behind <code>parse_pdf_azure_layout<\/code>. The cell-level table output, paragraph roles, and OCR coverage all originate here.<\/li>\n<li class=\"wp-block-list-item\">Smock, Pesala, Abraham, <em>PubTables-1M \/ Table Transformer (TATR)<\/em>, CVPR 2022 (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2110.00061\">arXiv:2110.00061<\/a>). 
The research behind the cell-level table extraction Azure ships; useful for understanding what <code>azure_layout<\/code> is doing under the hood.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><strong>Different angle, different context:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Auer et al., <em>Docling Technical Report<\/em>, IBM Research 2024 (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2408.09869\">arXiv:2408.09869<\/a>). Open-source local equivalent of the Azure layout cascade. Same table contract; trades cloud cost for local compute. The right choice when confidentiality blocks the cloud upload that Azure requires.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><strong>Earlier in the series:<\/strong><\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from 4 bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz). 
This companion keeps the same goal and the same relational tables, and swaps the engine for Azure Layout (the prebuilt-layout model), a richer package that recovers what fitz [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":15690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[1126,2977,9408,4114,9406,1729,9407],"class_list":["post-15688","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-azure","tag-layout","tag-parse","tag-pdfs","tag-pymupdf","tag-rag","tag-table"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15688"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688\/revisions"}],"predecessor-version":[{"id":15689,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688\/revisions\/15689"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/15690"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15688"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15688"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15
688"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}