When PyMuPDF Can’t See the Desk: Parse PDFs for RAG with Azure Structure

companion in Enterprise Doc Intelligence, the collection that builds an enterprise RAG system from 4 bricks. Article 5 (doc parsing) constructed the parser with PyMuPDF (fitz). This companion retains the identical objective and the identical relational tables, and swaps the engine for Azure Structure (the prebuilt-layout mannequin), a richer package deal that recovers what fitz can not. That hole is the place we begin.

*the place this companion sits: it extends Article 5 (doc parsing), inside Half II (the 4 bricks), with a special parsing engine – Picture by creator*

PyMuPDF (fitz) is quick, free, and actual on clear prose. It additionally goes blind in three locations, and each is the place enterprise RAG quietly breaks.

The desk on web page 14 of a contract. Fitz reads the cells one after the other and concatenates them. The column construction is gone. “Renewal charge 500 Setup charge 200” lands within the chunk. Your mannequin is requested to guess which quantity is which charge.

The scanned modification glued to the tip of the doc. Fitz reads the native pages and returns empty strings on the scanned ones. The person will get no reply on the modification as a result of the parser by no means learn it.

The determine with textual content inside. A chart with axis labels. A signed seal stamp. A screenshot of a spreadsheet. Fitz returns the bbox of the picture. The textual content inside is gone.

Azure Doc Intelligence reads all three. It’s a proprietary Microsoft Azure cloud service ruled by Microsoft’s On-line Companies Phrases. The prebuilt-layout mannequin returns native desk cells (rows, columns, headers), OCR textual content for each web page (native or scanned), figures with the textual content inside them, and paragraph roles (title, sectionHeading, figureCaption, tableCaption). One name. The identical relational tables as fitz, half of them enriched.

The downstream pipeline doesn’t care which engine produced the dict. Retrieval, technology, annotation learn rows. They by no means learn the PDF.

*The identical tables, Azure enriches half – Picture by creator*

1. The place fitz is blind

4 circumstances. In each, fitz misses and Azure works.

1.1. Tables: fitz returns flat phrases, Azure returns cells

A contract desk has rows and columns. The label “Renewal charge” sits in column 1, the worth 500 sits in column 2. Fitz reads the web page high to backside and emits one line per textual content section. The 4 cells of a row come again as 4 free phrases. Generally the cells from the row under get combined in if the y-coordinates are shut. The chunker downstream sees a soup of phrases. The row-and-column construction that makes a desk a desk is gone.

Azure’s prebuilt-layout mannequin detects every desk as a structured object. consequence.tables is an inventory of tables, every with cells listed by (row_index, column_index). The header row is flagged (cell.sort == "columnHeader"). The cell content material is the cell textual content, precisely because the creator typed it. We flatten the desk into markdown rows so it lives inside line_df like some other content material. A four-cell row “Renewal charge | 500 | Setup charge | 200” turns into one line_df row with that markdown textual content. The header row will get a | --- | --- | ... | separator so a downstream mannequin reads the construction again.

1.2. Pictures: fitz returns the bbox, Azure returns the textual content

Many PDFs have figures with textual content inside them. Structure diagrams with field labels. Charts with axis ticks and legends. Signed seal stamps. Embedded screenshots of spreadsheets. Fitz returns every picture as a bbox and the uncooked bytes. The textual content inside is invisible to the parser.

Azure’s OCR runs on each web page, together with the pixels inside determine areas. For every determine, we acquire each Azure phrase whose bbox sits contained in the determine area and be part of them as ocr_text. “Multi-Head Consideration Concat Linear h” now lives in image_df.ocr_text for the determine on web page 4 of the Consideration paper. Retrieval can match a query about “multi-head consideration” even when the reply is textual content inside a determine.

*fitz returns the bbox and an empty textual content cell; Azure’s OCR recovers the labels printed contained in the determine – Picture by creator*

1.3. Scanned pages: fitz returns nothing, Azure returns OCR

A 30-page native contract will get a 10-page scanned modification glued on the finish. Fitz reads the native pages and returns empty strings for the scanned ones. The parser doesn’t flag this. The downstream pipeline silently covers 75% of the doc. The person has no concept 25% is lacking.

Azure runs OCR on each web page no matter supply. Native pages and scanned pages come again via the identical consequence.pages[i].traces path with the identical form. The parsing_method column on line_df lets downstream code inform which engine produced which rows. The parsing_summary dict has a n_pages area that matches the doc’s precise web page depend, not simply the pages with native textual content.

*a scan is pixels, not characters; fitz has no textual content layer to learn, Azure OCRs the web page – Picture by creator*

1.4. Captions and headings: fitz makes use of regex, Azure has express roles

Fitz detects determine / desk captions by regex on the beginning of every line (^Determine d+b, ^Desk d+b). It really works when captions appear to be “Determine 2” and misses the remaining (“Fig. 2”, multi-line wraps). It additionally has false positives: a body-text sentence that begins with “Determine 2” will get picked up as a caption when it’s a point out.

*the 2 failure modes of caption-by-regex (a missed “Fig.” caption, a physique point out wrongly flagged) that Azure’s paragraph function avoids – Picture by creator*

Azure’s paragraphs area has function labels: every paragraph within the consequence carries a tag like "figureCaption", "tableCaption", "title", or "sectionHeading" that tells us what sort of block it’s, with none regex. "figureCaption" and "tableCaption" populate object_registry immediately. "title" and "sectionHeading" rebuild the TOC. The tag is Azure’s structure mannequin naming the block’s operate; fitz has no equal. The (object_type, object_id) be part of key remains to be extracted by the identical regex on the caption textual content so cross_ref_df joins again the identical approach.

The TOC is the extra attention-grabbing case. Fitz’s build_toc_df reads native bookmarks (doc.get_toc()). When the PDF has no native bookmarks, fitz returns an empty TOC. That is the widespread enterprise case: Phrase exports, scanned paperwork, PDFs from kind turbines. Azure reconstructs the TOC from paragraph roles. Each "title" paragraph turns into a level-1 entry, each "sectionHeading" paragraph turns into level-2. The hierarchy comes from the order they seem. This isn’t excellent, nevertheless it produces a usable TOC the place fitz would produce nothing.

2. Identical contract, richer information

One operate. The identical tables as parse_pdf, in the identical form. One Azure name shared by each builder. That decision is small: level the SDK on the doc with one model_id, prebuilt-layout. (The opposite prebuilt mannequin, prebuilt-read, is OCR solely; the structure mannequin is the one which additionally returns tables, paragraph roles, and studying order.)

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.fashions import AnalyzeDocumentRequest
from azure.core.credentials import AzureKeyCredential

consumer = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))

# "Structure" = the prebuilt-layout mannequin (NOT prebuilt-read, which is OCR solely)
with open("contract.pdf", "rb") as f:
    poller = consumer.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(bytes_source=f.learn()),
    )

consequence = poller.consequence()   # tables, paragraph roles, OCR, studying order

parse_pdf_azure_layout is the Azure twin of parse_pdf: identical name form, identical dict of tables out, so each downstream brick reads it with out figuring out which engine ran. The physique is value a glance, as a result of it’s the form each engine within the collection follows: make one name, then one small builder per desk, and reuse the engine-agnostic builders for the tables that solely want line_df.

def parse_pdf_azure_layout(pdf_path):
    consequence = analyze_pdf(pdf_path)               # one name, prebuilt-layout
    line_df  = azure_layout_pdf_to_line_df(pdf_path, consequence=consequence)
    image_df = build_image_df_azure_layout(consequence)         # + ocr_text
    toc_df   = build_toc_df_azure_layout(consequence)           # paragraph roles
    object_registry = build_object_registry_azure_layout(consequence)  # function tags
    page_df      = build_page_df(line_df)        # reused fitz builder (line_df solely)
    cross_ref_df = build_cross_ref_df(line_df)   # reused fitz builder (line_df solely)
    return {"line_df": line_df, "image_df": image_df, "toc_df": toc_df,
            "object_registry": object_registry, "page_df": page_df,
            "cross_ref_df": cross_ref_df, "span_df": pd.DataFrame(),
            "parsing_summary": parsing_summary}

Studying it high to backside: one analyze_pdf makes the Azure name as soon as, then one small builder per desk reads that shared consequence, and the 2 tables that solely want line_df, page_df and cross_ref_df, are produced by the exact same fitz builders the native parser makes use of. The dict on the finish is the contract each engine returns.

*The identical tables mirror `parse_pdf`, with per-row diffs vs fitz – Picture by creator*

3. What every desk beneficial properties

3.1. line_df beneficial properties table-cell rows, picture OCR, choice marks

A 4-column “Schedule of Costs” desk turns into 6 rows in line_df: the header row, the markdown separator, and 4 information rows.

*Every supply row turns into a `line_df` row; column construction carried contained in the markdown textual content – Picture by creator*

We hold the cells inside line_df as an alternative of including a separate table_cells_df. One desk for each downstream brick to learn; paragraph traces and desk rows look the identical on the best way out. The fee: per-cell queries want a markdown parse step. For RAG questions that is superb. The retriever matches key phrases on the row textual content. The LLM reads the markdown immediately.

OCR textual content from inside photographs additionally lands in line_df as further rows. Azure’s consequence.pages[i].traces already contains traces that fall inside determine areas, so the line-builder picks them up robotically. Choice marks (checkboxes) turn into single-character traces: [x] for chosen, [ ] for unselected. Varieties with check-the-box fields turn into queryable.

3.2. image_df beneficial properties an `ocr_text` column

Identical row, new column. For every detected determine, we listing each Azure phrase whose bbox overlaps the determine area by not less than 50% and be part of them as ocr_text.

*Consideration paper figures with their labels uncovered; textual content inside figures now retrievable – Picture by creator*

The identical column on a fitz-produced image_df is empty. The fitz parser doesn’t OCR photographs. When parsing_method == "fitz", the ocr_text column is there for form parity however stays clean. Downstream code that checks ocr_text != "" works the identical whether or not the row got here from fitz or Azure.

3.3. toc_df will get reconstructed from paragraph roles

When the PDF has native bookmarks, the fitz build_toc_df is actual and free: it reads what the creator wrote. When it doesn’t (most enterprise paperwork), fitz returns an empty toc_df and downstream levels lose the part construction.

The Azure builder walks consequence.paragraphs, filters by function in {"title", "sectionHeading"}, and assembles a TOC. Stage 1 = title, degree 2 = sectionHeading. The hierarchy comes from the order paragraphs seem within the doc. The identical start_page, end_page, start_y, breadcrumb columns because the fitz TOC. The lookback move that computes end_page (the subsequent peer-or-ancestor’s start_page, or total_pages for the final part) is an identical to the fitz one; the one distinction is the place the rows come from.

The reconstruction shouldn’t be excellent. Azure can not inform sub-section ranges aside past sectionHeading. The hierarchy you get is two-deep at most. For many enterprise queries that is sufficient: a piece stamped “Schedule of Costs” lets the LLM floor its reply to the correct part even with out the complete Article 14 > Schedule of Costs path.

3.4. object_registry will get caption-role detection

Fitz detects captions by regex anchored at the beginning of a line: ^Determine d+b, ^Desk d+b. Two failure modes. False negatives when the caption format differs (Fig. 2. as an alternative of Determine 2, or a multi-line wrap that pushes the quantity off the primary line). False positives when a body-text sentence occurs to begin with “Determine 2 reveals…”.

Azure skips the regex drawback. Its paragraphs area tags "figureCaption" and "tableCaption" explicitly. We learn the function immediately. The (object_type, object_id) be part of key into cross_ref_df remains to be pulled from the caption textual content by the identical regex the fitz builder makes use of, so the be part of works the identical with both engine. The win is recall: Azure catches captions fitz misses. The fee stays the identical (one Azure name, the result’s reused throughout builders).

3.5. parsing_summary beneficial properties Azure-specific stats

Three new fields land within the doc-level synthesis dict:

n_tables_detected: what number of tables Azure discovered (zero on a pure-prose doc, non-zero on a contract with tables).
n_figures: what number of figures the structure mannequin recognized.
n_selection_marks: what number of checkboxes (stuffed or empty) Azure detected throughout all pages.

These three counts make routing a doc simple. A 30-page doc with n_tables_detected = 18 seems like a contract and the desk construction issues. A doc with n_selection_marks = 0 might be not a kind. A doc with n_figures = 0 is text-only; no level working picture OCR.

3.6. page_df and cross_ref_df: unchanged

Two tables keep the identical form. page_df and cross_ref_df are constructed from line_df alone, so the engine that produced line_df is irrelevant. One implementation, two engines, no drift.

span_df is empty beneath Azure. The structure mannequin doesn’t expose sub-line typography (per-word daring or italic). If you want spans for heading detection or time period emphasis, keep on fitz for that doc. The 2 engines complement one another.

4. The parsing_method column: provenance for adaptive parsing

Each per-row desk from parse_pdf_azure_layout carries parsing_method == "azure_layout". Each per-row desk from parse_pdf (the fitz one) carries parsing_method == "fitz". Identical column, identical title, each engines. The purpose is downstream.

*Contract on fitz, web page 14 re-parsed with Azure; each engines coexist by way of `parsing_method` – Picture by creator*

That is what adaptive parsing (Article 10) consumes. The default move makes use of fitz. Pages that fail a pre-parse verify (desk area detected with no rows extracted, image-heavy web page with sparse textual content, OCR layer with low high quality) get re-parsed by Azure. The re-parsed rows exchange or append to the unique line_df rows. The parsing_method column retains the path.

Three downstream patterns the column allows:

De-duplication: when the identical web page bought each passes, hold azure rows over fitz rows (df.sort_values("parsing_method").drop_duplicates(["page_num", "line_num"], hold="first") if "azure_layout" < "fitz" lexicographically, or use an express priority map).
Audit: a query that lands on a row with parsing_method == "azure_layout" prices extra to confirm (Azure was wanted). The reply’s confidence weighting can use this.
Price accounting: (line_df.parsing_method == "azure_layout").any() per web page tells you which of them pages went via Azure and how one can invoice the parsing time.

5. Price and latency

Azure shouldn’t be free. Three numbers matter.

Latency: one web page via prebuilt-layout returns in 2 to 4 seconds. A 30-page doc takes 60 to 120 seconds. Fitz parses the identical doc in beneath a second. When the person is ready for a question, parse with fitz first. Escalate to Azure solely on pages fitz dealt with poorly.

Cash: Azure costs per web page. The prebuilt-layout tier is round US$10 per 1,000 pages as we speak. A 30-page contract prices roughly US$0.30. Parsing 1,000 such contracts a day is US$300/day if each web page goes via Azure. Limiting Azure to the pages that want it brings this down by 10x or extra.

Limits: the per-call PDF measurement restrict is 500 MB or 2,000 pages, whichever comes first. Bigger paperwork should be break up. The free tier (F0) permits 500 pages per thirty days and is ok for growth. Manufacturing often wants S0.

The order of magnitude is secure: fitz is free, Azure prices roughly a cent per web page. The precise tier costs change with area and time: deal with the numbers above as a calibration, not a contract. Article 10 picks which engine runs.

6. When to name which

Default to fitz. Escalate to Azure when a particular sign says fitz shouldn’t be sufficient.

Three indicators value wiring:

The web page has a desk area however fitz extracted few or no row-like constructions. Compute on line_df: cluster traces by y-coordinate, search for runs of brief uniform-spaced traces (an indication of cells). If the web page metadata says “desk detected” (from fitz’s web page.find_tables()) however the line sample doesn’t look table-like, escalate.
The web page is image-heavy with sparse textual content. image_df for the web page covers greater than 80% of the web page space and line_df has fewer than 10 rows on that web page. Scanned web page with no OCR layer, or a web page that’s one massive diagram with textual content inside. Both case wants Azure.
The OCR high quality rating is low: When fitz’s web page.get_text("textual content") returns scrambled OCR (excessive ratio of Unicode alternative characters, low dictionary-word ratio), re-OCR with Azure. The text_quality_score is computed in pre_parse_signals and browse by the dispatcher.

A fourth sign is easier. If the doc has no native TOC (fitz.toc_df.empty) and technology wants part context, run the doc as soon as via Azure to get a reconstructed TOC. One price per doc, not per question.

Article 10 builds the complete dispatcher. The parsing_method column is what lets each downstream stage learn which engine ran on which row.

7. Conclusion

Two engines, one contract: the identical relational tables out, identical downstream code no matter which one ran.

*Each functionality that issues for enterprise RAG, plus velocity and value – Picture by creator*

A parser doesn’t return textual content; it returns a mannequin of the doc. Azure makes that mannequin richer (cell-level tables, OCR inside figures, captions tagged by function, TOC reconstructed with out bookmarks) at 2 to 4 seconds and ~US$0.01 per web page. Fitz prices nothing and runs in milliseconds. The routing rule is easy: fitz by default, Azure when an upstream sign says fitz shouldn’t be sufficient. Article 10 wires the dispatcher.

8. Sources and additional studying

The prebuilt-layout mannequin behind parse_pdf_azure_layout is documented by Microsoft and rests on cell-level desk extraction analysis (Smock et al. 2022) plus a paragraph-role layer that converts visible areas into structural roles. Docling (Article 5ter) is the open-source equal of the identical cascade; it provides the identical desk contract on native {hardware}, helpful when paperwork can not depart the constructing.

Identical route because the article:

Microsoft, Azure AI Doc Intelligence. Structure mannequin. Official documentation for prebuilt-layout, the mannequin behind parse_pdf_azure_layout. The cell-level desk output, paragraph roles, and OCR protection all originate right here.
Smock, Pesala, Abraham, PubTables-1M / Desk Transformer (TATR), CVPR 2022 (arXiv:2110.00061). The analysis behind the cell-level desk extraction Azure ships; helpful for understanding what azure_layout is doing beneath the hood.

Completely different angle, totally different context:

Auer et al., Docling Technical Report, IBM Analysis 2024 (arXiv:2408.09869). Open-source native equal of the Azure structure cascade. Identical desk contract; trades cloud price for native compute. The suitable selection when confidentiality blocks the cloud add that Azure requires.

Earlier within the collection: