{"id":15688,"date":"2026-06-13T09:30:21","date_gmt":"2026-06-13T09:30:21","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=15688"},"modified":"2026-06-13T09:30:21","modified_gmt":"2026-06-13T09:30:21","slug":"when-pymupdf-cant-see-the-desk-parse-pdfs-for-rag-with-azure-structure","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=15688","title":{"rendered":"When PyMuPDF Can\u2019t See the Desk: Parse PDFs for RAG with Azure Structure"},"content":{"rendered":"

\n<\/p>\n

\n

companion in Enterprise Doc Intelligence<\/a>, the collection that builds an enterprise RAG system from 4 bricks. Article 5 (doc parsing)<\/a> constructed the parser with PyMuPDF (fitz). This companion retains the identical objective and the identical relational tables, and swaps the engine for Azure Structure<\/strong> (the prebuilt-layout<\/code> mannequin), a richer package deal that recovers what fitz can not. That hole is the place we begin.<\/p>\n

the place this companion sits: it extends Article 5 (doc parsing), inside Half II (the 4 bricks), with a special parsing engine \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n
PyMuPDF (fitz) is quick, free, and actual on clear prose. It additionally goes blind in three locations, and each is the place enterprise RAG quietly breaks.<\/p>\n
The desk on web page 14 of a contract. Fitz reads the cells one after the other and concatenates them. The column construction is gone. *\u201cRenewal charge 500 Setup charge 200\u201d<\/em> lands within the chunk. Your mannequin is requested to guess which quantity is which charge.<\/p>\n*
The scanned modification glued to the tip of the doc. Fitz reads the native pages and returns empty strings on the scanned ones. The person will get no reply on the modification as a result of the parser by no means learn it.<\/p>\n
The determine with textual content inside. A chart with axis labels. A signed seal stamp. A screenshot of a spreadsheet. Fitz returns the bbox of the picture. The textual content inside is gone.<\/p>\n
Azure Doc Intelligence<\/a> reads all three. It\u2019s a proprietary Microsoft Azure cloud service ruled by Microsoft\u2019s On-line Companies Phrases<\/a>. The `prebuilt-layout<\/code> mannequin returns native desk cells (rows, columns, headers), OCR textual content for each web page (native or scanned), figures with the textual content inside them, and paragraph roles (title<\/code>, sectionHeading<\/code>, figureCaption<\/code>, tableCaption<\/code>). One name. The identical relational tables as fitz, half of them enriched.<\/p>\n`
*`The downstream pipeline doesn’t care which engine produced the dict. Retrieval, technology, annotation learn rows. They by no means learn the PDF.<\/p>\n`*
The identical tables, Azure enriches half \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n1. The place fitz is blind<\/h2>\n4 circumstances. In each, fitz misses and Azure works.<\/p>\n1.1. Tables: fitz returns flat phrases, Azure returns cells<\/h3>\nA contract desk has rows and columns. The label \u201cRenewal charge\u201d<\/em> sits in column 1, the worth 500<\/em> sits in column 2. Fitz reads the web page high to backside and emits one line per textual content section. The 4 cells of a row come again as 4 free phrases. Generally the cells from the row under get combined in if the y-coordinates are shut. The chunker downstream sees a soup of phrases. The row-and-column construction that makes a desk a desk is gone.<\/p>\n Azure\u2019s prebuilt-layout<\/code> mannequin detects every desk as a structured object. consequence.tables<\/code> is an inventory of tables, every with cells<\/code> listed by (row_index, column_index)<\/code>. The header row is flagged (cell.sort == \"columnHeader\"<\/code>). The cell content material is the cell textual content, precisely because the creator typed it. We flatten the desk into markdown rows so it lives inside line_df<\/code> like some other content material. A four-cell row \u201cRenewal charge | 500 | Setup charge | 200\u201d<\/em> turns into one line_df<\/code> row with that markdown textual content. The header row will get a | --- | --- | ... |<\/code> separator so a downstream mannequin reads the construction again.<\/p>\n1.2. Pictures: fitz returns the bbox, Azure returns the textual content<\/h3>\nMany PDFs have figures with textual content inside them. Structure diagrams with field labels. Charts with axis ticks and legends. Signed seal stamps. Embedded screenshots of spreadsheets. Fitz returns every picture as a bbox and the uncooked bytes. The textual content inside is invisible to the parser.<\/p>\n Azure\u2019s OCR runs on each web page, together with the pixels inside determine areas. For every determine, we acquire each Azure phrase whose bbox sits contained in the determine area and be part of them as ocr_text<\/code>. \u201cMulti-Head Consideration Concat Linear h\u201d<\/em> now lives in image_df.ocr_text<\/code> for the determine on web page 4 of the Consideration paper. Retrieval can match a query about \u201cmulti-head consideration\u201d<\/em> even when the reply is textual content inside a determine.<\/p>\nfitz returns the bbox and an empty textual content cell; Azure\u2019s OCR recovers the labels printed contained in the determine \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n1.3. Scanned pages: fitz returns nothing, Azure returns OCR<\/h3>\nA 30-page native contract will get a 10-page scanned modification glued on the finish. Fitz reads the native pages and returns empty strings for the scanned ones. The parser doesn’t flag this. The downstream pipeline silently covers 75% of the doc. The person has no concept 25% is lacking.<\/p>\n Azure runs OCR on each web page no matter supply. Native pages and scanned pages come again via the identical consequence.pages[i].traces<\/code> path with the identical form. The parsing_method<\/code> column on line_df<\/code> lets downstream code inform which engine produced which rows. The parsing_summary<\/code> dict has a n_pages<\/code> area that matches the doc\u2019s precise web page depend, not simply the pages with native textual content.<\/p>\na scan is pixels, not characters; fitz has no textual content layer to learn, Azure OCRs the web page \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n1.4. Captions and headings: fitz makes use of regex, Azure has express roles<\/h3>\nFitz detects determine \/ desk captions by regex on the beginning of every line (^Determine d+b<\/code>, ^Desk d+b<\/code>). It really works when captions appear to be \u201cDetermine 2\u201d<\/em> and misses the remaining (\u201cFig. 2\u201d<\/em>, multi-line wraps). It additionally has false positives: a body-text sentence that begins with \u201cDetermine 2\u201d<\/em> will get picked up as a caption when it’s a point out.<\/p>\nthe 2 failure modes of caption-by-regex (a missed \u201cFig.\u201d caption, a physique point out wrongly flagged) that Azure\u2019s paragraph function avoids \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nAzure\u2019s paragraphs<\/code> area has function labels: every paragraph within the consequence carries a tag like \"figureCaption\"<\/code>, \"tableCaption\"<\/code>, \"title\"<\/code>, or \"sectionHeading\"<\/code> that tells us what sort of block it’s, with none regex. \"figureCaption\"<\/code> and \"tableCaption\"<\/code> populate object_registry<\/code> immediately. \"title\"<\/code> and \"sectionHeading\"<\/code> rebuild the TOC. The tag is Azure\u2019s structure mannequin naming the block\u2019s operate; fitz has no equal. The (object_type, object_id)<\/code> be part of key remains to be extracted by the identical regex on the caption textual content so cross_ref_df<\/code> joins again the identical approach.<\/p>\n The TOC is the extra attention-grabbing case. Fitz\u2019s build_toc_df<\/code> reads native bookmarks (doc.get_toc()<\/code>). When the PDF has no native bookmarks, fitz returns an empty TOC. That is the widespread enterprise case: Phrase exports, scanned paperwork, PDFs from kind turbines. Azure reconstructs the TOC from paragraph roles. Each \"title\"<\/code> paragraph turns into a level-1 entry, each \"sectionHeading\"<\/code> paragraph turns into level-2. The hierarchy comes from the order they seem. This isn’t excellent, nevertheless it produces a usable TOC the place fitz would produce nothing.<\/p>\n2. Identical contract, richer information<\/h2>\nOne operate. The identical tables as parse_pdf<\/code>, in the identical form. One Azure name shared by each builder. That decision is small: level the SDK on the doc with one model_id<\/code>, prebuilt-layout<\/code>. (The opposite prebuilt mannequin, prebuilt-read<\/code>, is OCR solely; the structure mannequin is the one which additionally returns tables, paragraph roles, and studying order.)<\/p>\nfrom azure.ai.documentintelligence import DocumentIntelligenceClient\nfrom azure.ai.documentintelligence.fashions import AnalyzeDocumentRequest\nfrom azure.core.credentials import AzureKeyCredential\n\nconsumer = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))\n\n# \"Structure\" = the prebuilt-layout mannequin (NOT prebuilt-read, which is OCR solely)\nwith open(\"contract.pdf\", \"rb\") as f:\n poller = consumer.begin_analyze_document(\n \"prebuilt-layout\",\n AnalyzeDocumentRequest(bytes_source=f.learn()),\n )\n\nconsequence = poller.consequence() # tables, paragraph roles, OCR, studying order<\/code><\/pre>\nparse_pdf_azure_layout<\/code> is the Azure twin of parse_pdf<\/code>: identical name form, identical dict of tables out, so each downstream brick reads it with out figuring out which engine ran. The physique is value a glance, as a result of it’s the form each engine within the collection follows: make one name, then one small builder per desk, and reuse the engine-agnostic builders for the tables that solely want line_df<\/code>.<\/p>\ndef parse_pdf_azure_layout(pdf_path):\n consequence = analyze_pdf(pdf_path) # one name, prebuilt-layout\n line_df = azure_layout_pdf_to_line_df(pdf_path, consequence=consequence)\n image_df = build_image_df_azure_layout(consequence) # + ocr_text\n toc_df = build_toc_df_azure_layout(consequence) # paragraph roles\n object_registry = build_object_registry_azure_layout(consequence) # function tags\n page_df = build_page_df(line_df) # reused fitz builder (line_df solely)\n cross_ref_df = build_cross_ref_df(line_df) # reused fitz builder (line_df solely)\n return {\"line_df\": line_df, \"image_df\": image_df, \"toc_df\": toc_df,\n \"object_registry\": object_registry, \"page_df\": page_df,\n \"cross_ref_df\": cross_ref_df, \"span_df\": pd.DataFrame(),\n \"parsing_summary\": parsing_summary}<\/code><\/pre>\nStudying it high to backside: one analyze_pdf<\/code> makes the Azure name as soon as, then one small builder per desk reads that shared consequence<\/code>, and the 2 tables that solely want line_df<\/code>, page_df<\/code> and cross_ref_df<\/code>, are produced by the exact same fitz builders the native parser makes use of. The dict on the finish is the contract each engine returns.<\/p>\nThe identical tables mirror parse_pdf<\/code>, with per-row diffs vs fitz \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n3. What every desk beneficial properties<\/h2>\n3.1. line_df beneficial properties table-cell rows, picture OCR, choice marks<\/h3>\nA 4-column \u201cSchedule of Costs\u201d desk turns into 6 rows in line_df<\/code>: the header row, the markdown separator, and 4 information rows.<\/p>\nEvery supply row turns into a line_df<\/code> row; column construction carried contained in the markdown textual content \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nWe hold the cells inside line_df<\/code> as an alternative of including a separate table_cells_df<\/code>. One desk for each downstream brick to learn; paragraph traces and desk rows look the identical on the best way out. The fee: per-cell queries want a markdown parse step. For RAG questions that is superb. The retriever matches key phrases on the row textual content. The LLM reads the markdown immediately.<\/p>\n OCR textual content from inside photographs additionally lands in line_df<\/code> as further rows. Azure\u2019s consequence.pages[i].traces<\/code> already contains traces that fall inside determine areas, so the line-builder picks them up robotically. Choice marks (checkboxes) turn into single-character traces: [x]<\/code> for chosen, [ ]<\/code> for unselected. Varieties with check-the-box fields turn into queryable.<\/p>\n3.2. image_df beneficial properties an ocr_text<\/code> column<\/h3>\nIdentical row, new column. For every detected determine, we listing each Azure phrase whose bbox overlaps the determine area by not less than 50% and be part of them as ocr_text<\/code>.<\/p>\nConsideration paper figures with their labels uncovered; textual content inside figures now retrievable \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nThe identical column on a fitz-produced image_df<\/code> is empty. The fitz parser doesn’t OCR photographs. When parsing_method == \"fitz\"<\/code>, the ocr_text<\/code> column is there for form parity however stays clean. Downstream code that checks ocr_text != \"\"<\/code> works the identical whether or not the row got here from fitz or Azure.<\/p>\n3.3. toc_df will get reconstructed from paragraph roles<\/h3>\nWhen the PDF has native bookmarks, the fitz build_toc_df<\/code> is actual and free: it reads what the creator wrote. When it doesn\u2019t (most enterprise paperwork), fitz returns an empty toc_df<\/code> and downstream levels lose the part construction.<\/p>\n The Azure builder walks consequence.paragraphs<\/code>, filters by function in {\"title\", \"sectionHeading\"}<\/code>, and assembles a TOC. Stage 1 = title, degree 2 = sectionHeading. The hierarchy comes from the order paragraphs seem within the doc. The identical start_page<\/code>, end_page<\/code>, start_y<\/code>, breadcrumb<\/code> columns because the fitz TOC. The lookback move that computes end_page<\/code> (the subsequent peer-or-ancestor\u2019s start_page<\/code>, or total_pages<\/code> for the final part) is an identical to the fitz one; the one distinction is the place the rows come from.<\/p>\n The reconstruction shouldn’t be excellent. Azure can not inform sub-section ranges aside past sectionHeading<\/code>. The hierarchy you get is two-deep at most. For many enterprise queries that is sufficient: a piece stamped \u201cSchedule of Costs\u201d<\/em> lets the LLM floor its reply to the correct part even with out the complete Article 14 > Schedule of Costs<\/em> path.<\/p>\n3.4. object_registry will get caption-role detection<\/h3>\nFitz detects captions by regex anchored at the beginning of a line: ^Determine d+b<\/code>, ^Desk d+b<\/code>. Two failure modes. False negatives when the caption format differs (Fig. 2.<\/code> as an alternative of Determine 2<\/code>, or a multi-line wrap that pushes the quantity off the primary line). False positives when a body-text sentence occurs to begin with \u201cDetermine 2 reveals\u2026\u201d<\/em>.<\/p>\n Azure skips the regex drawback. Its paragraphs<\/code> area tags \"figureCaption\"<\/code> and \"tableCaption\"<\/code> explicitly. We learn the function immediately. The (object_type, object_id)<\/code> be part of key into cross_ref_df<\/code> remains to be pulled from the caption textual content by the identical regex the fitz builder makes use of, so the be part of works the identical with both engine. The win is recall: Azure catches captions fitz misses. The fee stays the identical (one Azure name, the result’s reused throughout builders).<\/p>\n3.5. parsing_summary beneficial properties Azure-specific stats<\/h3>\nThree new fields land within the doc-level synthesis dict:<\/p>\n\nn_tables_detected<\/code>: what number of tables Azure discovered (zero on a pure-prose doc, non-zero on a contract with tables).<\/li>\n n_figures<\/code>: what number of figures the structure mannequin recognized.<\/li>\nn_selection_marks<\/code>: what number of checkboxes (stuffed or empty) Azure detected throughout all pages.<\/li>\n<\/ul>\nThese three counts make routing a doc simple. A 30-page doc with n_tables_detected = 18<\/code> seems like a contract and the desk construction issues. A doc with n_selection_marks = 0<\/code> might be not a kind. A doc with n_figures = 0<\/code> is text-only; no level working picture OCR.<\/p>\n3.6. page_df and cross_ref_df: unchanged<\/h3>\nTwo tables keep the identical form. page_df<\/code> and cross_ref_df<\/code> are constructed from line_df<\/code> alone, so the engine that produced line_df<\/code> is irrelevant. One implementation, two engines, no drift.<\/p>\n span_df<\/code> is empty beneath Azure. The structure mannequin doesn’t expose sub-line typography (per-word daring or italic). If you want spans for heading detection or time period emphasis, keep on fitz for that doc. The 2 engines complement one another.<\/p>\n4. The parsing_method column: provenance for adaptive parsing<\/h2>\nEach per-row desk from parse_pdf_azure_layout<\/code> carries parsing_method == \"azure_layout\"<\/code>. Each per-row desk from parse_pdf<\/code> (the fitz one) carries parsing_method == \"fitz\"<\/code>. Identical column, identical title, each engines. The purpose is downstream.<\/p>\nContract on fitz, web page 14 re-parsed with Azure; each engines coexist by way of parsing_method<\/code> \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nThat is what adaptive parsing (Article 10) consumes. The default move makes use of fitz. Pages that fail a pre-parse verify (desk area detected with no rows extracted, image-heavy web page with sparse textual content, OCR layer with low high quality) get re-parsed by Azure. The re-parsed rows exchange or append to the unique line_df<\/code> rows. The parsing_method<\/code> column retains the path.<\/p>\n Three downstream patterns the column allows:<\/p>\n\nDe-duplication<\/strong>: when the identical web page bought each passes, hold azure rows over fitz rows (df.sort_values(\"parsing_method\").drop_duplicates([\"page_num\", \"line_num\"], hold=\"first\")<\/code> if \"azure_layout\" < \"fitz\"<\/code> lexicographically, or use an express priority map).<\/li>\n Audit<\/strong>: a query that lands on a row with parsing_method == \"azure_layout\"<\/code> prices extra to confirm (Azure was wanted). The reply\u2019s confidence weighting can use this.<\/li>\nPrice accounting<\/strong>: (line_df.parsing_method == \"azure_layout\").any()<\/code> per web page tells you which of them pages went via Azure and how one can invoice the parsing time.<\/li>\n<\/ul>\n5. Price and latency<\/h2>\nAzure shouldn’t be free. Three numbers matter.<\/p>\n Latency<\/strong>: one web page via prebuilt-layout<\/code> returns in 2 to 4 seconds. A 30-page doc takes 60 to 120 seconds. Fitz parses the identical doc in beneath a second. When the person is ready for a question, parse with fitz first. Escalate to Azure solely on pages fitz dealt with poorly.<\/p>\n Cash<\/strong>: Azure costs per web page. The prebuilt-layout<\/code> tier is round US$10 per 1,000 pages as we speak. A 30-page contract prices roughly US$0.30. Parsing 1,000 such contracts a day is US$300\/day if each web page goes via Azure. Limiting Azure to the pages that want it brings this down by 10x or extra.<\/p>\n Limits<\/strong>: the per-call PDF measurement restrict is 500 MB or 2,000 pages, whichever comes first. Bigger paperwork should be break up. The free tier (F0) permits 500 pages per thirty days and is ok for growth. Manufacturing often wants S0.<\/p>\n The order of magnitude is secure: fitz is free, Azure prices roughly a cent per web page. The precise tier costs change with area and time: deal with the numbers above as a calibration, not a contract. Article 10 picks which engine runs.<\/p>\n6. When to name which<\/h2>\nDefault to fitz. Escalate to Azure when a particular sign says fitz shouldn’t be sufficient.<\/p>\n Three indicators value wiring:<\/p>\n\nThe web page has a desk area however fitz extracted few or no row-like constructions.<\/strong> Compute on line_df<\/code>: cluster traces by y-coordinate, search for runs of brief uniform-spaced traces (an indication of cells). If the web page metadata says \u201cdesk detected\u201d (from fitz\u2019s web page.find_tables()<\/code>) however the line sample doesn’t look table-like, escalate.<\/li>\n The web page is image-heavy with sparse textual content.<\/strong> image_df<\/code> for the web page covers greater than 80% of the web page space and line_df<\/code> has fewer than 10 rows on that web page. Scanned web page with no OCR layer, or a web page that’s one massive diagram with textual content inside. Both case wants Azure.<\/li>\nThe OCR high quality rating is low:<\/strong> When fitz\u2019s web page.get_text(\"textual content\")<\/code> returns scrambled OCR (excessive ratio of Unicode alternative characters, low dictionary-word ratio), re-OCR with Azure. The text_quality_score<\/code> is computed in pre_parse_signals<\/code> and browse by the dispatcher.<\/li>\n<\/ol>\nA fourth sign is easier. If the doc has no native TOC (fitz.toc_df.empty<\/code>) and technology wants part context, run the doc as soon as via Azure to get a reconstructed TOC. One price per doc, not per question.<\/p>\n Article 10 builds the complete dispatcher. The parsing_method<\/code> column is what lets each downstream stage learn which engine ran on which row.<\/p>\n7. Conclusion<\/h2>\nTwo engines, one contract: the identical relational tables out, identical downstream code no matter which one ran.<\/p>\nEach functionality that issues for enterprise RAG, plus velocity and value \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nA parser doesn’t return textual content; it returns a mannequin of the doc. Azure makes that mannequin richer (cell-level tables, OCR inside figures, captions tagged by function, TOC reconstructed with out bookmarks) at 2 to 4 seconds and ~US$0.01 per web page. Fitz prices nothing and runs in milliseconds. The routing rule is easy: fitz by default, Azure when an upstream sign says fitz shouldn’t be sufficient. Article 10 wires the dispatcher.<\/p>\n8. Sources and additional studying<\/h2>\nThe prebuilt-layout<\/code> mannequin behind parse_pdf_azure_layout<\/code> is documented by Microsoft and rests on cell-level desk extraction analysis (Smock et al.\u00a02022) plus a paragraph-role layer that converts visible areas into structural roles. Docling (Article 5ter) is the open-source equal of the identical cascade; it provides the identical desk contract on native {hardware}, helpful when paperwork can not depart the constructing.<\/p>\n Identical route because the article:<\/strong><\/p>\n\nMicrosoft, Azure AI Doc Intelligence. Structure mannequin<\/a><\/em>. Official documentation for prebuilt-layout<\/code>, the mannequin behind parse_pdf_azure_layout<\/code>. The cell-level desk output, paragraph roles, and OCR protection all originate right here.<\/li>\n Smock, Pesala, Abraham, PubTables-1M \/ Desk Transformer (TATR)<\/em>, CVPR 2022 (arXiv:2110.00061<\/a>). The analysis behind the cell-level desk extraction Azure ships; helpful for understanding what azure_layout<\/code> is doing beneath the hood.<\/li>\n<\/ul>\nCompletely different angle, totally different context:<\/strong><\/p>\n\nAuer et al., Docling Technical Report<\/em>, IBM Analysis 2024 (arXiv:2408.09869<\/a>). Open-source native equal of the Azure structure cascade. Identical desk contract; trades cloud price for native compute. The suitable selection when confidentiality blocks the cloud add that Azure requires.<\/li>\n<\/ul>\nEarlier within the collection:<\/strong><\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":" companion in Enterprise Doc Intelligence, the collection that builds an enterprise RAG system from 4 bricks. Article 5 (doc parsing) constructed the parser with PyMuPDF (fitz). This companion retains the identical objective and the identical relational tables, and swaps the engine for Azure Structure (the prebuilt-layout mannequin), a richer package deal that recovers what fitz […]<\/p>\n","protected":false},"author":2,"featured_media":15690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[1126,2977,9408,4114,9406,1729,9407],"class_list":["post-15688","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-azure","tag-layout","tag-parse","tag-pdfs","tag-pymupdf","tag-rag","tag-table"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15688"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688\/revisions"}],"predecessor-version":[{"id":15689,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688\/revisions\/15689"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/15690"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15688"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15688"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15688"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}

the place this companion sits: it extends Article 5 (doc parsing), inside Half II (the 4 bricks), with a special parsing engine \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n
PyMuPDF (fitz) is quick, free, and actual on clear prose. It additionally goes blind in three locations, and each is the place enterprise RAG quietly breaks.<\/p>\n
The desk on web page 14 of a contract. Fitz reads the cells one after the other and concatenates them. The column construction is gone. *\u201cRenewal charge 500 Setup charge 200\u201d<\/em> lands within the chunk. Your mannequin is requested to guess which quantity is which charge.<\/p>\n*
The scanned modification glued to the tip of the doc. Fitz reads the native pages and returns empty strings on the scanned ones. The person will get no reply on the modification as a result of the parser by no means learn it.<\/p>\n
The determine with textual content inside. A chart with axis labels. A signed seal stamp. A screenshot of a spreadsheet. Fitz returns the bbox of the picture. The textual content inside is gone.<\/p>\n
Azure Doc Intelligence<\/a> reads all three. It\u2019s a proprietary Microsoft Azure cloud service ruled by Microsoft\u2019s On-line Companies Phrases<\/a>. The `prebuilt-layout<\/code> mannequin returns native desk cells (rows, columns, headers), OCR textual content for each web page (native or scanned), figures with the textual content inside them, and paragraph roles (title<\/code>, sectionHeading<\/code>, figureCaption<\/code>, tableCaption<\/code>). One name. The identical relational tables as fitz, half of them enriched.<\/p>\n`
*`The downstream pipeline doesn’t care which engine produced the dict. Retrieval, technology, annotation learn rows. They by no means learn the PDF.<\/p>\n`*
The identical tables, Azure enriches half \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n1. The place fitz is blind<\/h2>\n4 circumstances. In each, fitz misses and Azure works.<\/p>\n1.1. Tables: fitz returns flat phrases, Azure returns cells<\/h3>\nA contract desk has rows and columns. The label \u201cRenewal charge\u201d<\/em> sits in column 1, the worth 500<\/em> sits in column 2. Fitz reads the web page high to backside and emits one line per textual content section. The 4 cells of a row come again as 4 free phrases. Generally the cells from the row under get combined in if the y-coordinates are shut. The chunker downstream sees a soup of phrases. The row-and-column construction that makes a desk a desk is gone.<\/p>\n Azure\u2019s prebuilt-layout<\/code> mannequin detects every desk as a structured object. consequence.tables<\/code> is an inventory of tables, every with cells<\/code> listed by (row_index, column_index)<\/code>. The header row is flagged (cell.sort == \"columnHeader\"<\/code>). The cell content material is the cell textual content, precisely because the creator typed it. We flatten the desk into markdown rows so it lives inside line_df<\/code> like some other content material. A four-cell row \u201cRenewal charge | 500 | Setup charge | 200\u201d<\/em> turns into one line_df<\/code> row with that markdown textual content. The header row will get a | --- | --- | ... |<\/code> separator so a downstream mannequin reads the construction again.<\/p>\n1.2. Pictures: fitz returns the bbox, Azure returns the textual content<\/h3>\nMany PDFs have figures with textual content inside them. Structure diagrams with field labels. Charts with axis ticks and legends. Signed seal stamps. Embedded screenshots of spreadsheets. Fitz returns every picture as a bbox and the uncooked bytes. The textual content inside is invisible to the parser.<\/p>\n Azure\u2019s OCR runs on each web page, together with the pixels inside determine areas. For every determine, we acquire each Azure phrase whose bbox sits contained in the determine area and be part of them as ocr_text<\/code>. \u201cMulti-Head Consideration Concat Linear h\u201d<\/em> now lives in image_df.ocr_text<\/code> for the determine on web page 4 of the Consideration paper. Retrieval can match a query about \u201cmulti-head consideration\u201d<\/em> even when the reply is textual content inside a determine.<\/p>\nfitz returns the bbox and an empty textual content cell; Azure\u2019s OCR recovers the labels printed contained in the determine \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n1.3. Scanned pages: fitz returns nothing, Azure returns OCR<\/h3>\nA 30-page native contract will get a 10-page scanned modification glued on the finish. Fitz reads the native pages and returns empty strings for the scanned ones. The parser doesn’t flag this. The downstream pipeline silently covers 75% of the doc. The person has no concept 25% is lacking.<\/p>\n Azure runs OCR on each web page no matter supply. Native pages and scanned pages come again via the identical consequence.pages[i].traces<\/code> path with the identical form. The parsing_method<\/code> column on line_df<\/code> lets downstream code inform which engine produced which rows. The parsing_summary<\/code> dict has a n_pages<\/code> area that matches the doc\u2019s precise web page depend, not simply the pages with native textual content.<\/p>\na scan is pixels, not characters; fitz has no textual content layer to learn, Azure OCRs the web page \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n1.4. Captions and headings: fitz makes use of regex, Azure has express roles<\/h3>\nFitz detects determine \/ desk captions by regex on the beginning of every line (^Determine d+b<\/code>, ^Desk d+b<\/code>). It really works when captions appear to be \u201cDetermine 2\u201d<\/em> and misses the remaining (\u201cFig. 2\u201d<\/em>, multi-line wraps). It additionally has false positives: a body-text sentence that begins with \u201cDetermine 2\u201d<\/em> will get picked up as a caption when it’s a point out.<\/p>\nthe 2 failure modes of caption-by-regex (a missed \u201cFig.\u201d caption, a physique point out wrongly flagged) that Azure\u2019s paragraph function avoids \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nAzure\u2019s paragraphs<\/code> area has function labels: every paragraph within the consequence carries a tag like \"figureCaption\"<\/code>, \"tableCaption\"<\/code>, \"title\"<\/code>, or \"sectionHeading\"<\/code> that tells us what sort of block it’s, with none regex. \"figureCaption\"<\/code> and \"tableCaption\"<\/code> populate object_registry<\/code> immediately. \"title\"<\/code> and \"sectionHeading\"<\/code> rebuild the TOC. The tag is Azure\u2019s structure mannequin naming the block\u2019s operate; fitz has no equal. The (object_type, object_id)<\/code> be part of key remains to be extracted by the identical regex on the caption textual content so cross_ref_df<\/code> joins again the identical approach.<\/p>\n The TOC is the extra attention-grabbing case. Fitz\u2019s build_toc_df<\/code> reads native bookmarks (doc.get_toc()<\/code>). When the PDF has no native bookmarks, fitz returns an empty TOC. That is the widespread enterprise case: Phrase exports, scanned paperwork, PDFs from kind turbines. Azure reconstructs the TOC from paragraph roles. Each \"title\"<\/code> paragraph turns into a level-1 entry, each \"sectionHeading\"<\/code> paragraph turns into level-2. The hierarchy comes from the order they seem. This isn’t excellent, nevertheless it produces a usable TOC the place fitz would produce nothing.<\/p>\n2. Identical contract, richer information<\/h2>\nOne operate. The identical tables as parse_pdf<\/code>, in the identical form. One Azure name shared by each builder. That decision is small: level the SDK on the doc with one model_id<\/code>, prebuilt-layout<\/code>. (The opposite prebuilt mannequin, prebuilt-read<\/code>, is OCR solely; the structure mannequin is the one which additionally returns tables, paragraph roles, and studying order.)<\/p>\nfrom azure.ai.documentintelligence import DocumentIntelligenceClient\nfrom azure.ai.documentintelligence.fashions import AnalyzeDocumentRequest\nfrom azure.core.credentials import AzureKeyCredential\n\nconsumer = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))\n\n# \"Structure\" = the prebuilt-layout mannequin (NOT prebuilt-read, which is OCR solely)\nwith open(\"contract.pdf\", \"rb\") as f:\n poller = consumer.begin_analyze_document(\n \"prebuilt-layout\",\n AnalyzeDocumentRequest(bytes_source=f.learn()),\n )\n\nconsequence = poller.consequence() # tables, paragraph roles, OCR, studying order<\/code><\/pre>\nparse_pdf_azure_layout<\/code> is the Azure twin of parse_pdf<\/code>: identical name form, identical dict of tables out, so each downstream brick reads it with out figuring out which engine ran. The physique is value a glance, as a result of it’s the form each engine within the collection follows: make one name, then one small builder per desk, and reuse the engine-agnostic builders for the tables that solely want line_df<\/code>.<\/p>\ndef parse_pdf_azure_layout(pdf_path):\n consequence = analyze_pdf(pdf_path) # one name, prebuilt-layout\n line_df = azure_layout_pdf_to_line_df(pdf_path, consequence=consequence)\n image_df = build_image_df_azure_layout(consequence) # + ocr_text\n toc_df = build_toc_df_azure_layout(consequence) # paragraph roles\n object_registry = build_object_registry_azure_layout(consequence) # function tags\n page_df = build_page_df(line_df) # reused fitz builder (line_df solely)\n cross_ref_df = build_cross_ref_df(line_df) # reused fitz builder (line_df solely)\n return {\"line_df\": line_df, \"image_df\": image_df, \"toc_df\": toc_df,\n \"object_registry\": object_registry, \"page_df\": page_df,\n \"cross_ref_df\": cross_ref_df, \"span_df\": pd.DataFrame(),\n \"parsing_summary\": parsing_summary}<\/code><\/pre>\nStudying it high to backside: one analyze_pdf<\/code> makes the Azure name as soon as, then one small builder per desk reads that shared consequence<\/code>, and the 2 tables that solely want line_df<\/code>, page_df<\/code> and cross_ref_df<\/code>, are produced by the exact same fitz builders the native parser makes use of. The dict on the finish is the contract each engine returns.<\/p>\nThe identical tables mirror parse_pdf<\/code>, with per-row diffs vs fitz \u2013 Picture by creator<\/em><\/figcaption><\/figure>\n3. What every desk beneficial properties<\/h2>\n3.1. line_df beneficial properties table-cell rows, picture OCR, choice marks<\/h3>\nA 4-column \u201cSchedule of Costs\u201d desk turns into 6 rows in line_df<\/code>: the header row, the markdown separator, and 4 information rows.<\/p>\nEvery supply row turns into a line_df<\/code> row; column construction carried contained in the markdown textual content \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nWe hold the cells inside line_df<\/code> as an alternative of including a separate table_cells_df<\/code>. One desk for each downstream brick to learn; paragraph traces and desk rows look the identical on the best way out. The fee: per-cell queries want a markdown parse step. For RAG questions that is superb. The retriever matches key phrases on the row textual content. The LLM reads the markdown immediately.<\/p>\n OCR textual content from inside photographs additionally lands in line_df<\/code> as further rows. Azure\u2019s consequence.pages[i].traces<\/code> already contains traces that fall inside determine areas, so the line-builder picks them up robotically. Choice marks (checkboxes) turn into single-character traces: [x]<\/code> for chosen, [ ]<\/code> for unselected. Varieties with check-the-box fields turn into queryable.<\/p>\n3.2. image_df beneficial properties an ocr_text<\/code> column<\/h3>\nIdentical row, new column. For every detected determine, we listing each Azure phrase whose bbox overlaps the determine area by not less than 50% and be part of them as ocr_text<\/code>.<\/p>\nConsideration paper figures with their labels uncovered; textual content inside figures now retrievable \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nThe identical column on a fitz-produced image_df<\/code> is empty. The fitz parser doesn’t OCR photographs. When parsing_method == \"fitz\"<\/code>, the ocr_text<\/code> column is there for form parity however stays clean. Downstream code that checks ocr_text != \"\"<\/code> works the identical whether or not the row got here from fitz or Azure.<\/p>\n3.3. toc_df will get reconstructed from paragraph roles<\/h3>\nWhen the PDF has native bookmarks, the fitz build_toc_df<\/code> is actual and free: it reads what the creator wrote. When it doesn\u2019t (most enterprise paperwork), fitz returns an empty toc_df<\/code> and downstream levels lose the part construction.<\/p>\n The Azure builder walks consequence.paragraphs<\/code>, filters by function in {\"title\", \"sectionHeading\"}<\/code>, and assembles a TOC. Stage 1 = title, degree 2 = sectionHeading. The hierarchy comes from the order paragraphs seem within the doc. The identical start_page<\/code>, end_page<\/code>, start_y<\/code>, breadcrumb<\/code> columns because the fitz TOC. The lookback move that computes end_page<\/code> (the subsequent peer-or-ancestor\u2019s start_page<\/code>, or total_pages<\/code> for the final part) is an identical to the fitz one; the one distinction is the place the rows come from.<\/p>\n The reconstruction shouldn’t be excellent. Azure can not inform sub-section ranges aside past sectionHeading<\/code>. The hierarchy you get is two-deep at most. For many enterprise queries that is sufficient: a piece stamped \u201cSchedule of Costs\u201d<\/em> lets the LLM floor its reply to the correct part even with out the complete Article 14 > Schedule of Costs<\/em> path.<\/p>\n3.4. object_registry will get caption-role detection<\/h3>\nFitz detects captions by regex anchored at the beginning of a line: ^Determine d+b<\/code>, ^Desk d+b<\/code>. Two failure modes. False negatives when the caption format differs (Fig. 2.<\/code> as an alternative of Determine 2<\/code>, or a multi-line wrap that pushes the quantity off the primary line). False positives when a body-text sentence occurs to begin with \u201cDetermine 2 reveals\u2026\u201d<\/em>.<\/p>\n Azure skips the regex drawback. Its paragraphs<\/code> area tags \"figureCaption\"<\/code> and \"tableCaption\"<\/code> explicitly. We learn the function immediately. The (object_type, object_id)<\/code> be part of key into cross_ref_df<\/code> remains to be pulled from the caption textual content by the identical regex the fitz builder makes use of, so the be part of works the identical with both engine. The win is recall: Azure catches captions fitz misses. The fee stays the identical (one Azure name, the result’s reused throughout builders).<\/p>\n3.5. parsing_summary beneficial properties Azure-specific stats<\/h3>\nThree new fields land within the doc-level synthesis dict:<\/p>\n\nn_tables_detected<\/code>: what number of tables Azure discovered (zero on a pure-prose doc, non-zero on a contract with tables).<\/li>\n n_figures<\/code>: what number of figures the structure mannequin recognized.<\/li>\nn_selection_marks<\/code>: what number of checkboxes (stuffed or empty) Azure detected throughout all pages.<\/li>\n<\/ul>\nThese three counts make routing a doc simple. A 30-page doc with n_tables_detected = 18<\/code> seems like a contract and the desk construction issues. A doc with n_selection_marks = 0<\/code> might be not a kind. A doc with n_figures = 0<\/code> is text-only; no level working picture OCR.<\/p>\n3.6. page_df and cross_ref_df: unchanged<\/h3>\nTwo tables keep the identical form. page_df<\/code> and cross_ref_df<\/code> are constructed from line_df<\/code> alone, so the engine that produced line_df<\/code> is irrelevant. One implementation, two engines, no drift.<\/p>\n span_df<\/code> is empty beneath Azure. The structure mannequin doesn’t expose sub-line typography (per-word daring or italic). If you want spans for heading detection or time period emphasis, keep on fitz for that doc. The 2 engines complement one another.<\/p>\n4. The parsing_method column: provenance for adaptive parsing<\/h2>\nEach per-row desk from parse_pdf_azure_layout<\/code> carries parsing_method == \"azure_layout\"<\/code>. Each per-row desk from parse_pdf<\/code> (the fitz one) carries parsing_method == \"fitz\"<\/code>. Identical column, identical title, each engines. The purpose is downstream.<\/p>\nContract on fitz, web page 14 re-parsed with Azure; each engines coexist by way of parsing_method<\/code> \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nThat is what adaptive parsing (Article 10) consumes. The default move makes use of fitz. Pages that fail a pre-parse verify (desk area detected with no rows extracted, image-heavy web page with sparse textual content, OCR layer with low high quality) get re-parsed by Azure. The re-parsed rows exchange or append to the unique line_df<\/code> rows. The parsing_method<\/code> column retains the path.<\/p>\n Three downstream patterns the column allows:<\/p>\n\nDe-duplication<\/strong>: when the identical web page bought each passes, hold azure rows over fitz rows (df.sort_values(\"parsing_method\").drop_duplicates([\"page_num\", \"line_num\"], hold=\"first\")<\/code> if \"azure_layout\" < \"fitz\"<\/code> lexicographically, or use an express priority map).<\/li>\n Audit<\/strong>: a query that lands on a row with parsing_method == \"azure_layout\"<\/code> prices extra to confirm (Azure was wanted). The reply\u2019s confidence weighting can use this.<\/li>\nPrice accounting<\/strong>: (line_df.parsing_method == \"azure_layout\").any()<\/code> per web page tells you which of them pages went via Azure and how one can invoice the parsing time.<\/li>\n<\/ul>\n5. Price and latency<\/h2>\nAzure shouldn’t be free. Three numbers matter.<\/p>\n Latency<\/strong>: one web page via prebuilt-layout<\/code> returns in 2 to 4 seconds. A 30-page doc takes 60 to 120 seconds. Fitz parses the identical doc in beneath a second. When the person is ready for a question, parse with fitz first. Escalate to Azure solely on pages fitz dealt with poorly.<\/p>\n Cash<\/strong>: Azure costs per web page. The prebuilt-layout<\/code> tier is round US$10 per 1,000 pages as we speak. A 30-page contract prices roughly US$0.30. Parsing 1,000 such contracts a day is US$300\/day if each web page goes via Azure. Limiting Azure to the pages that want it brings this down by 10x or extra.<\/p>\n Limits<\/strong>: the per-call PDF measurement restrict is 500 MB or 2,000 pages, whichever comes first. Bigger paperwork should be break up. The free tier (F0) permits 500 pages per thirty days and is ok for growth. Manufacturing often wants S0.<\/p>\n The order of magnitude is secure: fitz is free, Azure prices roughly a cent per web page. The precise tier costs change with area and time: deal with the numbers above as a calibration, not a contract. Article 10 picks which engine runs.<\/p>\n6. When to name which<\/h2>\nDefault to fitz. Escalate to Azure when a particular sign says fitz shouldn’t be sufficient.<\/p>\n Three indicators value wiring:<\/p>\n\nThe web page has a desk area however fitz extracted few or no row-like constructions.<\/strong> Compute on line_df<\/code>: cluster traces by y-coordinate, search for runs of brief uniform-spaced traces (an indication of cells). If the web page metadata says \u201cdesk detected\u201d (from fitz\u2019s web page.find_tables()<\/code>) however the line sample doesn’t look table-like, escalate.<\/li>\n The web page is image-heavy with sparse textual content.<\/strong> image_df<\/code> for the web page covers greater than 80% of the web page space and line_df<\/code> has fewer than 10 rows on that web page. Scanned web page with no OCR layer, or a web page that’s one massive diagram with textual content inside. Both case wants Azure.<\/li>\nThe OCR high quality rating is low:<\/strong> When fitz\u2019s web page.get_text(\"textual content\")<\/code> returns scrambled OCR (excessive ratio of Unicode alternative characters, low dictionary-word ratio), re-OCR with Azure. The text_quality_score<\/code> is computed in pre_parse_signals<\/code> and browse by the dispatcher.<\/li>\n<\/ol>\nA fourth sign is easier. If the doc has no native TOC (fitz.toc_df.empty<\/code>) and technology wants part context, run the doc as soon as via Azure to get a reconstructed TOC. One price per doc, not per question.<\/p>\n Article 10 builds the complete dispatcher. The parsing_method<\/code> column is what lets each downstream stage learn which engine ran on which row.<\/p>\n7. Conclusion<\/h2>\nTwo engines, one contract: the identical relational tables out, identical downstream code no matter which one ran.<\/p>\nEach functionality that issues for enterprise RAG, plus velocity and value \u2013 Picture by creator<\/em><\/figcaption><\/figure>\nA parser doesn’t return textual content; it returns a mannequin of the doc. Azure makes that mannequin richer (cell-level tables, OCR inside figures, captions tagged by function, TOC reconstructed with out bookmarks) at 2 to 4 seconds and ~US$0.01 per web page. Fitz prices nothing and runs in milliseconds. The routing rule is easy: fitz by default, Azure when an upstream sign says fitz shouldn’t be sufficient. Article 10 wires the dispatcher.<\/p>\n8. Sources and additional studying<\/h2>\nThe prebuilt-layout<\/code> mannequin behind parse_pdf_azure_layout<\/code> is documented by Microsoft and rests on cell-level desk extraction analysis (Smock et al.\u00a02022) plus a paragraph-role layer that converts visible areas into structural roles. Docling (Article 5ter) is the open-source equal of the identical cascade; it provides the identical desk contract on native {hardware}, helpful when paperwork can not depart the constructing.<\/p>\n Identical route because the article:<\/strong><\/p>\n\nMicrosoft, Azure AI Doc Intelligence. Structure mannequin<\/a><\/em>. Official documentation for prebuilt-layout<\/code>, the mannequin behind parse_pdf_azure_layout<\/code>. The cell-level desk output, paragraph roles, and OCR protection all originate right here.<\/li>\n Smock, Pesala, Abraham, PubTables-1M \/ Desk Transformer (TATR)<\/em>, CVPR 2022 (arXiv:2110.00061<\/a>). The analysis behind the cell-level desk extraction Azure ships; helpful for understanding what azure_layout<\/code> is doing beneath the hood.<\/li>\n<\/ul>\nCompletely different angle, totally different context:<\/strong><\/p>\n\nAuer et al., Docling Technical Report<\/em>, IBM Analysis 2024 (arXiv:2408.09869<\/a>). Open-source native equal of the Azure structure cascade. Identical desk contract; trades cloud price for native compute. The suitable selection when confidentiality blocks the cloud add that Azure requires.<\/li>\n<\/ul>\nEarlier within the collection:<\/strong><\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":" companion in Enterprise Doc Intelligence, the collection that builds an enterprise RAG system from 4 bricks. Article 5 (doc parsing) constructed the parser with PyMuPDF (fitz). This companion retains the identical objective and the identical relational tables, and swaps the engine for Azure Structure (the prebuilt-layout mannequin), a richer package deal that recovers what fitz […]<\/p>\n","protected":false},"author":2,"featured_media":15690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[1126,2977,9408,4114,9406,1729,9407],"class_list":["post-15688","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-azure","tag-layout","tag-parse","tag-pdfs","tag-pymupdf","tag-rag","tag-table"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15688"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688\/revisions"}],"predecessor-version":[{"id":15689,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15688\/revisions\/15689"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/15690"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15688"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15688"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15688"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}