{"id":14745,"date":"2026-05-14T03:51:19","date_gmt":"2026-05-14T03:51:19","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=14745"},"modified":"2026-05-14T03:51:20","modified_gmt":"2026-05-14T03:51:20","slug":"i-constructed-the-similar-b2b-doc-extractor-twice-guidelines-vs-llm","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=14745","title":{"rendered":"I Constructed the Similar B2B Doc Extractor Twice: Guidelines vs. LLM"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p class=\"wp-block-paragraph\"> state of affairs: You&#8217;re employed within the operations workforce of a medium-sized firm. On daily basis, your workforce processes order types from totally different B2B clients. All of them arrive as PDFs. And in idea, all of them include the identical data: buyer ID, buy order quantity, supply date, and the ordered objects.<\/p>\n<p class=\"wp-block-paragraph\">In apply, nonetheless, each doc appears barely totally different: One buyer locations the acquisition order quantity within the top-left nook, the subsequent one within the bottom-right nook. Some write \u201cPO Quantity\u201d, others use \u201cOrder ID\u201d, \u201cOrder Reference\u201d, or one thing fully totally different.<\/p>\n<p class=\"wp-block-paragraph\">For us people, that is often not an issue. We have a look at the doc, perceive the context, and instantly acknowledge which data is supposed.<\/p>\n<p class=\"wp-block-paragraph\">For conventional automation methods, nonetheless, this turns into tough: A regex rule can particularly seek for <em>\u201cPO Quantity: \u201c<\/em>. 
However what occurs if the subsequent buyer makes use of <em>\u201cOrder Reference: \u201c<\/em> as a substitute?<\/p>\n<p class=\"wp-block-paragraph\">That&#8217;s precisely the issue I recreated for this text.<\/p>\n<p class=\"wp-block-paragraph\">We examine two totally different approaches for extracting structured information from B2B order types:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">A conventional rule-based method utilizing <em>pytesseract<\/em> and regex guidelines<\/li>\n<li class=\"wp-block-list-item\">An LLM-based method utilizing <em>pytesseract<\/em>, <em>Ollama<\/em>, and <em>LLaMA 3<\/em><\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">The purpose of this text is to not present that LLMs are typically higher. They don&#8217;t seem to be all the time.<\/p>\n<p class=\"wp-block-paragraph\">A way more attention-grabbing query is: At what level do conventional extraction pipelines begin to attain their limits as complexity and the variety of totally different layouts enhance? 
And when can an LLM truly scale back upkeep effort?<\/p>\n<p class=\"wp-block-paragraph\"><em><strong>Desk of Contents<\/strong><br \/><a rel=\"nofollow\" target=\"_blank\" href=\"#pagejump1\" data-type=\"internal\" data-id=\"#pagejump1\">1 \u2013 Step-by-Step Information<\/a><br \/><a rel=\"nofollow\" target=\"_blank\" href=\"#pagejump2\" data-type=\"internal\" data-id=\"#pagejump2\">2 \u2013 Head-to-Head Comparability<\/a><br \/><a rel=\"nofollow\" target=\"_blank\" href=\"#pagejump3\" data-type=\"internal\" data-id=\"#pagejump3\">3 \u2013 When ought to we NOT use an LLM?<\/a><br \/><a rel=\"nofollow\" target=\"_blank\" href=\"#pagejump4\" data-type=\"internal\" data-id=\"#pagejump4\">4 \u2013 Closing Ideas<\/a><br \/><a rel=\"nofollow\" target=\"_blank\" href=\"#pagejump5\" data-type=\"internal\" data-id=\"#pagejump5\">The place to Proceed Studying?<\/a><\/em><\/p>\n<h2 class=\"wp-block-heading\" id=\"pagejump1\">1 \u2013 Step-by-Step Information<\/h2>\n<p class=\"wp-block-paragraph\">We rebuild each approaches step-by-step. First, we create two pattern PDFs containing the identical enterprise data however utilizing totally different layouts. Afterwards, we extract the info as soon as with a conventional OCR and regex pipeline and as soon as with an OCR and LLM pipeline. 
This enables us to match each approaches beneath equivalent situations.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The normal method principally asks:<br \/>\u201cCan I discover the precise sample that I programmed?\u201d<\/li>\n<li class=\"wp-block-list-item\">The LLM-based method as a substitute asks:<br \/>\u201cCan I perceive the that means of this discipline in context?\u201d<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><strong>\u2192 \ud83e\udd13 <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/Sari95\/ocr-pdf-extraction-regex-vs-llm\">Discover the total code within the GitHub Repo<\/a> \ud83e\udd13 \u2190<\/strong><\/p>\n<h3 class=\"wp-block-heading\">Earlier than We Begin \u2014 Mise en Place<\/h3>\n<p class=\"wp-block-paragraph\"><strong>pip vs. Anaconda<\/strong><\/p>\n<p class=\"wp-block-paragraph\">On this information, we use <em>pip<\/em>, Python\u2019s commonplace package deal supervisor. This implies we set up all libraries instantly by the command line utilizing <em>pip set up \u2026<\/em>. <em>pip<\/em> is already included robotically if you set up Python. If you already know Python tutorials that work with <em>Anaconda<\/em>, that&#8217;s merely one other solution to obtain the identical purpose (utilizing <em>conda set up \u2026<\/em>). Within the article <em>\u201c<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/python.plainenglish.io\/python-data-analysis-ecosystem-a-beginners-roadmap-adf22ba20ed2?sk=6e419d9aecd6ae760c42764d1d5b2ef4\">Python Knowledge Evaluation Ecosystem \u2014 A Newbie\u2019s Roadmap<\/a>\u201d<\/em>, you&#8217;ll find additional particulars about getting began with Python. 
Additionally, on Windows we use the CMD terminal (Windows key + R &gt; type cmd).<\/p>\n<p class=\"wp-block-paragraph\"><strong>Create and activate a new virtual environment<\/strong><br \/>Create a new Python environment with <code>python -m venv b2bdocumentextractor<\/code> (you can change the name) in a terminal and activate it with <code>b2bdocumentextractor\\Scripts\\activate<\/code>.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Optional: Check Python and pip<\/strong><\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">python --version\npip --version<\/code><\/pre>\n<p class=\"wp-block-paragraph\">You should see a Python version and a pip version.<\/p>\n<h4 class=\"wp-block-heading\">Step 1 \u2013 Install Tesseract<\/h4>\n<p class=\"wp-block-paragraph\"><em>Tesseract<\/em> is the OCR engine. It is the tool that actually reads text from images or scanned PDFs using OCR (Optical Character Recognition). <em>pytesseract<\/em> is only the Python bridge to Tesseract. In other words, our Python code communicates with Tesseract through <em>pytesseract<\/em>, but the actual text recognition is done by Tesseract itself. 
Without installing <em>Tesseract<\/em> first, <em>pytesseract<\/em> cannot work.<\/p>\n<p class=\"wp-block-paragraph\">First, we download the latest 64-bit Windows installer (.exe file) and run it:<br \/><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/UB-Mannheim\/tesseract\/wiki?utm_source=chatgpt.com\">GitHub \u2013 Tesseract at UB Mannheim<\/a><\/p>\n<p class=\"wp-block-paragraph\">Important: Remember the installation path:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">C:\\Program Files\\Tesseract-OCR<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Inside the CMD terminal, we verify the installation using the following command:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">\"C:\\Program Files\\Tesseract-OCR\\tesseract.exe\" --version<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If everything worked correctly, we should see the installed Tesseract version.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Tesseract-Installation-1024x375.png\" alt=\"This screenshot shows the terminal when the Tesseract Download was successful.\" class=\"wp-image-659063\"\/><\/figure>\n<h4 class=\"wp-block-heading\">Step 2 \u2013 Install Poppler<\/h4>\n<p class=\"wp-block-paragraph\">Next, we install the dependency for <em>pdf2image<\/em>. This is our library for converting PDFs into images, and it requires <em>Poppler<\/em> in the background. 
<em>Poppler<\/em> is an open-source PDF rendering library used to display PDF files.<\/p>\n<p class=\"wp-block-paragraph\">For this, we download the latest version of <em>Poppler<\/em>, extract the ZIP file, and move the extracted folder to the <code>C:<\/code> drive.<br \/><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/oschwartz10612\/poppler-windows\/releases?utm_source=chatgpt.com\">GitHub \u2013 Poppler Windows Releases<\/a><\/p>\n<p class=\"wp-block-paragraph\">Inside the folder, open <em>Library &gt; bin<\/em> and note the full path to this <em>bin<\/em> folder on your C: drive. On my machine, it looks like this:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">C:\\Users\\schue\\poppler-26.02.0\\Library\\bin<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Additionally, we add this path to the PATH variable so Windows knows where <em>Poppler<\/em> is located.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Hint for Beginners:<\/strong><br \/>Press the Windows key and search for <em>Edit environment variables<\/em>. Afterwards, click on <em>Edit the system environment variables<\/em>. Then click on <em>Environment Variables<\/em>. Under <em>User variables<\/em>, select the variable PATH, click on <em>Edit<\/em>, then <em>New<\/em>, and paste the path.<\/p>\n<p>Now restart CMD so the changes are applied.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Adding-PATH-on-Windows-1024x460.png\" alt=\"This screenshot shows how you can add a PATH Variable on Windows.\" class=\"wp-image-659064\"\/><\/figure>\n<h4 class=\"wp-block-heading\">Step 3 \u2013 Install Python Libraries<\/h4>\n<p class=\"wp-block-paragraph\">Now we install all the Python libraries we need. 
Make sure to reactivate the Python environment beforehand:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong><em>pytesseract<\/em>:<\/strong> We install this library as the bridge between Python and <em>Tesseract<\/em>. We already installed <em>Tesseract<\/em> as the OCR engine, but Python can only communicate with it through <em>pytesseract<\/em>.<\/li>\n<li class=\"wp-block-list-item\"><strong><em>pdf2image:<\/em><\/strong> <em>Tesseract<\/em> is an OCR engine, which means it recognizes text from pixels in an image. It cannot read PDF structures directly. <em>pdf2image<\/em> therefore performs an intermediate step: It renders each PDF page as an image, similar to a screenshot, so that <em>pytesseract<\/em> can analyze it afterwards. Note: If we had digital PDFs (meaning PDFs where you can select and copy text), we could extract the text directly using libraries such as <em>pdfplumber<\/em> or <em>PyMuPDF<\/em>. 
However, since we assume that B2B order forms are often scans in practice, we take the detour through <em>pdf2image<\/em>.<\/li>\n<li class=\"wp-block-list-item\"><strong><em>pillow:<\/em><\/strong> <em>pdf2image<\/em> and <em>pytesseract<\/em> use this image-processing library in the background (we do not see it used directly in the code) to process images correctly.<\/li>\n<li class=\"wp-block-list-item\"><strong><em>fpdf2:<\/em><\/strong> We use this library to automatically generate two test PDFs (Format A and Format B) via script for the article example.<\/li>\n<li class=\"wp-block-list-item\"><strong><em>ollama:<\/em><\/strong> This library allows our Python script to send messages to the LLM and receive responses.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Installing-Python-Libraries.png\" alt=\"This screenshot shows how you can install Python libraries.\" class=\"wp-image-659065\"\/><\/figure>\n<h4 class=\"wp-block-heading\">Step 4 \u2013 Install Ollama and Download LLaMA 3<\/h4>\n<p class=\"wp-block-paragraph\">Once the libraries have been installed successfully, we install <em>Ollama<\/em> and download <em>LLaMA 3<\/em> as the LLM. <em>Ollama<\/em> is the tool that allows us to run LLMs locally on our laptop, free of charge and without API keys.<\/p>\n<p class=\"wp-block-paragraph\">First, we install <em>Ollama<\/em>. If you have not already done this, you can download the Windows installer from <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ollama.com?utm_source=chatgpt.com\">Ollama<\/a> and run it.<\/p>\n<p class=\"wp-block-paragraph\">Afterwards, we download <em>LLaMA 3<\/em> using the following command:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">ollama pull llama3<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Depending on your internet connection, this step may take a while since roughly 4.7 GB are downloaded. 
However, we can follow a progress bar in the terminal.<\/p>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Llama-Download.png\" alt=\"This screenshot shows the download of ollama.\" class=\"wp-image-659066\"\/><\/figure>\n<p class=\"wp-block-paragraph\">Afterwards, we verify whether everything worked:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">ollama list<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If you see something similar to the screenshot, it worked.<\/p>\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Llama-success.png\" alt=\"If the ollama download was successful, you can see it in your terminal.\" class=\"wp-image-659067\"\/><\/figure>\n<h4 class=\"wp-block-heading\">Step 5 \u2013 Create the Project Folder and Generate Test PDFs<\/h4>\n<p class=\"wp-block-paragraph\">For this comparison, we create two B2B order forms for Alpha GmbH and Beta AG that contain the same information but use different layouts. In this example, we assume that the order forms are scans, which is why we previously installed <em>pdf2image<\/em> (for digital PDFs, this would also be possible with libraries such as <em>pdfplumber<\/em> or <em>PyMuPDF<\/em>).<\/p>\n<p class=\"wp-block-paragraph\">First, we create a project folder to store all files:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">mkdir document_extractor\ncd document_extractor<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Next, we create a new file called <em>create_test_pdfs.py<\/em> and insert the code from this GitHub Gist. 
We save this file contained in the beforehand created folder <em>document_extractor<\/em>:<\/p>\n<p class=\"wp-block-paragraph\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/gist.github.com\/Sari95\/a52a62eb78e0604c4d8c64f5cdd1160a\">https:\/\/gist.github.com\/Sari95\/a52a62eb78e0604c4d8c64f5cdd1160a<\/a><\/p>\n<p class=\"wp-block-paragraph\">Now we return to the terminal and execute the file:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">python create_test_pdfs.py<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Contained in the folder, we will now see the 2 newly created PDFs:<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Order-Forms-to-read-with-Regex-and-LLM-1024x580.png\" alt=\"This screenshot shows the 2 generated PDFs: One for Alpha GmbH and one for Beta AG.\" class=\"wp-image-659068\"\/><\/figure>\n<p class=\"wp-block-paragraph\">Within the two PDFs, we will already see the issue:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">They include the identical data.<\/li>\n<li class=\"wp-block-list-item\">However the PDFs use fully totally different discipline names and a distinct date format.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">Method 1: The Conventional Approach (<em>pytesseract<\/em> + Regex Guidelines)<\/h3>\n<p class=\"wp-block-paragraph\">The normal method works in two steps:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">First, we convert the PDF into a picture. Afterwards, we use <em>pytesseract<\/em> to learn the picture and extract the uncooked textual content by way of OCR (Optical Character Recognition). Put merely, OCR implies that the software \u201cappears\u201d on the picture and tries to acknowledge letters from pixels. 
Fairly just like how people decipher handwritten notes.<\/li>\n<li class=\"wp-block-list-item\">Within the second step, we use regex. These are common expressions that seek for particular patterns contained in the textual content. For instance, we will outline: \u201cSeek for every part that comes after <code>PO Quantity:<\/code>.\u201d<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">Already on this second step, we will establish the primary drawback: What occurs if the shopper merely writes \u201cOrder Reference\u201d as a substitute of \u201cPO Quantity: \u201c?<\/p>\n<p class=\"wp-block-paragraph\">In that case, the regex sample finds nothing. What we will then do (or should do) is add a brand new rule.<\/p>\n<h4 class=\"wp-block-heading\">Execute Script 1 for Method 1<\/h4>\n<p class=\"wp-block-paragraph\">Subsequent, we create a brand new file known as <em>approach1_traditional.py<\/em> with the next code that you&#8217;ll find within the GitHub-Gist inside the identical folder:<\/p>\n<p class=\"wp-block-paragraph\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/gist.github.com\/Sari95\/aa2be6938fbcb1c7f94b053d9046f55d\">https:\/\/gist.github.com\/Sari95\/aa2be6938fbcb1c7f94b053d9046f55d<\/a><\/p>\n<p class=\"wp-block-paragraph\">Now we execute the file once more contained in the terminal:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">python approach1_traditional.py<\/code><\/pre>\n<h4 class=\"wp-block-heading\">The Results of Method 1<\/h4>\n<p class=\"wp-block-paragraph\">For Format A, every part works completely:<\/p>\n<p class=\"wp-block-paragraph\">For Format B? 
Not a single discipline is acknowledged and all values return \u201cNone\u201d:<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Results-of-Approach-1-pytesseract-Regex-Rules-1024x770.png\" alt=\"It shows that with Regex Rules, it can read out the fields from Alpha GmbH perfectly, but it reads for Beta AG &quot;None&quot;.\" class=\"wp-image-659069\"\/><\/figure>\n<p class=\"wp-block-paragraph\">And that is precisely the place the issue lies. For each new buyer, new regex guidelines must be written, examined, and deployed. With 200 clients, which means 200 totally different patterns. And each time a buyer barely modifications their kind, the system breaks once more.<\/p>\n<h3 class=\"wp-block-heading\">Method 2: A New Approach (<em>pytesseract<\/em> +<em> Ollama<\/em> + <em>LLaMA 3<\/em>)<\/h3>\n<p class=\"wp-block-paragraph\">On this second method, we hold the OCR step, however exchange the inflexible regex guidelines with an LLM:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><em>pytesseract<\/em> nonetheless reads the textual content from the PDF.<\/li>\n<li class=\"wp-block-list-item\">As an alternative of telling the code \u201cSeek for PO Quantity: \u201d, we inform the LLM: \u201cRight here is an order doc. Extract these fields for me, no matter how they&#8217;re named.\u201d<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">The LLM understands the semantic context. 
It acknowledges that <em>\u201cOrder Reference\u201d<\/em> and <em>\u201cPO Quantity\u201d<\/em> imply the identical factor, even with out an express rule.<\/p>\n<h4 class=\"wp-block-heading\">Execute Script 2 for Method 2<\/h4>\n<p class=\"wp-block-paragraph\">Now, we create a brand new file known as <em>approach2_llm.py<\/em> with the next code that you&#8217;ll find within the GitHub-Gist inside the identical folder:<\/p>\n<p class=\"wp-block-paragraph\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/gist.github.com\/Sari95\/d4e9e83490a9fbf34a3776d1604f8742\">https:\/\/gist.github.com\/Sari95\/d4e9e83490a9fbf34a3776d1604f8742<\/a><\/p>\n<p class=\"wp-block-paragraph\">Now we execute the file once more contained in the terminal. Make it possible for Ollama continues to be working within the background:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">python approach2_llm.py<\/code><\/pre>\n<h4 class=\"wp-block-heading\">The Results of Method 2<\/h4>\n<p class=\"wp-block-paragraph\">What we will now see is that each layouts are accurately acknowledged:<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/Results-of-Approach-2-Results-of-Approach-1-pytesseract-Regex-Rules-1024x770.png\" alt=\"With a LLM, both Layouts can be read correctly.\" class=\"wp-image-659070\"\/><\/figure>\n<p class=\"wp-block-paragraph\">For each layouts, the data from the in a different way named fields is accurately extracted and assigned, although not a single regex expression was adjusted and no new template was created. The LLM understands each layouts as a result of it reads the context. 
Moreover, the date format from Format B is instantly normalized to match the format from Format A.<\/p>\n<h2 class=\"wp-block-heading\" id=\"pagejump2\">2 \u2013 Head-to-Head Comparability<\/h2>\n<p class=\"wp-block-paragraph\">After each exams, one factor rapidly turns into clear: Technically, each approaches clear up the identical drawback.<\/p>\n<p class=\"wp-block-paragraph\">Each approaches have their very own benefits and downsides:<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2026\/05\/LLM-vs-Regex-Comparison-1024x927.png\" alt=\"The table shows a comparison between the approach with Regex and the one with a LLM\" class=\"wp-image-659062\"\/><\/figure>\n<p class=\"wp-block-paragraph\">With regex-based pipelines, the complexity lives within the guidelines and upkeep effort. With LLM-based pipelines, the complexity shifts towards infrastructure, inference time, and mannequin conduct. For medium-sized firms processing many customer-specific layouts, that trade-off can develop into strategically extra vital than pure extraction accuracy.<\/p>\n<h2 class=\"wp-block-heading\" id=\"pagejump3\">3 \u2013 When ought to we NOT use an LLM?<\/h2>\n<p class=\"wp-block-paragraph\">In the meanwhile, it typically feels as if each current automation course of abruptly must be changed with AI or LLMs.<\/p>\n<p class=\"wp-block-paragraph\">In apply, nonetheless, this isn&#8217;t all the time the higher resolution. Particularly medium-sized firms often don&#8217;t must construct the \u201cmost trendy\u201d resolution, however relatively the one that continues to be secure, maintainable, and economically cheap in the long run. 
Depending on the situation, that may be the traditional regex-based method, while in other cases switching to an LLM may make more sense.<\/p>\n<p class=\"wp-block-paragraph\">Some situations where the traditional method may be the more suitable option:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>The documents are stable and standardized:<\/strong><br \/>If a company only processes a few known layouts and these rarely change, regex is often the better solution.\n<p>Why?<\/p>\n<p>Because the additional benefit of an LLM becomes small, while the overall system complexity increases.<\/p>\n<p>A stable rule-based process, on the other hand, is faster, cheaper, easier to debug, and easier to hand over to new people.<\/p>\n<\/li>\n<li class=\"wp-block-list-item\"><strong>Speed and throughput are critical:<\/strong><br \/>In our example, the LLM processes one document within 20\u201340 seconds.\n<p>At first, that sounds acceptable. But once we imagine ourselves in a real production environment, the perspective changes quickly.<\/p>\n<p>A medium-sized company probably processes orders, delivery notes, invoices, customs documents, support documents, and so on. And not 10 times per day, but 10,000 times per day.<\/p>\n<p>In this scenario, inference time suddenly becomes a real infrastructure concern. 
Regex-based methods run considerably sooner, whereas LLMs require extra RAM, extra CPU\/GPU energy, and infrequently further queueing or batch-processing mechanisms.<\/p>\n<\/li>\n<li class=\"wp-block-list-item\"><strong>Explainability is extra vital than flexibility:<\/strong><br \/>Particularly in regulated industries comparable to pharma, insurance coverage, banking, or healthcare, it&#8217;s typically crucial to completely perceive why a selected worth was extracted.\n<p>Regex guidelines are clearly deterministic: One line of code produces one clearly explainable outcome. LLMs, alternatively, work probabilistically: The mannequin <em>interprets<\/em> the context and returns the almost definitely outcome. That is precisely what makes LLMs versatile, however on the identical time additionally harder to audit.<\/p>\n<\/li>\n<li class=\"wp-block-list-item\"><strong>The corporate doesn&#8217;t have the suitable infrastructure:<\/strong><br \/>In our instance, we used Ollama. Getting began was typically easy. 
However, it shouldn&#8217;t be underestimated that reminiscence consumption, GPU assets, monitoring, or response instances beneath load can look very totally different when working with LLMs.<\/li>\n<\/ol>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<p class=\"wp-block-paragraph\"><em>On my<\/em> <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/sarahleaschrch.substack.com\/\"><em>Substack Knowledge Science Espresso<\/em><\/a><em>, I share sensible guides and bite-sized updates from the world of Knowledge Science, Python, AI, Machine Studying, and Tech \u2014 made for curious minds like yours.<\/em><\/p>\n<p class=\"wp-block-paragraph\"><em>Take a look and subscribe on <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/medium.com\/@schuerch_sarah\">Medium<\/a> or on Substack if you wish to keep within the loop.<\/em><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n<h2 class=\"wp-block-heading\" id=\"pagejump4\">4 \u2013 Closing Ideas<\/h2>\n<p class=\"wp-block-paragraph\">Choosing the proper method isn&#8217;t essentially a technical query, however relatively a strategic one.<\/p>\n<p class=\"wp-block-paragraph\">The normal method tries to explicitly describe each doable doc. The LLM-based method as a substitute tries to grasp that means and context. For small and secure environments, the standard method is commonly fully ample. The extra layouts and edge circumstances seem, the harder it turns into to maintain the foundations maintainable in the long run. 
That&#8217;s precisely the place LLMs begin to develop into attention-grabbing.<\/p>\n<p class=\"wp-block-paragraph\">It may also be an thrilling entry-level use case for a corporation to start out working with an LLM right here and, in doing so, make the corporate prepared for AI and acquire preliminary sensible expertise.<\/p>\n<h2 class=\"wp-block-heading\" id=\"pagejump5\">The place Can You Proceed Studying?<\/h2>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>state of affairs: You&#8217;re employed within the operations workforce of a medium-sized firm. On daily basis, your workforce processes order types from totally different B2B clients. All of them arrive as PDFs. And in idea, all of them include the identical data: buyer ID, buy order quantity, supply date, and the ordered objects. In apply, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":14747,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[3206,1007,4534,9063,74,4015],"class_list":["post-14745","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-b2b","tag-built","tag-document","tag-extractor","tag-llm","tag-rules"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14745","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=14745"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14745\/revisions"}],"predecessor-version":[{"id":14746,"h
ref":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14745\/revisions\/14746"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/14747"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=14745"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=14745"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=14745"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}