Build interactive PDF text extraction from Amazon S3

Picture this: a compliance officer needs a specific clause during an audit, an attorney needs contract terms while a client waits on the phone, or a finance analyst needs numbers from last quarter’s report before a meeting that starts in 10 minutes. In each case, waiting for a scheduled job to finish isn’t practical. You need on-demand access to the text inside your PDFs.

In this post, you’ll build a server that extracts text from PDF files in Amazon S3 in real time. This protocol-based approach provides programmatic document access. You’ll walk through the architecture, set up the server, and run interactive document queries. Along the way, you’ll compare this approach with Amazon Textract so you can decide which tool fits your workload.

We built this solution after working with several teams who shared the same frustration: their documents lived in Amazon S3, but getting text out of them on demand meant either writing custom scripts or waiting on batch pipelines. This MCP server approach sits in between, giving you interactive access with minimal setup. Interactive PDF text extraction from Amazon S3 gives you real-time answers from your documents without batch pipelines or heavy infrastructure.

This MCP-based option works well for text-based PDFs in development and proof of concept settings. For complex document processing like optical character recognition (OCR), form extraction, and layout analysis, Amazon Textract remains the recommended choice.

Who benefits from this approach

This solution fits several common roles. If these scenarios sound like your day-to-day, read on.

Compliance and legal teams: During a time-sensitive review, you need to find a specific clause buried in a 200-page policy document or contract. Searching manually takes too long. With this solution, you ask a question in natural language and get the relevant passage back in seconds.

Financial services teams: During an audit session, you need immediate access to the exact wording of an internal risk policy or regulatory filing. This solution lets you pull that information directly from your Amazon S3 document repository without leaving your terminal.

Executive teams: During strategic planning meetings, you can query a PDF on the spot when someone asks about a data point from last quarter’s earnings report. No flipping through printed copies or waiting for someone to look it up after the meeting.

These scenarios share several common traits: they involve real-time information needs where batch processing is too slow, text-based PDF documents with standard formatting, cost sensitivity in development and proof of concept environments, and integration requirements with existing AWS workflows and tooling.

Where this approach fits alongside Amazon Textract

Amazon Textract is a fully managed AWS AI service purpose-built for document processing at scale. It handles scanned pages, handwriting, and multi-column layouts. Choose Amazon Textract when you need OCR for scanned documents, advanced form and table extraction, complex layout analysis, production-scale batch processing with service level agreement (SLA) requirements, or compliance features and enterprise support.

The MCP-based approach addresses a complementary scenario: giving an AI assistant interactive, on-demand access to text already encoded inside PDFs. Choose this pattern when your documents are text-based PDFs (no OCR required), your workflow is interactive rather than batch, you are working in development or proof of concept environments, and you want minimal infrastructure between the AI assistant and the source document. For everything else, including any document processing that benefits from OCR or structured extraction, route the work to Amazon Textract.

How the solution works

With this solution, you connect your AI assistant directly to your PDF documents in Amazon S3 and can get answers quickly. Under the hood, the solution uses the Model Context Protocol (MCP), an open standard that provides a structured way to access external data sources. MCP acts as a communication layer between your tool and your data. The architecture has four components: a command-line interface as the user interface, the MCP layer for communication, a custom MCP server for PDF processing, and Amazon S3 for document storage, secured by AWS Identity and Access Management (AWS IAM).
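
To make the communication layer concrete, the following minimal sketch builds the kind of JSON-RPC 2.0 message an MCP client sends when it invokes a tool. MCP tool invocations use the tools/call method; the tool name extract_s3_pdf_text and its bucket and key arguments come from the server code in this post, while build_tool_call is a hypothetical helper used only for illustration.

```python
import json


def build_tool_call(request_id: int, bucket: str, key: str) -> dict:
    """Build a JSON-RPC 2.0 'tools/call' request for the PDF extraction tool.

    build_tool_call is an illustrative helper, not part of any SDK.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "extract_s3_pdf_text",
            "arguments": {"bucket": bucket, "key": key},
        },
    }


if __name__ == "__main__":
    # Placeholder bucket and key; substitute your own values.
    print(json.dumps(build_tool_call(1, "your-bucket-name", "sample.pdf"), indent=2))
```

In practice the AI assistant and the MCP client library assemble this message for you. The sketch only shows what travels between the client and the server.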

Cost comparison

Choose the approach that fits your budget and requirements. For about 10,000 text-based PDF pages per month in a proof of concept environment, here is how the two approaches compare:

These two figures are price points for different feature sets and shouldn’t be read as a head-to-head price comparison. Use them to pick the right tool for the workload, not to optimize purely on dollars. If your workload involves scanned documents, forms, tables, complex layouts, or production SLAs, Amazon Textract is the appropriate choice and the additional capabilities are reflected in its price.

Amazon Textract scope: page-level processing, OCR-ready, form and table extraction, layout understanding, enterprise SLAs

Indicative monthly cost: Amazon Textract processing approximately $15, Amazon S3 storage $2, AWS Lambda compute $1, and large language model (LLM) token processing approximately $5 to $10, for a total of approximately $23 to $28.

MCP server scope: direct text extraction from PDFs whose text is already encoded; no managed processing service involved

Indicative monthly cost: Amazon S3 storage $2 and data transfer $0.50, for a total of approximately $2.50.

All cost figures are illustrative and may change. Refer to the official AWS pricing pages for current rates.
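
The totals above can be checked with a few lines of arithmetic. This sketch uses only the illustrative figures quoted in this section; the dictionary names and the total_range helper are hypothetical, and the numbers are not current AWS prices.

```python
# Illustrative monthly figures from this section (USD), not live AWS pricing.
TEXTRACT_FIXED = {
    "Amazon Textract processing": 15.00,
    "Amazon S3 storage": 2.00,
    "AWS Lambda compute": 1.00,
}
LLM_TOKEN_RANGE = (5.00, 10.00)  # low and high estimate for LLM token processing

MCP_FIXED = {
    "Amazon S3 storage": 2.00,
    "Data transfer": 0.50,
}


def total_range(fixed: dict, variable: tuple = (0.0, 0.0)) -> tuple:
    """Return the (low, high) monthly total for fixed items plus a variable range."""
    base = sum(fixed.values())
    return base + variable[0], base + variable[1]


if __name__ == "__main__":
    print("Textract-based pipeline:", total_range(TEXTRACT_FIXED, LLM_TOKEN_RANGE))
    print("MCP server:", total_range(MCP_FIXED))
```

Swapping in your own page volumes and current rates gives a quick sanity check before you commit to either approach.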

Architecture overview

The following sequence diagram illustrates the end-to-end workflow for extracting text from a PDF stored in Amazon S3. The process begins when the AI client initiates a request for PDF extraction through the CLI. The system forwards this request to the MCP server, which retrieves the PDF file from Amazon S3 using the provided bucket and object key.

After the MCP server fetches the PDF, it passes the file to a PDF parsing component. The component processes the document and extracts the text content. The MCP server then returns the extracted text to the client, and the client displays it to the user.
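
The sequence described above can be sketched as a small function that takes the fetch and parse steps as parameters. This is an illustration of the flow only, not the server implementation: handle_extraction_request, fake_fetch, and fake_parse are hypothetical names, and the stand-ins replace the real Amazon S3 download and PDF parser so the sketch runs without AWS access.

```python
from typing import Callable


def handle_extraction_request(
    bucket: str,
    key: str,
    fetch_pdf: Callable[[str, str], bytes],
    parse_pdf: Callable[[bytes], str],
) -> str:
    """Mirror the request flow: fetch the object, parse it, return the text."""
    pdf_bytes = fetch_pdf(bucket, key)  # MCP server retrieves the PDF from Amazon S3
    text = parse_pdf(pdf_bytes)         # PDF parsing component extracts the text
    return text                         # MCP server returns the text to the client


# Stand-ins so the flow can be exercised locally (assumptions, not real components).
def fake_fetch(bucket: str, key: str) -> bytes:
    return f"%PDF-1.4 fake content for s3://{bucket}/{key}".encode()


def fake_parse(pdf_bytes: bytes) -> str:
    return pdf_bytes.decode().removeprefix("%PDF-1.4 ")


if __name__ == "__main__":
    print(handle_extraction_request("your-bucket-name", "sample.pdf", fake_fetch, fake_parse))
```

Keeping the fetch and parse steps injectable is one way to unit test the flow without network calls.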

Step-by-step implementation

Follow these steps to set up and configure the PDF text extraction solution. Begin by confirming you have the required prerequisites in place.

Prerequisites

Before you begin, confirm that you have the following items ready. You’ll also need basic familiarity with Python programming and AWS services.

An AWS account with Amazon S3 read permissions.
Python 3.10 or later installed.
AWS Command Line Interface (AWS CLI) configured with valid credentials.
Kiro CLI installed.
The boto3, PyPDF2, and mcp Python packages installed:
```
pip install boto3 PyPDF2 mcp
```
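
As a quick sanity check, the following sketch reports whether the local Python version and the three packages are available. It uses only the standard library; check_prerequisites is a hypothetical helper, and it doesn't verify AWS credentials or the Kiro CLI.

```python
import sys
from importlib.util import find_spec

# Import names of the packages installed with pip above.
REQUIRED_MODULES = ("boto3", "PyPDF2", "mcp")


def check_prerequisites(min_version: tuple = (3, 10)) -> dict:
    """Return a report of which local prerequisites are satisfied."""
    report = {"python_ok": sys.version_info >= min_version}
    for module in REQUIRED_MODULES:
        # find_spec returns None when a top-level module can't be found.
        report[module] = find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```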

Installation

This section guides you through installing the MCP server and its dependencies. The process involves creating a Python virtual environment, installing the required packages, and creating the server file. Follow these steps in order. Run each command in your terminal.

Before you start, you need:

Python 3.10 or newer installed on your machine.
The Kiro CLI installed and logged in.
AWS credentials set up on your machine (run aws configure if you haven’t).
An S3 bucket that contains at least one PDF file.

Step 1: Create a folder for the project

Run these two commands in your terminal:

Step 2: Navigate to the project folder

Run this command:

Step 3: Create a Python virtual environment

Run this command:

Step 4: Activate the virtual environment

Run this command:

After this, your terminal prompt will show (venv) at the beginning. Keep this terminal open. You need to stay in this virtual environment for the next steps.
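
Steps 1 through 3 can also be scripted with the Python standard library. The following is a minimal sketch under two assumptions: the project folder is ~/s3-pdf-extractor (the path used later in this post) and the environment directory is named venv (matching the (venv) prompt). The create_project helper is a hypothetical name.

```python
import venv
from pathlib import Path


def create_project(project_dir: Path, with_pip: bool = True) -> Path:
    """Create the project folder and a virtual environment inside it.

    Returns the path of the virtual environment directory.
    """
    project_dir.mkdir(parents=True, exist_ok=True)
    env_dir = project_dir / "venv"  # assumed environment name
    venv.create(env_dir, with_pip=with_pip)
    return env_dir
```

Calling create_project(Path.home() / "s3-pdf-extractor") builds both directories. Activation still happens in your shell, because a script can’t change the environment of the terminal that launched it. On macOS and Linux the usual command is source venv/bin/activate, run from inside the project folder.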

Step 5: Install the required Python packages

Run this one command:

pip install mcp boto3 PyPDF2

Wait for it to finish. It should end with “Successfully installed…”.

Step 6: Create the server file

Inside the ~/s3-pdf-extractor folder, create a new file named exactly:

s3_pdf_extractor.py

Paste the following code into that file and save it:

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
import asyncio
import boto3
from PyPDF2 import PdfReader
import tempfile
import os
import logging

# Configure logging for production use. The default handler writes to stderr,
# which keeps stdout free for the MCP stdio transport.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

server = Server("s3-pdf-extractor")

@server.list_tools()
async def list_tools():
    return [
        Tool(
            name="extract_s3_pdf_text",
            description="Extract text content from a PDF stored in Amazon S3",
            inputSchema={
                "type": "object",
                "properties": {
                    "bucket": {"type": "string", "description": "S3 bucket name"},
                    "key": {"type": "string", "description": "S3 object key"}
                },
                "required": ["bucket", "key"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "extract_s3_pdf_text":
        bucket = arguments["bucket"]
        key = arguments["key"]
        tmp_path = None

        try:
            # Use existing AWS credentials and IAM permissions
            s3_client = boto3.client('s3')

            with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
                tmp_path = tmp_file.name
                s3_client.download_file(bucket, key, tmp_path)

            # Extract text using PyPDF2
            reader = PdfReader(tmp_path)
            text = ""
            for page in reader.pages:
                # Guard against pages that yield no text
                text += (page.extract_text() or "") + "\n"

            logger.info(f"Successfully extracted text from {bucket}/{key}")
            return [TextContent(type="text", text=text)]

        except Exception as e:
            logger.error(f"Error processing {bucket}/{key}: {str(e)}")
            raise
        finally:
            # Ensure cleanup of temporary files
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)

    # Reject tool names this server does not implement
    raise ValueError(f"Unknown tool: {name}")

async def main():
    # Serve over stdio, the transport used when a client launches the server as a subprocess
    async with stdio_server() as (read_stream, write_stream):
        await server.run(read_stream, write_stream, server.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())

Step 7: Test that the server starts

In your terminal (still inside the s3-pdf-extractor folder with the venv active), run:

python s3_pdf_extractor.py

The terminal will appear to “pause” with no output. That’s correct. It means the server is running and waiting for requests. Press Ctrl+C to stop it.

If you see an error instead, re-check Steps 2 and 3.

Step 8: Locate or create the Kiro CLI configuration file

Kiro CLI uses a JSON configuration file to know which MCP servers are available. You need to add your server to this file.

The Kiro CLI MCP configuration file is located at:

~/.kiro/settings/instruments/mcp.json

If this file doesn’t exist, create it by running these commands in your terminal:

mkdir -p ~/.kiro/settings/instruments
nano ~/.kiro/settings/instruments/mcp.json

Step 9: Add the MCP server configuration

Paste the following JSON into the file. Replace /path/to/s3_pdf_extractor.py with the actual path from Step 1 (for example, ~/s3-pdf-extractor/s3_pdf_extractor.py):

{
    "mcpServers": {
        "s3-pdf-extractor": {
            "command": "python",
            "args": ["/path/to/s3_pdf_extractor.py"]
        }
    }
}

To get the full absolute path, run echo ~/s3-pdf-extractor/s3_pdf_extractor.py in your terminal and use that output in the args field.
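
If you prefer to generate the configuration rather than edit it by hand, the following sketch builds the same JSON structure with the absolute path filled in. The build_mcp_config helper is a hypothetical name. Passing sys.executable as the command is an optional variation, not something this post prescribes: when the script runs inside the virtual environment, it points the entry at the interpreter that has the packages installed.

```python
import json
import sys
from pathlib import Path


def build_mcp_config(server_script: Path, python_command: str = "python") -> dict:
    """Return the mcpServers entry with an absolute path to the server script."""
    absolute_script = server_script.expanduser().resolve()
    return {
        "mcpServers": {
            "s3-pdf-extractor": {
                "command": python_command,
                "args": [str(absolute_script)],
            }
        }
    }


if __name__ == "__main__":
    script = Path("~/s3-pdf-extractor/s3_pdf_extractor.py")
    # Print the JSON so you can review it before pasting it into the config file.
    print(json.dumps(build_mcp_config(script, python_command=sys.executable), indent=4))
```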

Step 10: Save the configuration file

Press Ctrl+O, then press Enter to save the file.

Step 11: Close the file editor

Press Ctrl+X to exit nano.

Step 12: Restart Kiro CLI

Restart Kiro CLI to load the new configuration. Close and reopen Kiro CLI, or run:

Step 13: Verify the MCP server connection

Verify the connection by running a test extraction in Kiro CLI:

extract text from s3://your-bucket-name/sample.pdf
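
The tool itself takes a separate bucket and key, so the s3:// URI in the prompt has to be split into those two parts. The AI assistant normally does this mapping for you; the sketch below only illustrates the split, and parse_s3_uri is a hypothetical helper.

```python
def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the object key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must include both a bucket and a key: {uri!r}")
    return bucket, key


if __name__ == "__main__":
    print(parse_s3_uri("s3://your-bucket-name/sample.pdf"))
```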

Security considerations

Security is built in from the beginning, not added as an afterthought. Here is how the solution handles it:

IAM integration: The solution uses your existing AWS credentials. You don’t need to create or manage separate API keys.
Least privilege access: You grant only Amazon S3 read permissions, scoped to the specific buckets that contain your PDF documents. Nothing more.
Temporary storage: The server deletes downloaded files automatically after it completes processing. No PDF data lingers on the local file system.
No data persistence: Text extraction occurs on demand without storing results.
Audit trail: AWS CloudTrail can log Amazon S3 object-level requests for your account when data event logging is enabled for the bucket.
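
As an illustration of least privilege access, the following sketch generates an IAM policy document that allows only s3:GetObject on one bucket, optionally narrowed to a key prefix. The read_only_pdf_policy helper, the statement ID, and the example bucket and prefix are assumptions for illustration; adapt the resource ARN to your own bucket and review the policy against your organization’s standards before you attach it.

```python
import json


def read_only_pdf_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document that permits reading objects from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadPdfObjects",  # illustrative statement ID
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket name and prefix.
    print(json.dumps(read_only_pdf_policy("your-bucket-name", "contracts/"), indent=2))
```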

Performance and limitations

Here is what to expect in terms of performance:

The server processes documents in real time. For a typical 50-page text-based PDF, results are generally available in a few seconds, making it practical for interactive workflows where you ask follow-up questions.
Processing time scales linearly with document size. A 10-page document processes roughly 5 times faster than a 50-page one.
Memory usage is proportional to document size. For most text-based PDFs under 100 pages, memory consumption stays well within typical development machine limits.

This approach has clear limits. Know them before you commit:

Text-based PDFs only. If your documents are scanned images or photos of paper, the server cannot read them. Amazon Textract handles these cases natively with OCR.
No OCR capability. The server reads embedded text from the PDF file format. It cannot interpret pixels in an image.
Limited layout understanding. The server performs straightforward text extraction. It does not reconstruct tables, columns, or complex page layouts. Amazon Textract handles this natively.
No form processing. If your PDFs contain fillable form fields or structured data, the server does not extract those elements. Amazon Textract handles this natively.
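
One practical consequence of these limits is that a scanned PDF tends to come back with little or no text. The following heuristic is a sketch, not part of the server: it takes the per-page strings returned by a PDF parser and estimates whether the document is text-based. The looks_text_based name and both thresholds are assumptions you would tune on your own documents.

```python
from typing import Optional, Sequence


def looks_text_based(
    page_texts: Sequence[Optional[str]],
    min_chars_per_page: int = 20,
    min_fraction: float = 0.5,
) -> bool:
    """Return True when enough pages contain a meaningful amount of embedded text."""
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts
        if text and len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_fraction


if __name__ == "__main__":
    digital = ["Section 1. Indemnification applies to ...", "Section 2. Term and renewal ..."]
    scanned = ["", None, "   "]
    print(looks_text_based(digital), looks_text_based(scanned))
```

A document that fails this check is a candidate for Amazon Textract rather than direct extraction.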

Real-world use cases

These capabilities translate directly into measurable outcomes across industries. Whether it’s legal teams retrieving contract clauses mid-call, compliance officers locating policy language during audits, or executives pulling earnings data in real time, the solution removes the friction of manual document search. The following examples show how different teams put it to work.

Legal services firm

A mid-sized law firm adopted this solution for contract review. Their attorneys used to spend 15 to 20 minutes searching through PDF contracts to find specific indemnification clauses during client calls. That meant putting the client on hold or promising to call back later. Now they type a question into Kiro CLI and get the relevant passage in seconds. The firm reports that research time during client calls was significantly reduced.

Financial services compliance

A regional bank deployed the solution for regulatory examinations. During audits, compliance officers need to locate specific policy language quickly. Previously, they bookmarked key sections manually across dozens of PDF files, which was error-prone and hard to maintain as policies changed. With the MCP server connected to their S3 document repository, they now pull up the exact paragraph an examiner asks about in real time.

Corporate strategy team

An enterprise leadership team uses the solution during quarterly strategy meetings. When a board member asks about a specific metric from the previous quarter’s earnings report, the team queries the PDF on the spot instead of flipping through printed copies. This keeps discussions moving and grounded in actual data.

Scaling and enhancement options

This solution is a starting point. As your needs grow, you can extend it. Start with caching if your team accesses the same documents repeatedly. Consider batch processing when you need to handle hundreds of documents at once. Add vector search when keyword matching is no longer sufficient.

Specifically, you can extend the solution in these ways:

Add caching with Amazon DynamoDB for frequently accessed documents.
Implement batch processing with Amazon Simple Queue Service (Amazon SQS) for bulk operations.
Integrate vector search with Amazon OpenSearch Service for semantic document discovery.
Create hybrid workflows that route complex documents to Amazon Textract automatically.
Add monitoring with Amazon CloudWatch to track usage patterns and error rates.
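
To show the shape of the caching idea, here is a minimal in-process sketch. It is a stand-in for a DynamoDB-backed cache, not an implementation of one: ExtractionCache is a hypothetical class, the store is a plain dictionary, and the entry is keyed on the object’s ETag (an assumption) so that a changed document is extracted again.

```python
from typing import Callable


class ExtractionCache:
    """Cache extracted text per (bucket, key, etag) so repeat requests skip extraction."""

    def __init__(self, extract: Callable[[str, str], str]):
        self._extract = extract  # function that performs the real extraction
        self._store: dict[tuple[str, str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def get_text(self, bucket: str, key: str, etag: str) -> str:
        cache_key = (bucket, key, etag)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        # Cache miss: run the extraction once and remember the result.
        self.misses += 1
        text = self._extract(bucket, key)
        self._store[cache_key] = text
        return text
```

A persistent version would replace the dictionary with reads and writes to a table, using the same composite key.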

Cleanup

When you’re done testing or want to remove the solution, follow these steps to avoid unnecessary costs.

Stop the MCP server: Press Ctrl+C in the terminal where the server is running.
Remove the MCP configuration: Open your Kiro CLI MCP configuration file (~/.kiro/settings/instruments/mcp.json) and delete the s3-pdf-extractor entry. Save and close the file.
Delete the project files: Remove the project directory and all its contents:
```
rm -rf ~/s3-pdf-extractor
```
Warning: This command permanently deletes all files in the directory without confirmation. Make sure you have saved any changes before proceeding.
Clean up S3 resources (optional): If you created test PDFs in Amazon S3 specifically for this walkthrough, delete the test files or the test bucket using the Amazon S3 console or the AWS CLI:
```
aws s3 rm s3://your-bucket-name/test-file.pdf
```
Only delete resources you created for testing.
Review IAM permissions (optional): Navigate to the IAM console and remove any S3 read permissions added specifically for this solution. Keep permissions that other workflows depend on.
Verify cleanup: Verify the directory no longer exists:
Expected output: No such file or directory

After cleanup, you’ll no longer incur S3 storage and data transfer charges for the resources you deleted. For detailed pricing information, see Amazon S3 Pricing. If you want to redeploy later, repeat the installation steps. All code and configuration examples remain in this document.

Conclusion

In this post, you built an MCP server that extracts text from PDF files in Amazon S3 in real time. You walked through the architecture, compared costs with Amazon Textract, and saw how three different teams put this approach to work. The pattern follows a clear principle: connect your AI assistant to your documents, keep the infrastructure minimal, and scale up only when the workload demands it.

In summary, the MCP server pattern is a focused, interactive complement to Amazon Textract. Use it when an AI assistant needs to read text-based PDFs in real time. When your needs include OCR, forms, tables, or production-scale processing, Amazon Textract is the AWS service designed for that work, and the two approaches fit cleanly together. This is exactly the pattern shown in the hybrid workflow option earlier in this post.

Next steps:

Evaluate your use case against the criteria in the “Where this approach fits alongside Amazon Textract” section.
Deploy the solution in your development environment by following the Installation section in this post. Test with 5 to 10 representative documents to establish baseline performance.
Explore Amazon Textract for OCR capabilities, or learn more about Kiro CLI integration as your requirements evolve.
If you try this solution or adapt it to your own use case, we’d love to hear about it in the comments.

To learn more, explore the following resources: