{"id":13022,"date":"2026-03-24T00:14:15","date_gmt":"2026-03-24T00:14:15","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=13022"},"modified":"2026-03-24T00:14:16","modified_gmt":"2026-03-24T00:14:16","slug":"construct-a-wise-monetary-assistant-with-llamaparse-and-gemini-3-1","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=13022","title":{"rendered":"Construct a wise monetary assistant with LlamaParse and Gemini 3.1"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p><img decoding=\"async\" class=\"banner-image\" src=\"https:\/\/storage.googleapis.com\/gweb-developer-goog-blog-assets\/images\/llamaindex_gemini-api_1.original.png\" alt=\"llamaindex_gemini-api (1)\"\/>  <\/p>\n<div class=\"inner-block-content rich-content\">\n<p data-block-key=\"szawa\">Extracting textual content from unstructured paperwork is a traditional developer headache. For many years, conventional Optical Character Recognition (OCR) techniques have struggled with advanced layouts, usually turning multi-column PDFs, embedded photographs, and nested tables into an unreadable mess of plain textual content.<\/p>\n<p data-block-key=\"4plgp\">As we speak, the multimodal capabilities of huge language fashions (LLMs) lastly make dependable doc understanding attainable.<\/p>\n<p data-block-key=\"5548k\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.cloud.llamaindex.ai\/llamaparse\/getting_started\">LlamaParse<\/a> bridges the hole between conventional OCR and vision-language agentic parsing. It delivers state-of-the-art textual content extraction throughout PDFs, displays, and pictures.<\/p>\n<p data-block-key=\"713p4\">On this publish, you&#8217;ll discover ways to use Gemini to energy LlamaParse, extract high-quality textual content and tables from unstructured paperwork, and construct an clever private finance assistant. As a reminder, Gemini fashions might make errors and shouldn&#8217;t be relied upon for skilled recommendation.<\/p>\n<h3 data-block-key=\"ai9ov\" id=\"why-llamaparse\"><b>Why LlamaParse?<\/b><\/h3>\n<p data-block-key=\"4jpk1\">In lots of circumstances, LLMs can already carry out this process successfully, nevertheless, when working with giant doc collections or extremely variable codecs, consistency and reliability can turn out to be more difficult.<\/p>\n<p data-block-key=\"6rrvf\">Devoted instruments like LlamaParse complement LLM capabilities by introducing preprocessing steps and customizable parsing directions, which assist construction advanced components comparable to giant tables or dense textual content. On the whole parsing benchmarks, this strategy has proven round a 13\u201315% enchancment in comparison with processing uncooked paperwork immediately.<\/p>\n<h3 data-block-key=\"mwrr6\" id=\"the-use-case:-parsing-brokerage-statements\"><b>The use case: parsing brokerage statements<\/b><\/h3>\n<p data-block-key=\"4ftb5\">Brokerage statements characterize the final word doc parsing problem. They include dense monetary jargon, advanced nested tables, and dynamic layouts.<\/p>\n<p data-block-key=\"4atar\">To assist customers perceive their monetary scenario, you want a workflow that not solely parses the file, however explicitly extracts the tables and explains the information via an LLM.<\/p>\n<p data-block-key=\"fscio\">Due to these superior reasoning and multimodal necessities, Gemini 3.1 Professional is the right match because the underlying mannequin. 
The workflow operates in four stages:

1. **Ingest:** You submit a PDF to the LlamaParse engine.
2. **Route:** The engine parses the document and emits a `ParsingDoneEvent`.
3. **Extract:** This event triggers two parallel tasks, text extraction and table extraction, which run concurrently to minimize latency.
4. **Synthesize:** Once both extractions complete, Gemini generates a human-readable summary.

This two-model architecture is a deliberate design choice: **Gemini 3.1 Pro** handles the hard layout comprehension during parsing, while **Gemini 3 Flash** handles the final summarization, optimizing for both accuracy and cost.

You can find the complete code for this tutorial in the [LlamaParse x Gemini demo GitHub repository](https://github.com/run-llama/llamaparse-gemini-demo).

### Setting up the environment

First, install the required Python packages for LlamaCloud, LlamaIndex workflows, and the Google GenAI SDK.

```shell
# with pip
pip install llama-cloud-services llama-index-workflows pandas google-genai

# with uv
uv add llama-cloud-services llama-index-workflows pandas google-genai
```

Next, export your API keys as environment variables. Get a Gemini API key from [AI Studio](https://ai.studio/api-keys) and a LlamaCloud API key from [the console](https://cloud.llamaindex.ai/). **Security note:** never hardcode your API keys in your application source code.

```shell
export LLAMA_CLOUD_API_KEY="your_llama_cloud_key"
export GEMINI_API_KEY="your_google_api_key"
```
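Both the parser and the Gemini client read these variables at startup, so it helps to fail fast if one is missing. The following check is purely illustrative and not part of the demo repository:

```python
import os

# Illustrative only: fail early if either API key is missing from the environment.
for key in ("LLAMA_CLOUD_API_KEY", "GEMINI_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"Missing required environment variable: {key}")
```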
### Step 1: Create and use the parser

The first step in your workflow is parsing. You create a LlamaParse client backed by Gemini 3.1 Pro and define it in [resources.py](https://github.com/run-llama/llamaparse-gemini-demo/blob/main/src/llamaparse_gemini/resources.py) so you can inject it into your workflow as a resource:

```python
def get_llama_parse() -> LlamaParse:
    return LlamaParse(
        api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
        parse_mode="parse_page_with_agent",
        model="gemini-3.1-pro",
        result_type=ResultType.MD,
    )
```

The `parse_page_with_agent` mode applies a layer of agentic iteration guided by Gemini to correct and format OCR results based on visual context.

In [workflow.py](https://github.com/run-llama/llamaparse-gemini-demo/blob/main/src/llamaparse_gemini/workflow.py), define the events, state, and the parsing step:

```python
class BrokerageStatementWorkflow(Workflow):
    @step
    async def parse_file(
        self,
        ev: FileEvent,
        ctx: Context[WorkflowState],
        parser: Annotated[LlamaParse, Resource(get_llama_parse)]
    ) -> ParsingDoneEvent | OutputEvent:
        result = cast(ParsingJobResult, (await parser.aparse(file_path=ev.input_file)))
        async with ctx.store.edit_state() as state:
            state.parsing_job_result = result
        return ParsingDoneEvent()
```

Notice that you don't process the parsing results immediately. Instead, you store them in the global `WorkflowState` so they're accessible to the extraction steps that follow.

### Step 2: Extract the text and tables

To provide the LLM with the context required to explain the financial statement, you need to extract the full markdown text and the tabular data. Add the extraction steps to your `BrokerageStatementWorkflow` class (see the full implementation in [workflow.py](https://github.com/run-llama/llamaparse-gemini-demo/blob/main/src/llamaparse_gemini/workflow.py)):

```python
@step
async def extract_text(self, ev: ParsingDoneEvent, ctx: Context[WorkflowState]) -> TextExtractionDoneEvent:
    # Extraction logic omitted for brevity. See repo.

@step
async def extract_tables(self, ev: ParsingDoneEvent, ctx: Context[WorkflowState], ...) -> TablesExtractionDoneEvent:
    # Extraction logic omitted for brevity. See repo.
```
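Conceptually, each step reads the stored parse result, pulls out what it needs, writes it back to the shared state, and emits its done event. The following is a rough sketch of that pattern, not the repository's implementation: it assumes the parse result exposes per-page markdown via `page.md` and that `WorkflowState` has `extracted_text` and `extracted_tables` fields; the exact attribute names live in the repo.

```python
@step
async def extract_text(
    self, ev: ParsingDoneEvent, ctx: Context[WorkflowState]
) -> TextExtractionDoneEvent:
    # Sketch: join the per-page markdown produced by LlamaParse.
    # Assumes page.md and WorkflowState.extracted_text exist; see the repo for the real fields.
    async with ctx.store.edit_state() as state:
        pages = state.parsing_job_result.pages
        state.extracted_text = "\n\n".join(page.md for page in pages)
    return TextExtractionDoneEvent()

@step
async def extract_tables(
    self, ev: ParsingDoneEvent, ctx: Context[WorkflowState]
) -> TablesExtractionDoneEvent:
    # Sketch: collect markdown table rows (pipe-delimited lines) from each page.
    # The repo's implementation is more robust; this only illustrates the parallel-step pattern.
    async with ctx.store.edit_state() as state:
        pages = state.parsing_job_result.pages
        tables = []
        for page in pages:
            table_lines = [line for line in page.md.splitlines() if line.strip().startswith("|")]
            if table_lines:
                tables.append("\n".join(table_lines))
        state.extracted_tables = tables
    return TablesExtractionDoneEvent()
```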
Because both steps listen for the same `ParsingDoneEvent`, LlamaIndex Workflows automatically executes them **in parallel**. This means your text and table extractions run concurrently, cutting overall pipeline latency and making the architecture naturally scalable as you add more extraction tasks.

### Step 3: Generate the summary

With the data extracted, you can prompt Gemini to generate a summary in accessible, non-technical language.

Configure the LLM client and prompt template in [resources.py](https://github.com/run-llama/llamaparse-gemini-demo/blob/main/src/llamaparse_gemini/resources.py). Here, you use **Gemini 3 Flash** for the final summarization, since it offers low latency and cost efficiency for text aggregation tasks.

The final synthesis step uses `ctx.collect_events` to wait for both extractions to complete before calling the Gemini API.

```python
@step
async def ask_llm(
    self,
    ev: TablesExtractionDoneEvent | TextExtractionDoneEvent,
    ctx: Context[WorkflowState],
    llm: Annotated[GenAIClient, Resource(get_llm)],
    template: Annotated[Template, Resource(get_prompt_template)]
) -> OutputEvent | None:
    if ctx.collect_events(ev, [TablesExtractionDoneEvent, TextExtractionDoneEvent]) is None:
        return None
    # Full prompt and LLM call available in the repo.
```
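The omitted portion fills the prompt template with the extracted content and calls Gemini. Roughly, and under several assumptions (a `string.Template`-style prompt, the `extracted_text` / `extracted_tables` state fields from the sketch above, a placeholder Gemini 3 Flash model id, and an `OutputEvent.summary` field), the completed step might look like this; the repo is the source of truth for all of these names:

```python
@step
async def ask_llm(
    self,
    ev: TablesExtractionDoneEvent | TextExtractionDoneEvent,
    ctx: Context[WorkflowState],
    llm: Annotated[GenAIClient, Resource(get_llm)],
    template: Annotated[Template, Resource(get_prompt_template)]
) -> OutputEvent | None:
    # Buffer events until both extraction results have arrived.
    if ctx.collect_events(ev, [TablesExtractionDoneEvent, TextExtractionDoneEvent]) is None:
        return None

    state = await ctx.store.get_state()
    # Fill the prompt with the extracted markdown and tables (field names assumed).
    prompt = template.substitute(
        text=state.extracted_text,
        tables="\n\n".join(state.extracted_tables),
    )
    # Async call through the google-genai client; the exact model id is an assumption.
    response = await llm.aio.models.generate_content(
        model="gemini-3-flash",
        contents=prompt,
    )
    return OutputEvent(summary=response.text)
```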
class=\"inner-block-content rich-content\">\n<p data-block-key=\"szawa\">You now have a completely useful private finance assistant operating in your terminal, able to analyzing advanced monetary PDFs.<\/p>\n<h3 data-block-key=\"kdfmu\" id=\"next-steps\"><b>Subsequent steps<\/b><\/h3>\n<p data-block-key=\"akcgd\">AI pipelines are solely pretty much as good as the information you feed them. By combining Gemini 3.1 Professional&#8217;s multimodal reasoning with LlamaParse&#8217;s agentic ingestion, you guarantee your functions have the total, structured context they want \u2014 not simply flattened textual content.<\/p>\n<p data-block-key=\"79m67\">Once you base your structure on event-driven statefulness, just like the parallel extractions demonstrated right here, you construct techniques which are quick, scalable, and resilient. Double-check outputs earlier than counting on them.<\/p>\n<p data-block-key=\"5bklq\">Able to implement this in manufacturing? Discover <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.cloud.llamaindex.ai\/llamaparse\/getting_started\">LlamaParse<\/a> and the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/docs\">Gemini API documentation<\/a> to experiment with multimodal technology, and dive into the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/run-llama\/llamaparse-gemini-demo\">full code within the GitHub repository<\/a>.<\/p>\n<\/div><\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Extracting textual content from unstructured paperwork is a traditional developer headache. For many years, conventional Optical Character Recognition (OCR) techniques have struggled with advanced layouts, usually turning multi-column PDFs, embedded photographs, and nested tables into an unreadable mess of plain textual content. 
To test the workflow, download a sample statement from the LlamaIndex datasets:

```shell
curl -L https://raw.githubusercontent.com/run-llama/llama-datasets/main/llama_agents/bank_statements/brokerage_statement.pdf > brokerage_statement.pdf
```

```shell
# Using pip
python3 main.py brokerage_statement.pdf

# Using uv
uv run run-workflow brokerage_statement.pdf
```

You now have a fully functional personal finance assistant running in your terminal, capable of analyzing complex financial PDFs.

### Next steps

AI pipelines are only as good as the data you feed them. By combining Gemini 3.1 Pro's multimodal reasoning with LlamaParse's agentic ingestion, you ensure your applications have the full, structured context they need, not just flattened text.

When you base your architecture on event-driven statefulness, like the parallel extractions demonstrated here, you build systems that are fast, scalable, and resilient. Double-check outputs before relying on them.

Ready to implement this in production? Explore [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started) and the [Gemini API documentation](https://ai.google.dev/docs) to experiment with multimodal generation, and dive into the [full code in the GitHub repository](https://github.com/run-llama/llamaparse-gemini-demo).