Extracting text from unstructured documents is a classic developer headache. For years, traditional Optical Character Recognition (OCR) systems have struggled with complex layouts, often turning multi-column PDFs, embedded images, and nested tables into an unreadable mess of plain text.
Today, the multimodal capabilities of large language models (LLMs) finally make reliable document understanding possible.
LlamaParse bridges the gap between traditional OCR and vision-language agentic parsing. It delivers state-of-the-art text extraction across PDFs, presentations, and images.
In this post, you'll learn how to use Gemini to power LlamaParse, extract high-quality text and tables from unstructured documents, and build an intelligent personal finance assistant. As a reminder, Gemini models may make mistakes and shouldn't be relied upon for professional advice.
Why LlamaParse?
In many cases, LLMs can already perform this task effectively; however, when working with large document collections or highly variable formats, consistency and reliability become harder to maintain.
Dedicated tools like LlamaParse complement LLM capabilities by introducing preprocessing steps and customizable parsing instructions, which help structure complex elements such as large tables or dense text. On general parsing benchmarks, this approach has shown around a 13-15% improvement compared to processing raw documents directly.
The use case: parsing brokerage statements
Brokerage statements represent the ultimate document parsing challenge. They contain dense financial jargon, complex nested tables, and dynamic layouts.
To help users understand their financial situation, you need a workflow that not only parses the file, but explicitly extracts the tables and explains the data through an LLM.
Because of these advanced reasoning and multimodal requirements, Gemini 3.1 Pro is the perfect fit as the underlying model. It balances a massive context window with native spatial layout comprehension.
The workflow operates in four stages:
- Ingest: You submit a PDF to the LlamaParse engine.
- Route: The engine parses the document and emits a ParsingDoneEvent.
- Extract: This event triggers two parallel tasks, text extraction and table extraction, that run concurrently to minimize latency.
- Synthesize: Once both extractions complete, Gemini generates a human-readable summary.
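The events that drive these stages can be sketched as plain classes. The snippet below uses stdlib dataclasses as stand-ins; in the actual project these subclass the event types provided by llama-index-workflows, and the field names are assumptions based on how the events are used later in this post:

```python
from dataclasses import dataclass

# Stdlib stand-ins for the workflow's events; the real project defines
# these on top of llama-index-workflows event base classes.
@dataclass
class FileEvent:
    input_file: str  # path to the PDF to ingest

@dataclass
class ParsingDoneEvent:
    pass  # signals that LlamaParse has finished parsing

@dataclass
class TextExtractionDoneEvent:
    pass

@dataclass
class TablesExtractionDoneEvent:
    pass

@dataclass
class OutputEvent:
    summary: str  # the final human-readable result
```

Each stage consumes one event type and emits the next, which is what lets the workflow engine wire the steps together automatically.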
This two-model architecture is a deliberate design choice: Gemini 3.1 Pro handles the hard layout comprehension during parsing, while Gemini 3 Flash handles the final summarization, optimizing for both accuracy and cost.
You can find the complete code for this tutorial in the LlamaParse x Gemini demo GitHub repository.
Setting up the environment
First, install the required Python packages for LlamaCloud, LlamaIndex workflows, and the Google GenAI SDK.
# with pip
pip install llama-cloud-services llama-index-workflows pandas google-genai

# with uv
uv add llama-cloud-services llama-index-workflows pandas google-genai
Shell
Next, export your API keys as environment variables. Get a Gemini API key from AI Studio, and a LlamaCloud API key from the console. Security note: Never hardcode your API keys in your application source code.
export LLAMA_CLOUD_API_KEY="your_llama_cloud_key"
export GEMINI_API_KEY="your_google_api_key"
Shell
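Before running the workflow, it can help to verify that both keys are actually set. A small optional check, not part of the tutorial code itself:

```python
import os

# List any keys that are missing so you can catch configuration
# problems before the workflow makes its first API call.
required = ("LLAMA_CLOUD_API_KEY", "GEMINI_API_KEY")
missing = [key for key in required if not os.getenv(key)]
for key in missing:
    print(f"warning: {key} is not set")
```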
Step 1: Create and use the parser
The first step in your workflow is parsing. You create a LlamaParse client backed by Gemini 3.1 Pro and define it in sources.py so you can inject it into your workflow as a resource:
# import paths may vary by package version
import os
from llama_cloud_services import LlamaParse, ResultType

def get_llama_parse() -> LlamaParse:
    return LlamaParse(
        api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
        parse_mode="parse_page_with_agent",
        model="gemini-3.1-pro",
        result_type=ResultType.MD,
    )
Python
The parse_page_with_agent mode applies a layer of agentic iteration guided by Gemini to correct and format OCR results based on visual context.
In workflow.py, define the events, state, and the parsing step:
class BrokerageStatementWorkflow(Workflow):
    @step
    async def parse_file(
        self,
        ev: FileEvent,
        ctx: Context[WorkflowState],
        parser: Annotated[LlamaParse, Resource(get_llama_parse)],
    ) -> ParsingDoneEvent | OutputEvent:
        result = cast(ParsingJobResult, await parser.aparse(file_path=ev.input_file))
        async with ctx.store.edit_state() as state:
            state.parsing_job_result = result
        return ParsingDoneEvent()
Python
Notice that you don't process parsing results immediately. Instead, you store them in the global WorkflowState so they're accessible for the extraction steps that follow.
Step 2: Extract the text and tables
To provide the LLM with the context required to explain the financial statement, you need to extract the full markdown text and the tabular data. Add the extraction steps to your BrokerageStatementWorkflow class (see the full implementation in workflow.py):
@step
async def extract_text(
    self, ev: ParsingDoneEvent, ctx: Context[WorkflowState]
) -> TextExtractionDoneEvent:
    # Extraction logic omitted for brevity. See repo.
    ...

@step
async def extract_tables(
    self, ev: ParsingDoneEvent, ctx: Context[WorkflowState], ...
) -> TablesExtractionDoneEvent:
    # Extraction logic omitted for brevity. See repo.
    ...
Python
Because both steps listen for the same ParsingDoneEvent, LlamaIndex Workflows automatically executes them in parallel. This means your text and table extractions run concurrently, cutting overall pipeline latency and making the architecture naturally scalable as you add more extraction tasks.
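You can see the latency benefit of this fan-out with a plain asyncio sketch. The sleeps below are illustrative stand-ins for the two extraction steps, not the real extraction logic:

```python
import asyncio
import time

async def extract_text() -> str:
    await asyncio.sleep(0.2)  # stand-in for markdown text extraction
    return "markdown text"

async def extract_tables() -> list[str]:
    await asyncio.sleep(0.2)  # stand-in for table extraction
    return ["positions", "transactions"]

async def run_parallel():
    start = time.perf_counter()
    # gather() runs both coroutines concurrently, the same way the
    # workflow engine fans out steps that share a trigger event.
    text, tables = await asyncio.gather(extract_text(), extract_tables())
    return text, tables, time.perf_counter() - start

text, tables, elapsed = asyncio.run(run_parallel())
# elapsed is ~0.2 s rather than the ~0.4 s a sequential run would take
```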
Step 3: Generate the summary
With the data extracted, you can prompt Gemini to generate a summary in accessible, non-technical language.
Configure the LLM client and prompt template in sources.py. Here, you use Gemini 3 Flash for the final summarization, since it offers low latency and cost efficiency for text aggregation tasks.
The final synthesis step uses ctx.collect_events to wait for both extractions to complete before calling the Gemini API.
@step
async def ask_llm(
    self,
    ev: TablesExtractionDoneEvent | TextExtractionDoneEvent,
    ctx: Context[WorkflowState],
    llm: Annotated[GenAIClient, Resource(get_llm)],
    template: Annotated[Template, Resource(get_prompt_template)],
) -> OutputEvent | None:
    if ctx.collect_events(ev, [TablesExtractionDoneEvent, TextExtractionDoneEvent]) is None:
        return None
    # Full prompt and LLM call available in repo.
Python
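Conceptually, collect_events buffers incoming events and returns None until one event of each expected type has arrived. A simplified stdlib illustration of that buffering pattern, not the real LlamaIndex implementation:

```python
class TextExtractionDoneEvent: ...
class TablesExtractionDoneEvent: ...

class EventCollector:
    """Toy version of the buffering that ctx.collect_events performs."""
    def __init__(self):
        self.buffer = []

    def collect(self, ev, expected_types):
        self.buffer.append(ev)
        # Return the batch only once every expected type is present.
        if all(any(isinstance(e, t) for e in self.buffer) for t in expected_types):
            return list(self.buffer)
        return None  # still waiting on the other branch

collector = EventCollector()
expected = [TablesExtractionDoneEvent, TextExtractionDoneEvent]
first = collector.collect(TextExtractionDoneEvent(), expected)     # None: tables pending
second = collector.collect(TablesExtractionDoneEvent(), expected)  # both events
```

This is why ask_llm returns None on the first invocation: the step simply runs again when the second extraction event arrives.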
Running the workflow
To tie it all together, the main.py entry point creates and runs the workflow:
wf = BrokerageStatementWorkflow(timeout=600)
result = await wf.run(start_event=FileEvent(input_file=input_file))
Python
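Since run is a coroutine, main.py needs an asyncio entry point around those two lines. A self-contained sketch is below; the stub classes stand in for the real imports from workflow.py so the snippet runs on its own:

```python
import asyncio
import sys
from dataclasses import dataclass

# Stubs standing in for the real workflow.py imports.
@dataclass
class FileEvent:
    input_file: str

class BrokerageStatementWorkflow:
    def __init__(self, timeout: int = 600):
        self.timeout = timeout

    async def run(self, start_event: FileEvent) -> str:
        return f"Summary of {start_event.input_file}"

async def main(input_file: str) -> str:
    wf = BrokerageStatementWorkflow(timeout=600)
    return await wf.run(start_event=FileEvent(input_file=input_file))

if __name__ == "__main__" and len(sys.argv) > 1:
    print(asyncio.run(main(sys.argv[1])))
```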
To test the workflow, download a sample statement from the LlamaIndex datasets:
curl -L https://raw.githubusercontent.com/run-llama/llama-datasets/main/llama_agents/bank_statements/brokerage_statement.pdf > brokerage_statement.pdf
Shell
# Using pip
python3 main.py brokerage_statement.pdf

# Using uv
uv run run-workflow brokerage_statement.pdf
Shell
You now have a fully functional personal finance assistant running in your terminal, capable of analyzing complex financial PDFs.
Next steps
AI pipelines are only as good as the data you feed them. By combining Gemini 3.1 Pro's multimodal reasoning with LlamaParse's agentic ingestion, you ensure your applications have the full, structured context they need, not just flattened text.
When you base your architecture on event-driven statefulness, like the parallel extractions demonstrated here, you build systems that are fast, scalable, and resilient. Double-check outputs before relying on them.
Ready to implement this in production? Explore LlamaParse and the Gemini API documentation to experiment with multimodal generation, and dive into the full code in the GitHub repository.






