In today's data-rich world, valuable insights are often locked away in unstructured text, such as detailed clinical notes, lengthy legal documents, customer feedback threads and evolving news reports. Manually sifting through this information or building bespoke code to process it is time-consuming and error-prone, and using modern large language models (LLMs) naively can introduce errors. What if you could programmatically extract the exact information you need, while ensuring the outputs are structured and reliably tied back to their source?
Today, we're excited to introduce LangExtract, a new open-source Python library designed to empower developers to do exactly that. LangExtract provides a lightweight interface to a variety of LLMs, such as our Gemini models, for processing large volumes of unstructured text into structured information based on your custom instructions, ensuring both flexibility and traceability.
Whether you are working with clinical reports, financial summaries, or any other text-heavy domain, LangExtract offers a flexible and powerful way to unlock the data within.
LangExtract offers a unique combination of capabilities that make it well suited to information extraction:
- Precise source grounding: Every extracted entity is mapped back to its exact character offsets in the source text. As demonstrated in the animations below, this feature provides traceability by visually highlighting each extraction in the original text, making it much easier to evaluate and verify the extracted information.
- Optimized long-context information extraction: Information retrieval from large documents can be challenging. For instance, while LLMs show strong performance on many benchmarks, needle-in-a-haystack tests across million-token contexts show that recall can drop in multi-fact retrieval scenarios. LangExtract is built to address this with a chunking strategy, parallel processing and multiple extraction passes over smaller, focused contexts (see the tuning sketch after this list).
- Interactive visualization: Go from raw text to an interactive, self-contained HTML visualization in minutes. LangExtract makes it easy to review extracted entities in context, with support for exploring thousands of annotations.
- Flexible LLM backends: Work with your preferred models, whether they are cloud-hosted LLMs (like Google's Gemini family) or open-source on-device models.
- Versatile across domains: Define information extraction tasks for any domain with just a few well-chosen examples, without fine-tuning an LLM. LangExtract "learns" your desired output format and can apply it to large, new text inputs. See how it works with this medication extraction example.
- Leveraging LLM world knowledge: In addition to extracting grounded entities, LangExtract can draw on a model's world knowledge to supplement the extracted information. Such information can be explicit (i.e., derived from the source text) or inferred (i.e., derived from the model's inherent world knowledge). The accuracy and relevance of supplementary knowledge, particularly when inferred, depend heavily on the chosen LLM's capabilities and on the precision of the prompt examples guiding the extraction.
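To give a concrete sense of the long-context controls mentioned above, here is a minimal sketch of a tuned call to lx.extract. It reuses the prompt and examples defined in the quick start later in this post, and long_document_text is a placeholder for your own input. The parameters extraction_passes, max_workers, and max_char_buffer follow the project README at the time of writing; verify the exact signature against your installed version.
# Sketch: tuning extraction for a long document (parameter names follow the
# project README; confirm them against your installed version of langextract).
long_result = lx.extract(
    text_or_documents=long_document_text,  # placeholder: e.g., a full report loaded from disk
    prompt_description=prompt,             # prompt and examples as defined in the quick start
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # multiple passes over the text to improve recall
    max_workers=20,         # process chunks in parallel
    max_char_buffer=1000,   # smaller, focused context windows per chunk
)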
Quick start: From Shakespeare to structured objects
Here's how to extract character details from a line of Shakespeare.
First, install the library:
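# Install LangExtract from PyPI (the package name matches the import used below)
pip install langextract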
For more detailed setup instructions, including virtual environments and API key configuration, please see the project README.
Next, define your extraction task. Provide a clear prompt and a high-quality "few-shot" example to guide the model.
import textwrap
import langextract as lx

# 1. Define a concise prompt
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example to guide the model
examples = [
    lx.data.ExampleData(
        text=(
            "ROMEO. But soft! What light through yonder window breaks? It is"
            " the east, and Juliet is the sun."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor"},
            ),
        ],
    )
]

# 3. Run the extraction on your input text
input_text = (
    "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
)

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",
)
The result object contains the extracted entities, which can be saved to a JSONL file. From there, you can generate an interactive HTML file to review the annotations. This visualization is great for demos and for evaluating extraction quality, saving valuable time. It works seamlessly in environments like Google Colab, or it can be saved as a standalone HTML file viewable in your browser.
# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")

# Generate the interactive visualization from the file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html_content)
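Beyond the HTML view, you can inspect extractions and their source offsets directly in Python. The sketch below assumes each extraction exposes a char_interval with start and end character positions, in line with the source-grounding feature described above; attribute names may vary between versions, so check the data model of your installed release.
# Sketch: printing each extraction with its character offsets.
# Assumes `char_interval` carries the source-grounding offsets; confirm the
# attribute names against the data model of your installed version.
for extraction in result.extractions:
    span = extraction.char_interval  # may be None if alignment was not found
    offsets = f"[{span.start_pos}:{span.end_pos}]" if span else "[unaligned]"
    print(f"{extraction.extraction_class}: {extraction.extraction_text!r} "
          f"{offsets} attributes={extraction.attributes}")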
Flexibility for specialized domains
The same principles apply to specialized domains like medicine, finance, engineering or law. The ideas behind LangExtract were first applied to medical information extraction, and the library is effective at processing clinical text. For example, it can identify medications, dosages, and other medication attributes, and then map the relationships between them. This capability was a core part of the research that led to this library, which you can read about in our paper on accelerating medical information extraction.
The animation below shows LangExtract processing clinical text to extract medication-related entities and group them with their source medication.
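For readers who want to try this themselves, here is a hypothetical sketch of a medication extraction task built with the same API as the quick start above. The class names, attributes, and clinical sentences are invented for illustration and are not drawn from the paper or the animation.
# Hypothetical medication-extraction task reusing the lx.extract API shown above.
# All classes, attributes, and example sentences here are illustrative only.
medication_prompt = textwrap.dedent("""\
    Extract medications with their dosage, route, and frequency.
    Use exact text spans from the note and group attributes with their medication.""")

medication_examples = [
    lx.data.ExampleData(
        text="Patient was started on Lisinopril 10 mg orally once daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Lisinopril",
                attributes={
                    "dosage": "10 mg",
                    "route": "oral",
                    "frequency": "once daily",
                },
            ),
        ],
    )
]

med_result = lx.extract(
    text_or_documents="Metformin 500 mg was prescribed twice daily with meals.",
    prompt_description=medication_prompt,
    examples=medication_examples,
    model_id="gemini-2.5-pro",
)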
Demo on structured radiology reporting
To showcase LangExtract's power in a specialized domain, we developed an interactive demonstration of structured radiology reporting called RadExtract, hosted on Hugging Face. This demo shows how LangExtract can process a free-text radiology report and automatically convert its key findings into a structured format, highlighting important findings along the way. This approach matters in radiology, where structured reports enhance clarity, ensure completeness, and improve data interoperability for research and clinical care.
Disclaimer: The medication extraction example and the structured reporting demo above illustrate LangExtract's baseline capability only. They do not represent a finished or approved product, are not intended to diagnose or suggest treatment of any disease or condition, and should not be used for medical advice.
Get started with LangExtract: Resources and next steps
We're excited to see the innovative ways developers will use LangExtract to unlock insights from text. Dive into the documentation, explore the examples in our GitHub repository, and start transforming your unstructured data today.







