Lexoid

Lexoid is an efficient document-parsing library that supports both LLM-based and non-LLM-based (static) parsing of PDF documents.

Motivation:

  • Leverage the multi-modal capabilities of modern LLMs
  • Make document parsing convenient for users while driving innovation
  • Encourage open collaboration under a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables, or create a .env file with these definitions:

OPENAI_API_KEY=""
GOOGLE_API_KEY=""
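
If you keep the keys in a .env file, one way to load them in Python is with python-dotenv. This is an assumption for illustration (the README does not say whether Lexoid loads .env itself); any mechanism that sets these environment variables works:

# Minimal sketch: load API keys from a .env file before using Lexoid.
# python-dotenv is an assumption here, not a Lexoid requirement.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

assert os.getenv("OPENAI_API_KEY") or os.getenv("GOOGLE_API_KEY"), \
    "Set at least one API key for LLM-based parsing"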

Optionally, to use Playwright for retrieving web content when Lexoid is installed from the .whl package (otherwise, regular requests is used by default):

playwright install --with-deps --only-shell chromium

Building .whl from source

To create a .whl file:

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate the virtual environment:

source .venv/bin/activate

Usage

See the Example Notebook or the Example Colab Notebook.

Here's a quick example to parse documents using Lexoid:

from lexoid.api import parse

parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE", raw=True)
# or
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE", raw=True)

print(parsed_md)
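
With raw=False (the default, per the parameter list below), parse returns structured data rather than a single Markdown string; the exact shape of that structure is not documented here. A minimal sketch using only the documented defaults:

# parser_type defaults to "AUTO" and raw defaults to False, so this
# returns structured data; inspect the type to see what you get back.
parsed = parse(pdf_path)
print(type(parsed))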

Parameters

  • path (str): The file path or URL.
  • parser_type (str, optional): The type of parser to use ("LLM_PARSE" or "STATIC_PARSE"). Defaults to "AUTO".
  • raw (bool, optional): Whether to return raw text (True) or structured data (False). Defaults to False.
  • pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
  • max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
  • **kwargs: Additional arguments for the parser.
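
As a hedged sketch of how these parameters combine (the file path and values below are hypothetical, chosen for illustration rather than taken from the Lexoid docs):

from lexoid.api import parse

# Illustrative values only: a hypothetical local PDF, parsed statically
# (no LLM), chunked into 8-page splits across 2 worker threads.
parsed_md = parse(
    "path/to/report.pdf",        # hypothetical file path
    parser_type="STATIC_PARSE",  # skip LLM-based parsing
    raw=True,                    # return raw text instead of structured data
    pages_per_split=8,
    max_threads=2,
)
print(parsed_md)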

Benchmark

Initial results (more updates soon). Note: benchmarks are currently run in a zero-shot scenario.

Rank  Model/Framework                                         Similarity  Time (s)
1     gpt-4o                                                  0.799       21.77
2     gemini-2.0-flash-exp                                    0.797       13.47
3     gemini-exp-1121                                         0.779       30.88
4     gemini-1.5-pro                                          0.742       15.77
5     gpt-4o-mini                                             0.721       14.86
6     gemini-1.5-flash                                        0.702        4.56
7     Llama-3.2-11B-Vision-Instruct (via HF)                  0.582       21.74
8     Llama-3.2-11B-Vision-Instruct-Turbo (via Together AI)   0.556        4.58
9     Llama-3.2-90B-Vision-Instruct-Turbo (via Together AI)   0.527       10.57
10    Llama-Vision-Free (via Together AI)                     0.435        8.42