Problem
The current PDF implementation (#83) uses basic text extraction (page.extract_text()), which works well for simple PDFs but has limitations for complex legal documents:
- Multi-column layouts - Text is read left-to-right across the whole page, mixing columns
- Tables - Pricing schedules and terms become garbled text
- Headers/footers - "Page X of Y" creates false positives on every page
- Structure - Signature blocks, footnotes, exhibits lose formatting
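The header/footer point can be illustrated with a small self-contained sketch. It uses Python's standard difflib rather than this project's comparison code, and the page texts are invented for illustration: the body is identical in both versions, but the second version has one more page in total, so every "Page X of Y" footer differs.

```python
import difflib

# Invented page texts: same body, but version B reports 4 total pages instead of 3.
version_a = [f"Clause {n} text.\nPage {n} of 3" for n in range(1, 4)]
version_b = [f"Clause {n} text.\nPage {n} of 4" for n in range(1, 4)]


def pages_with_changes(old_pages, new_pages):
    """Return the 1-based page numbers whose lines differ between versions."""
    flagged = []
    for number, (old, new) in enumerate(zip(old_pages, new_pages), start=1):
        diff = difflib.ndiff(old.splitlines(), new.splitlines())
        # ndiff marks removed lines with "- " and added lines with "+ ".
        if any(line.startswith(("- ", "+ ")) for line in diff):
            flagged.append(number)
    return flagged


# Every page is flagged even though no clause text changed.
print(pages_with_changes(version_a, version_b))
```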
Proposed Enhancements
Leverage pdfplumber's advanced features (already available in the dependency):
- Layout-aware extraction
  - Use extract_text(layout=True) to preserve spacing/columns
  - Or implement reading order detection for proper column flow
- Header/footer filtering
  - Detect and optionally exclude repeated elements
  - Reduce false positives in page-by-page comparisons
- Table extraction
  - Use extract_tables() for structured data
  - Preserve table formatting in comparison output
- Chunk location tracking
  - Utilize Chunk.chunk_location field (already exists)
  - Store bounding box information for each change
  - Enable "this change is on page 3, section 2.1" type reporting
- Configuration options

      PDFFile(
          "contract.pdf",
          layout_mode=True,       # Preserve spacing
          exclude_headers=True,   # Filter headers/footers
          extract_tables=True,    # Preserve table structure
          track_positions=True    # Store bounding boxes
      )
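As a rough sketch of how the layout, table, and header/footer items could fit together, the snippet below reads each page with pdfplumber's `extract_text(layout=True)` and `extract_tables()`, then drops lines that recur on most pages. The function names (`extract_pages`, `repeated_lines`, `strip_repeated`) and the `min_fraction` threshold are assumptions made for illustration, not part of the existing PDFFile code. Masking digits is a simple heuristic so that "Page 3 of 10" and "Page 4 of 10" count as the same line; it could also remove numbered body lines that happen to repeat on most pages.

```python
import re
from collections import Counter


def _normalize(line):
    """Strip whitespace and mask digits so page counters compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def repeated_lines(pages, min_fraction=0.6):
    """Return normalized lines that occur on at least min_fraction of pages.

    pages is a list of pages, each a list of text lines (hypothetical shape).
    """
    counts = Counter()
    for lines in pages:
        # A set ensures each line is counted at most once per page.
        counts.update({_normalize(line) for line in lines if line.strip()})
    threshold = max(2, min_fraction * len(pages))
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeated(pages, min_fraction=0.6):
    """Remove likely headers/footers from every page."""
    repeated = repeated_lines(pages, min_fraction)
    return [
        [line for line in lines if _normalize(line) not in repeated]
        for lines in pages
    ]


def extract_pages(path, layout=True):
    """Read text lines and tables per page using pdfplumber."""
    import pdfplumber  # third-party; imported lazily so the helpers above work without it

    text_pages, table_pages = [], []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text(layout=layout) or ""
            text_pages.append(text.splitlines())
            # Each table is a list of rows; each row is a list of cell values.
            table_pages.append(page.extract_tables())
    return text_pages, table_pages
```

A caller could then run `text_pages, table_pages = extract_pages("contract.pdf")` followed by `strip_repeated(text_pages)` before handing the text to the comparison step.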
Use Cases
- Complex legal contracts with multi-column layouts
- Technical documents with tables and diagrams
- Long documents where knowing "page 47, paragraph 3" matters
- Documents where headers/footers cause noise
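For the "page 47, paragraph 3" use case, one possible shape for position tracking is sketched below. It assumes word dictionaries with `x0`, `x1`, `top`, and `bottom` keys, which is the form pdfplumber's `page.extract_words()` returns. `ChunkLocation` and `locate` are hypothetical names for illustration and are not the project's actual `Chunk.chunk_location` type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkLocation:
    """Hypothetical location record: a page number plus a bounding box in PDF points."""
    page: int
    x0: float
    top: float
    x1: float
    bottom: float

    def describe(self):
        """Human-readable position for use in a change report."""
        return f"page {self.page}, {self.top:.0f}pt from top"


def locate(page_number, words):
    """Build the smallest box that encloses all words of one chunk."""
    if not words:
        raise ValueError("a chunk needs at least one word to be located")
    return ChunkLocation(
        page=page_number,
        x0=min(w["x0"] for w in words),
        top=min(w["top"] for w in words),
        x1=max(w["x1"] for w in words),
        bottom=max(w["bottom"] for w in words),
    )


# Invented coordinates standing in for two words from page.extract_words().
words = [
    {"text": "Section", "x0": 72.0, "x1": 110.0, "top": 100.0, "bottom": 112.0},
    {"text": "2.1", "x0": 114.0, "x1": 130.0, "top": 100.0, "bottom": 112.0},
]
print(locate(3, words).describe())
```

Mapping a box to a label such as "section 2.1" would need an additional heading-detection step that this sketch does not attempt.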
References
- pdfplumber documentation: https://github.com/jsvine/pdfplumber
- Chunk.chunk_location field in processor.py:L193 (designed for this)
- Related tools: Draftable, Workshare Compare (for comparison)
Priority
Medium - Current implementation works for simple PDFs. This is an enhancement for power users with complex documents.