Enhanced PDF support: Layout-aware extraction and structure preservation

## Problem

The current PDF implementation (#83) uses basic text extraction (`page.extract_text()`), which works well for simple PDFs but has limitations for complex legal documents:

- **Multi-column layouts** - Text is read left to right across the full page width, so lines from adjacent columns are interleaved
- **Tables** - Pricing schedules and term tables become garbled text
- **Headers/footers** - Repeated text such as "Page X of Y" creates a false positive on every page
- **Structure** - Signature blocks, footnotes, and exhibits lose their formatting
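
To make the multi-column problem concrete, here is a minimal sketch in plain Python. The word dicts mimic the shape of pdfplumber's `extract_words()` output (`text`, `x0`, `top`); the helper names and the fixed `split_x` column boundary are hypothetical, and a real implementation would need to detect the gutter rather than hard-code it.

```python
from typing import Any, Dict, List

Word = Dict[str, Any]  # assumed shape: {"text": str, "x0": float, "top": float}


def naive_order(words: List[Word]) -> List[str]:
    """Read top-to-bottom, then left-to-right across the full page width."""
    ordered = sorted(words, key=lambda w: (round(w["top"]), w["x0"]))
    return [w["text"] for w in ordered]


def column_order(words: List[Word], split_x: float) -> List[str]:
    """Read the whole left column first, then the right column.

    split_x is a hypothetical, hard-coded gutter position.
    """
    left = [w for w in words if w["x0"] < split_x]
    right = [w for w in words if w["x0"] >= split_x]
    return naive_order(left) + naive_order(right)


# Two columns, two lines each.
words = [
    {"text": "Seller", "x0": 50, "top": 100},
    {"text": "Buyer", "x0": 320, "top": 100},
    {"text": "delivers", "x0": 50, "top": 114},
    {"text": "pays", "x0": 320, "top": 114},
]

print(naive_order(words))        # columns interleaved
print(column_order(words, 300))  # left column, then right column
```

The naive order mixes the two columns line by line, which is what shows up as spurious differences when two versions of a contract are compared.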

## Proposed Enhancements

Use pdfplumber features that the existing dependency already provides:

1. **Layout-aware extraction**
   - Use `extract_text(layout=True)` to preserve spacing/columns
   - Or implement reading order detection for proper column flow

2. **Header/footer filtering**
   - Detect and optionally exclude repeated elements
   - Reduce false positives in page-by-page comparisons

3. **Table extraction**
   - Use `extract_tables()` for structured data
   - Preserve table formatting in comparison output

4. **Chunk location tracking**
   - Use the `Chunk.chunk_location` field (already exists)
   - Store bounding box information for each change
   - Enable "this change is on page 3, section 2.1" type reporting

5. **Configuration options**
   ```python
   PDFFile(
       "contract.pdf",
       layout_mode=True,          # Preserve spacing
       exclude_headers=True,      # Filter headers/footers
       extract_tables=True,       # Preserve table structure
       track_positions=True       # Store bounding boxes
   )
   ```
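
As a rough illustration of the header/footer filtering in item 2, the sketch below flags lines that recur near the top or bottom of most pages and drops them. It works on plain lists of lines, so it does not depend on pdfplumber; the function names, the `edge` window, and the `min_fraction` threshold are assumptions for illustration, not part of the current codebase.

```python
import math
import re
from collections import Counter
from typing import List, Set


def _normalize(line: str) -> str:
    """Collapse digits and whitespace so 'Page 3 of 10' matches 'Page 4 of 10'."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def find_repeated_lines(
    pages: List[List[str]], edge: int = 2, min_fraction: float = 0.6
) -> Set[str]:
    """Return normalized lines found in the first/last `edge` lines of most pages."""
    counts: Counter = Counter()
    for lines in pages:
        candidates = lines[:edge] + lines[-edge:]
        # Use a set so a line is counted at most once per page.
        counts.update({_normalize(c) for c in candidates if c.strip()})
    threshold = max(2, math.ceil(min_fraction * len(pages)))
    return {key for key, n in counts.items() if n >= threshold}


def strip_repeated_lines(
    pages: List[List[str]], edge: int = 2, min_fraction: float = 0.6
) -> List[List[str]]:
    """Drop repeated header/footer lines, touching only lines near a page edge."""
    repeated = find_repeated_lines(pages, edge, min_fraction)
    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            near_edge = i < edge or i >= len(lines) - edge
            if near_edge and _normalize(line) in repeated:
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned


pages = [
    ["ACME Master Agreement", "1. Definitions", "Body text A", "Page 1 of 3"],
    ["ACME Master Agreement", "2. Payment", "Body text B", "Page 2 of 3"],
    ["ACME Master Agreement", "3. Term", "Body text C", "Page 3 of 3"],
]
cleaned = strip_repeated_lines(pages)
```

With pdfplumber, the `pages` input could be built from something like `(page.extract_text() or "").splitlines()` for each page in `pdf.pages`. An `exclude_headers=True` option could then call a filter of this kind before chunking, though how it would plug into `PDFFile` is left open here.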

## Use Cases

- Complex legal contracts with multi-column layouts
- Technical documents with tables and diagrams
- Long documents where knowing "page 47, paragraph 3" matters
- Documents where headers/footers cause noise
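
For the location-reporting use case, one possible shape for the value stored in `Chunk.chunk_location` is sketched below. The `PdfLocation` class and its fields are hypothetical and only illustrate the idea; the bounding-box order follows pdfplumber's `(x0, top, x1, bottom)` convention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (x0, top, x1, bottom), the order pdfplumber uses for bounding boxes.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class PdfLocation:
    """Hypothetical value for Chunk.chunk_location; not the project's actual type."""

    page: int                      # 1-based page number
    bbox: Optional[BBox] = None    # position of the chunk on the page, if tracked
    section: Optional[str] = None  # e.g. "2.1", if a section heading was detected

    def describe(self) -> str:
        """Format the location for a human-readable change report."""
        parts = [f"page {self.page}"]
        if self.section:
            parts.append(f"section {self.section}")
        return ", ".join(parts)


loc = PdfLocation(page=3, bbox=(72.0, 144.0, 540.0, 180.0), section="2.1")
print(f"This change is on {loc.describe()}")
```

Keeping the bounding box optional would let the simple extraction path keep working when `track_positions` is off.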

## References

- pdfplumber documentation: https://github.com/jsvine/pdfplumber
- `Chunk.chunk_location` field in `processor.py:L193` (designed for this)
- Related tools: Draftable, Workshare Compare (for comparison)

## Priority

Medium - Current implementation works for simple PDFs. This is an enhancement for power users with complex documents.
