Explore diff-match-patch-style cleanup + hierarchical diffing to improve Redlines readability/perf

## Summary
Redlines diffs are accurate, but in some cases the output is “noisy” (many small edits, awkward boundaries), slower than necessary on long inputs, or both. Google’s diff-match-patch is widely used partly because it adds post-processing steps geared toward human readability (semantic cleanup via `diff_cleanupSemantic`, boundary shifting via `diff_cleanupSemanticLossless`) and exposes tuning knobs (`Diff_Timeout`, `Diff_EditCost`). We should evaluate whether similar ideas would improve Redlines.

## Motivation / goals
- Improve **human readability** of highlighted changes (fewer “pepper” edits, cleaner word boundaries).
- Provide predictable behavior on large inputs (avoid worst-case slowdowns).
- Keep a stable internal **diff IR** so renderers (HTML/Markdown/etc.) are easy to extend and test.

## Ideas to investigate (inspired by diff-match-patch)
### 1) Cleanup stage for readability
- **Semantic cleanup:** reduce coincidental tiny matches that create fragmented highlights.
- **Boundary shifting (“lossless” cleanup):** move edit boundaries to whitespace/punctuation/word boundaries for nicer redlines.
- Add a tuning knob (e.g., `readability_level` or `edit_penalty`) to trade fine-grained edits against fewer, larger blocks.
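
To make the two cleanup ideas concrete, here is a minimal sketch in plain Python. It is not the diff-match-patch algorithm: `absorb_small_equalities` uses a simple length threshold (`edit_penalty` is the hypothetical knob suggested above), and `shift_to_word_boundary` slides a single edit sideways without changing the resulting text. The `(op, text)` tuple format is an assumption for illustration, not the current Redlines IR.

```python
def absorb_small_equalities(ops, edit_penalty=2):
    """Fold short 'equal' runs that sit between two edits into those edits.

    ops is a list of (op, text) pairs, op in {"equal", "delete", "insert"}.
    """
    # Pass 1: relabel short sandwiched equal runs as delete + insert of the same text.
    relabeled = []
    for k, (op, text) in enumerate(ops):
        sandwiched = (
            0 < k < len(ops) - 1
            and ops[k - 1][0] != "equal"
            and ops[k + 1][0] != "equal"
        )
        if op == "equal" and sandwiched and len(text) <= edit_penalty:
            relabeled += [("delete", text), ("insert", text)]
        else:
            relabeled.append((op, text))

    # Pass 2: collapse each run of edits into at most one delete and one insert.
    merged, deleted, inserted = [], "", ""
    for op, text in relabeled + [("equal", "")]:  # sentinel flushes the last run
        if op == "delete":
            deleted += text
        elif op == "insert":
            inserted += text
        else:
            if deleted:
                merged.append(("delete", deleted))
            if inserted:
                merged.append(("insert", inserted))
            deleted, inserted = "", ""
            if text:
                merged.append((op, text))
    return merged


def _clean(left, right):
    """A boundary is 'clean' if it touches whitespace or the end of the text."""
    return (not left) or (not right) or left[-1].isspace() or right[0].isspace()


def shift_to_word_boundary(before, edit, after):
    """Slide one edit sideways, keeping before+edit+after and before+after intact."""
    # Slide as far left as possible.
    while before and edit and before[-1] == edit[-1]:
        before, edit, after = before[:-1], before[-1] + edit[:-1], edit[-1] + after
    best = (before, edit, after)
    best_score = _clean(before, edit) + _clean(edit, after)
    # Step right one character at a time, keeping the best-scoring position.
    while edit and after and edit[0] == after[0]:
        before, edit, after = before + edit[0], edit[1:] + after[0], after[1:]
        score = _clean(before, edit) + _clean(edit, after)
        if score >= best_score:  # ties prefer the later position
            best, best_score = (before, edit, after), score
    return best
```

For example, `shift_to_word_boundary("The c", "at c", "ame.")` moves the inserted span so that it covers `"cat "` rather than `"at c"`. A real implementation would likely need a richer boundary score (punctuation, line breaks) and a threshold relative to the size of the surrounding edits.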

### 2) Hierarchical diffing (multi-resolution)
- Coarse alignment first (paragraph/line/sentence), then refine only changed regions at word-level, and optionally char-level inside changed words.
- Goal: cleaner diffs and better performance on long documents.
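
A rough sketch of the coarse-then-fine idea, using only the standard library's `difflib.SequenceMatcher`. The function name `hierarchical_diff`, the two levels chosen (line, then word) and the `(op, text)` output format are assumptions for illustration; Redlines' actual tokenizer and IR may differ.

```python
import re
from difflib import SequenceMatcher


def _opcodes(a, b):
    """Yield (tag, old_items, new_items) for two token lists."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        yield tag, a[i1:i2], b[j1:j2]


def _tokens(text):
    """Split into words and whitespace runs; joining the tokens restores text."""
    return re.findall(r"\S+|\s+", text)


def hierarchical_diff(old, new):
    """Align by line first, then refine only replaced line blocks at word level."""
    result = []
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    for tag, ol, nl in _opcodes(old_lines, new_lines):
        if tag == "equal":
            result.append(("equal", "".join(ol)))
        elif tag == "delete":
            result.append(("delete", "".join(ol)))
        elif tag == "insert":
            result.append(("insert", "".join(nl)))
        else:  # "replace": the only case that pays for a word-level pass
            for wtag, ow, nw in _opcodes(_tokens("".join(ol)), _tokens("".join(nl))):
                if wtag == "equal":
                    result.append(("equal", "".join(ow)))
                else:
                    if ow:
                        result.append(("delete", "".join(ow)))
                    if nw:
                        result.append(("insert", "".join(nw)))
    return result
```

Unchanged lines never reach the word-level matcher, which is where the performance gain on long documents would come from. Whether this actually beats the current approach needs to be measured on a benchmark.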

### 3) Performance safety knobs
- Add a `timeout` / `max_time_ms` option with a graceful fallback (e.g., skip refinement or return a coarser diff) rather than hanging on pathological inputs.
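
One possible shape for the fallback, as a sketch only: compute a coarse line-level diff, then refine changed blocks at word level until the time budget runs out, and report whether the result was degraded. `diff_with_budget`, `max_time_ms` and the injectable `clock` parameter are hypothetical names. The budget is only checked between refinement steps, so a single slow step can still overrun it.

```python
import re
import time
from difflib import SequenceMatcher


def _diff(a, b):
    """Diff two token lists into (op, text) pairs."""
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(("equal", "".join(a[i1:i2])))
            continue
        if i1 != i2:
            out.append(("delete", "".join(a[i1:i2])))
        if j1 != j2:
            out.append(("insert", "".join(b[j1:j2])))
    return out


def diff_with_budget(old, new, max_time_ms=None, clock=time.monotonic):
    """Return (ops, degraded); degraded is True if some refinement was skipped."""
    deadline = None if max_time_ms is None else clock() + max_time_ms / 1000
    coarse = _diff(old.splitlines(keepends=True), new.splitlines(keepends=True))
    result, degraded, k = [], False, 0
    while k < len(coarse):
        op, text = coarse[k]
        is_replace = (
            op == "delete" and k + 1 < len(coarse) and coarse[k + 1][0] == "insert"
        )
        if not is_replace:
            result.append((op, text))
            k += 1
            continue
        if deadline is not None and clock() >= deadline:
            # Out of time: keep the coarse delete/insert pair as-is.
            degraded = True
            result += coarse[k:k + 2]
        else:
            old_words = re.findall(r"\S+|\s+", text)
            new_words = re.findall(r"\S+|\s+", coarse[k + 1][1])
            result += _diff(old_words, new_words)
        k += 2
    return result, degraded
```

Returning the `degraded` flag lets a renderer tell the user that the diff is coarser than usual instead of silently changing granularity.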

### 4) Corpus + golden tests
- Add a small curated corpus of “nasty” cases (repeated phrases, whitespace-only changes, punctuation, legal-style text).
- Store expected outputs at the **IR level** (not just rendered HTML) to lock down behavior.
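
A sketch of what an IR-level golden test could look like. The `Op` dataclass, the JSON layout and the helper names are hypothetical, not the existing Redlines IR. The point is that expected outputs are stored as data and checked for two invariants (the ops must rebuild both the old and the new text) before being compared against the golden file.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Op:
    kind: str  # "equal", "delete", or "insert"
    text: str


def to_golden(ops):
    """Serialize ops to a stable, reviewable JSON string."""
    return json.dumps([asdict(op) for op in ops], indent=2, ensure_ascii=False)


def from_golden(payload):
    """Load ops back from a golden JSON string."""
    return [Op(**item) for item in json.loads(payload)]


def check_invariants(old, new, ops):
    """Raise AssertionError unless ops rebuild both old and new exactly."""
    rebuilt_old = "".join(op.text for op in ops if op.kind != "insert")
    rebuilt_new = "".join(op.text for op in ops if op.kind != "delete")
    assert rebuilt_old == old, "ops do not reproduce the old text"
    assert rebuilt_new == new, "ops do not reproduce the new text"


def assert_matches_golden(old, new, ops, golden_payload):
    """Check invariants first, then compare against the stored expectation."""
    check_invariants(old, new, ops)
    assert ops == from_golden(golden_payload), "IR differs from golden file"
```

Storing one JSON file per corpus case keeps golden diffs readable in code review, and the invariant check catches lossy cleanup heuristics even before a golden file exists.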

## Acceptance criteria (initial)
- A prototype cleanup stage demonstrably reduces fragmented edits on a small corpus without regressing “normal” cases.
- Hierarchical diffing reduces runtime on at least one large-input benchmark (or avoids worst-case spikes), while producing comparable or better readability.
- New options are documented and defaults preserve current behavior (or changes are clearly justified + noted).

## References
- diff-match-patch README / API docs (semantic cleanup, timeout/edit cost, multi-language usage): https://github.com/google/diff-match-patch
- diff-match-patch wiki/API overview: https://github.com/google/diff-match-patch/wiki/API

## Notes
This issue is exploratory: start by collecting representative before/after examples from real Redlines usage and use them to guide heuristics.
