Summary
Redlines diffs are accurate, but in some cases the output is “noisy” (many small edits, awkward boundaries), slower than necessary on long inputs, or both. Google’s diff-match-patch is widely used partly because it adds post-processing steps geared toward human readability (semantic cleanup + boundary shifting) and exposes tuning knobs (timeout/edit cost). We should evaluate whether similar ideas would improve Redlines.
Motivation / goals
- Improve human readability of highlighted changes (fewer “pepper” edits, cleaner word boundaries).
- Provide predictable behavior on large inputs (avoid worst-case slowdowns).
- Keep a stable internal diff IR so renderers (HTML/Markdown/etc.) are easy to extend and test.
Ideas to investigate (inspired by diff-match-patch)
1) Cleanup stage for readability
- Semantic cleanup: reduce coincidental tiny matches that create fragmented highlights.
- Boundary shifting (“lossless” cleanup): move edit boundaries to whitespace/punctuation/word boundaries for nicer redlines.
- Add a tuning knob (e.g., `readability_level` or `edit_penalty`) to trade granularity for fewer, larger blocks.
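To make the cleanup idea concrete, here is a minimal sketch built on the standard library's `difflib`, not on Redlines internals. It folds short "equal" runs that sit between two edits into one larger replace block. The names `word_ops`, `merge_small_equals`, and the knob `min_equal_tokens` are hypothetical, and the heuristic is much cruder than diff-match-patch's semantic cleanup; it only illustrates where such a stage could sit.

```python
from difflib import SequenceMatcher


def word_ops(a_words, b_words):
    """Word-level diff as a list of (tag, old_tokens, new_tokens)."""
    sm = SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    return [
        (tag, a_words[i1:i2], b_words[j1:j2])
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
    ]


def merge_small_equals(ops, min_equal_tokens=2):
    """Fold short 'equal' islands between two edits into one 'replace'.

    min_equal_tokens is a hypothetical knob: an interior equal run with
    fewer tokens than this is treated as a coincidental match.
    """
    out = []
    for idx, (tag, a, b) in enumerate(ops):
        # difflib opcodes alternate, so an interior 'equal' always has
        # an edit on each side.
        is_island = (
            tag == "equal"
            and len(a) < min_equal_tokens
            and 0 < idx < len(ops) - 1
        )
        if is_island:
            tag = "replace"
        if tag != "equal" and out and out[-1][0] != "equal":
            # Extend the previous edit instead of starting a new one.
            _, prev_a, prev_b = out[-1]
            out[-1] = ("replace", prev_a + a, prev_b + b)
        else:
            out.append((tag, a, b))
    return out


old_words = "the quick brown fox jumps".split()
new_words = "a quick red fox leaps".split()
raw = word_ops(old_words, new_words)
cleaned = merge_small_equals(raw, min_equal_tokens=2)
print(len(raw), "ops before cleanup,", len(cleaned), "after")
```

With `min_equal_tokens=1` nothing is merged, so a default of 1 would preserve current behavior while higher values trade granularity for fewer, larger blocks.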
2) Hierarchical diffing (multi-resolution)
- Coarse alignment first (paragraph/line/sentence), then refine only changed regions at word-level, and optionally char-level inside changed words.
- Goal: cleaner diffs and better performance on long documents.
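A rough sketch of the coarse-then-refine idea, again using `difflib` rather than Redlines' own tokenizer. It aligns by line first and runs a word-level diff only inside line ranges that were replaced. The function name `hierarchical_diff` and the `(level, tag, old, new)` tuple shape are assumptions for illustration. Joining lines before splitting into words loses the line breaks, which a real implementation would need to preserve.

```python
from difflib import SequenceMatcher


def hierarchical_diff(old, new):
    """Line-level alignment first, word-level refinement on changed regions.

    Returns a list of (level, tag, old_tokens, new_tokens) tuples, where
    level is 'line' for coarse ops and 'word' for refined ops.
    """
    old_lines, new_lines = old.splitlines(), new.splitlines()
    coarse = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in coarse.get_opcodes():
        a, b = old_lines[i1:i2], new_lines[j1:j2]
        if tag != "replace":
            # Unchanged, inserted, or deleted lines need no refinement.
            ops.append(("line", tag, a, b))
            continue
        # Refine only the replaced region at word level.
        a_words = " ".join(a).split()
        b_words = " ".join(b).split()
        fine = SequenceMatcher(a=a_words, b=b_words, autojunk=False)
        for t, x1, x2, y1, y2 in fine.get_opcodes():
            ops.append(("word", t, a_words[x1:x2], b_words[y1:y2]))
    return ops


old_text = "alpha beta\ngamma delta\nepsilon"
new_text = "alpha beta\ngamma DELTA\nepsilon"
for op in hierarchical_diff(old_text, new_text):
    print(op)
```

The word-level matcher only ever sees the changed region, which is where the hoped-for savings on long, mostly unchanged documents would come from.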
3) Performance safety knobs
- Add a `timeout`/`max_time_ms` option with a graceful fallback (e.g., skip refinement or return coarser diff) rather than hanging on pathological inputs.
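One possible shape for the fallback, sketched with `time.monotonic` and `difflib`. The function name `diff_with_budget`, the `degraded` flag, and the exact meaning of `max_time_ms` are assumptions. The budget here only gates the word-level refinement; the coarse line-level pass is not interruptible in this sketch, so it bounds extra work rather than total runtime.

```python
import time
from difflib import SequenceMatcher


def diff_with_budget(old, new, max_time_ms=None):
    """Line diff with optional word-level refinement under a time budget.

    Returns (ops, degraded). degraded is True when at least one changed
    region was left at line granularity because the budget ran out.
    """
    deadline = None
    if max_time_ms is not None:
        deadline = time.monotonic() + max_time_ms / 1000.0

    old_lines, new_lines = old.splitlines(), new.splitlines()
    coarse = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    ops, degraded = [], False
    for tag, i1, i2, j1, j2 in coarse.get_opcodes():
        a, b = old_lines[i1:i2], new_lines[j1:j2]
        if tag != "replace":
            ops.append((tag, a, b))
            continue
        if deadline is not None and time.monotonic() >= deadline:
            # Out of budget: keep the coarse op instead of refining.
            degraded = True
            ops.append((tag, a, b))
            continue
        a_words = " ".join(a).split()
        b_words = " ".join(b).split()
        fine = SequenceMatcher(a=a_words, b=b_words, autojunk=False)
        for t, x1, x2, y1, y2 in fine.get_opcodes():
            ops.append((t, a_words[x1:x2], b_words[y1:y2]))
    return ops, degraded


old_text = "one two\nthree"
new_text = "one TWO\nthree"
full_ops, full_degraded = diff_with_budget(old_text, new_text)
fast_ops, fast_degraded = diff_with_budget(old_text, new_text, max_time_ms=0)
print(full_degraded, fast_degraded)
```

Returning a `degraded` flag alongside the result lets callers and renderers tell a coarse diff from a complete one, which seems preferable to silently lowering quality.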
4) Corpus + golden tests
- Add a small curated corpus of “nasty” cases (repeated phrases, whitespace-only changes, punctuation, legal-style text).
- Store expected outputs at the IR level (not just rendered HTML) to lock down behavior.
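A sketch of what an IR-level golden test could look like. The IR shape used here (a list of `{"op", "old", "new"}` dicts) and the names `diff_ir` and `matches_golden` are hypothetical, not the current Redlines IR. The golden value is embedded as a JSON string to keep the example self-contained; in practice it would presumably live in a file next to the corpus case.

```python
import json
from difflib import SequenceMatcher


def diff_ir(old, new):
    """Word-level diff as a JSON-serializable IR (hypothetical shape)."""
    a, b = old.split(), new.split()
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        {"op": tag, "old": a[i1:i2], "new": b[j1:j2]}
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
    ]


def matches_golden(old, new, golden_json):
    """Compare the IR for (old, new) against a stored golden JSON string."""
    return diff_ir(old, new) == json.loads(golden_json)


# Golden output for one corpus case, stored at the IR level.
GOLDEN = """
[
  {"op": "equal",   "old": ["the"], "new": ["the"]},
  {"op": "replace", "old": ["cat"], "new": ["dog"]},
  {"op": "equal",   "old": ["sat"], "new": ["sat"]}
]
"""

print(matches_golden("the cat sat", "the dog sat", GOLDEN))
```

Because the comparison happens before rendering, a change to HTML or Markdown output would not invalidate these goldens, and a change to diff behavior would fail them regardless of renderer.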
Acceptance criteria (initial)
- A prototype cleanup stage demonstrably reduces fragmented edits on a small corpus without regressing “normal” cases.
- Hierarchical diffing reduces runtime on at least one large-input benchmark (or avoids worst-case spikes), while producing comparable or better readability.
- New options are documented and defaults preserve current behavior (or changes are clearly justified + noted).
References
- diff-match-patch README / API docs (semantic cleanup, timeout/edit cost, multi-language usage): https://github.com/google/diff-match-patch
- diff-match-patch wiki/API overview: https://github.com/google/diff-match-patch/wiki/API
Notes
This issue is exploratory: start by collecting representative before/after examples from real Redlines usage and use them to guide heuristics.