Commit a674b30
Fix line diff by using runes without separators
[The suggested approach](https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs#line-mode)
for line-level diffing is the following set of steps:
1. `ti1, ti2, linesIdx = DiffLinesToChars(t1, t2)`
2. `diffs = DiffMain(ti1, ti2)`
3. `DiffCharsToLines(diffs, linesIdx)`
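
The encoding and decoding halves of these steps (1 and 3) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's code: `linesToRunes` and `runesToLines` are hypothetical helpers, and the character diff of step 2 is left out. It assumes small inputs, so the surrogate range is not handled here.

```go
package main

import (
	"fmt"
	"strings"
)

// linesToRunes is a hypothetical stand-in for step 1. It replaces every
// distinct line of a and b with a single rune and returns both encoded
// strings plus the table needed to decode them again.
func linesToRunes(a, b string) (string, string, []string) {
	table := []string{""} // index 0 is reserved so no line maps to rune 0
	seen := map[string]int{}
	encode := func(text string) string {
		var sb strings.Builder
		for _, line := range strings.SplitAfter(text, "\n") {
			if line == "" {
				continue // trailing empty piece after the final newline
			}
			idx, ok := seen[line]
			if !ok {
				idx = len(table)
				seen[line] = idx
				table = append(table, line)
			}
			sb.WriteRune(rune(idx)) // one rune per line index
		}
		return sb.String()
	}
	ea := encode(a)
	eb := encode(b)
	return ea, eb, table
}

// runesToLines is a hypothetical stand-in for step 3: it expands an
// encoded string back into the text it represents.
func runesToLines(encoded string, table []string) string {
	var sb strings.Builder
	for _, r := range encoded {
		sb.WriteString(table[r])
	}
	return sb.String()
}

func main() {
	t1 := "alpha\nbeta\ngamma\n"
	t2 := "alpha\ndelta\ngamma\n"
	e1, e2, table := linesToRunes(t1, t2)
	fmt.Printf("encoded: %q vs %q\n", e1, e2)
	fmt.Printf("decoded: %q\n", runesToLines(e2, table))
}
```

Because each line becomes exactly one rune, a character-level diff of the two encoded strings is effectively a line-level diff of the original texts.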
The original implementation in `google/diff-match-patch` stores the
indices in `ti1` and `ti2` as Unicode code points joined by an empty string.
The current implementation in this repo stores them as integers joined by a
comma. While this makes `ti1` and `ti2` more readable, it introduces bugs
in line-level diffing with `DiffMain`. The root cause is that an integer
line index can span more than one character/rune, so `DiffMain` can treat
two different lines whose indices share a prefix as a partial match. For
example, indices 123 and 129 partially match on `12`, and the diff then
reports lines 3 and 9, which is not correct. A simple failing test case
demonstrating the issue is available in `TestDiffPartialLineIndex`.
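
The prefix problem can be reproduced in isolation. The sketch below does not call `DiffMain`; `commonPrefixLen` is a hypothetical helper that only mimics the kind of character-level comparison a character diff performs.

```go
package main

import "fmt"

// commonPrefixLen counts the leading runes shared by a and b.
func commonPrefixLen(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	n := 0
	for n < len(ra) && n < len(rb) && ra[n] == rb[n] {
		n++
	}
	return n
}

func main() {
	// Decimal indices: lines 123 and 129 are different lines, yet their
	// encodings share the two characters "12".
	fmt.Println(commonPrefixLen("123", "129")) // 2

	// One rune per index: two different indices never share a prefix.
	fmt.Println(commonPrefixLen(string(rune(123)), string(rune(129)))) // 0
}
```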
In this PR I am adjusting the algorithm to use the same approach as
[diff-match-patch](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L508-L510)
by storing each line index as a rune.
While a rune in Go is a type alias for `int32`, not every `int32` value
is a valid rune. When converting between strings and rune slices, invalid
runes are replaced with `utf8.RuneError`.
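
A small sketch of that behaviour, using only the standard library. `roundTrip` is a hypothetical helper introduced for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTrip converts a rune to a string and back, as happens when line
// indices are stored in a string and later read back as runes.
func roundTrip(r rune) rune {
	return []rune(string(r))[0]
}

func main() {
	var r rune = 'A'
	var i int32 = r // no conversion needed: rune is an alias for int32
	fmt.Println(i)

	fmt.Println(utf8.ValidRune(0xD800))              // false: surrogate
	fmt.Println(roundTrip(0xD800) == utf8.RuneError) // true: replaced
	fmt.Println(roundTrip(0x1F600) == 0x1F600)       // true: preserved
}
```

An index that lands on an invalid rune would therefore come back as U+FFFD and could no longer be told apart from other invalid indices.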
The integer to rune generation logic is based on the table in https://en.wikipedia.org/wiki/UTF-8#Encoding
The first 127 lines will work the fastest as each index is represented as
a single byte. Higher numbers are represented as 2-4 bytes.
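
The byte widths can be checked with `utf8.RuneLen` from the standard library. `encodedLen` below is a hypothetical wrapper added only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many UTF-8 bytes the rune for a given value
// needs. It returns -1 when the value is not a valid rune.
func encodedLen(value int) int {
	return utf8.RuneLen(rune(value))
}

func main() {
	for _, v := range []int{1, 127, 128, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF} {
		fmt.Printf("U+%04X -> %d byte(s)\n", v, encodedLen(v))
	}
}
```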
In addition to that, the range `U+D800 - U+DFFF` contains
[invalid codepoints](https://en.wikipedia.org/wiki/UTF-8#Invalid_sequences_and_error_handling),
so all codepoints greater than or equal to `0xD800` are shifted up by the
size of that range, `0xDFFF - 0xD800 + 1`, so that they skip it.
The maximum representable integer using this approach is 1'112'060.
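
One way to skip the invalid range is sketched below. This is an illustration and not the code from this commit: `indexToRune` and `runeToIndex` are hypothetical names, and the sketch assumes the shift equals the full inclusive size of the surrogate range (0x800 code points). Its upper bound is simply what falls out of that shift and is not meant to reproduce the exact limit quoted above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	surrogateStart = 0xD800
	surrogateEnd   = 0xDFFF
	surrogateSize  = surrogateEnd - surrogateStart + 1 // 0x800 code points
)

// indexToRune maps a line index to a rune, skipping the surrogate range.
// ok is false when the index is too large to be represented.
func indexToRune(index uint32) (r rune, ok bool) {
	if index > utf8.MaxRune-surrogateSize {
		return utf8.RuneError, false
	}
	if index >= surrogateStart {
		index += surrogateSize // jump over U+D800..U+DFFF
	}
	return rune(index), true
}

// runeToIndex reverses indexToRune for runes it produced.
func runeToIndex(r rune) uint32 {
	if r > surrogateEnd {
		return uint32(r) - surrogateSize
	}
	return uint32(r)
}

func main() {
	for _, idx := range []uint32{1, 0xD7FF, 0xD800, 70000} {
		r, ok := indexToRune(idx)
		fmt.Printf("index %d -> U+%04X (ok=%v) -> index %d\n", idx, r, ok, runeToIndex(r))
	}
}
```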
This improves on the JavaScript implementation, which currently
[bails out](https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503-L505)
when files have more than 65535 lines.
1 parent 74798f5 commit a674b30
File tree: diffmatchpatch
5 files changed, +133 -38 lines changed