$ pip install htmldiffer
Or clone the repository and run it as a module:
$ git clone [email protected]:anastasia/htmldiffer.git
$ cd htmldiffer
$ python -m htmldiffer file_one.html file_two.html
HTMLDiffer takes strings or files and returns three HTML diffs: a deleted diff, an inserted diff, and a combined diff (showing both the deleted and inserted highlights). HTMLDiffer will
- surround any text-level changes with <span class="htmldiffer_[insert|delete]">
- insert htmldiffer classes (class="htmldiffer-tag-change_[insert|delete]") into any tag-level changes (that is, if a tag name has changed, or any attribute inside a tag has changed)
To use this in a library:
from htmldiffer import diff
str_a = "<html><body>Hello world!</body></html>"
str_b = "<html><body>Hello wanda! Hello!</body></html>"
d = diff.HTMLDiffer(str_a, str_b)
print(d.deleted_diff)
# get a string of the HTML with deleted elements highlighted:
# <html><body>Hello <span class="diff_delete">world!</span></body></html>
print(d.inserted_diff)
# get a string of the HTML with inserted elements highlighted:
# <html><body>Hello <span class="diff_insert">wanda! </span><span class="diff_insert">Hello!</span></body></html>
print(d.combined_diff)
# get a string of the HTML with both deleted and inserted elements highlighted:
# <html><body>Hello <span class="diff_delete">world!</span><span class="diff_insert">wanda! </span><span class="diff_insert">Hello!</span></body></html>
That's it!
htmldiffer takes a string or a file of HTML, converts it to a list of string entities[1], and diffs the old and new lists using SequenceMatcher. The result is three versions of the HTML (deleted, inserted, and combined), each with spans wrapping the changed text.
Example:
from htmldiffer.diff import HTMLDiffer
old_html = "<h1>This is a simple header</h1>"
new_html = "<h1>This is a newer, better header</h1>"
d = HTMLDiffer(old_html, new_html)
# single-quoted literals so the double quotes inside the HTML need no escaping
d.deleted_diff == '<h1>This is a <span class="diff_delete">simple </span>header</h1>'
d.inserted_diff == '<h1>This is a <span class="diff_insert">newer, </span><span class="diff_insert">better </span>header</h1>'
d.combined_diff == '<h1>This is a <span class="diff_delete">simple </span><span class="diff_insert">newer, </span><span class="diff_insert">better </span>header</h1>'
[1] An entity can be one of several things:
- A word
- An opening tag:
<li class="list-element" style="some:style;">
- A closing tag:
</li>
- A tag that has been whitelisted (self-closing tags whose changes you want to highlight are recommended here)
- for instance, image tags are whitelisted by default, so the entity will be:
<img src="some/source.jpg"/>
- The entirety of a blacklisted tag (such as a script or head tag, since it's difficult to show changes in those, for now):
<script>The entirety of a script tag will be a single entity</script>
In order to maintain the integrity and structure of the original HTML, we don't remove any whitespace or change the HTML itself in any way before iterating through and wrapping it with span tags.
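As a rough illustration of the entity list and the whitespace guarantee described above, here is a minimal sketch of a splitter. It is not htmldiffer's own html2list: the name split_entities and the regular expression are hypothetical, and keeping trailing whitespace attached to each word is an assumption read off the example output above (such as "simple " inside its span).

```python
import re

# Hypothetical pattern, not the library's own code. Alternatives are tried in order.
ENTITY_RE = re.compile(
    r"<(script|head)\b.*?</\1\s*>"  # a whole blacklisted element as one entity
    r"|<[^>]*>"                     # any other opening, closing, or self-closing tag
    r"|[^<\s]+\s*"                  # a word plus the whitespace that follows it
    r"|\s+"                         # whitespace that has no word in front of it
    r"|<",                          # a stray "<" that never closes
    re.DOTALL | re.IGNORECASE,
)


def split_entities(html):
    """Split html into entities; joining them back reproduces the input exactly."""
    return [match.group(0) for match in ENTITY_RE.finditer(html)]


html = '<ul>\n  <li class="a">Hello  world</li><script>if (a < b) {}</script></ul>'
entities = split_entities(html)
print(entities)
print("".join(entities) == html)
```

Because every character ends up in exactly one entity, the wrapped output can be rebuilt by concatenation alone, with no need to re-serialize the HTML.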
- htmldiffer's diff method calls the html2list method in diff.py, which iterates through the HTML string and spits out a list of entities (see above for explanation).
- diff adds a style string (the default lives in settings.py) to the <head> of the HTML (if a head tag exists) so that our diff highlights show up.
- diff compares the two newly created lists (one for the old HTML string, one for the new HTML string) using SequenceMatcher, and gets a list back describing (using the codes 'replace', 'delete', 'insert', and 'equal'), for each element A, how it got to be element B.
- diff iterates through that list, calling wrap_text to wrap each element according to its change value.
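The comparison step can be tried on its own with the standard library. The sketch below feeds two hand-written entity lists (taken from the header example above) to difflib.SequenceMatcher and prints the change codes; describe_changes is a hypothetical helper name, and htmldiffer may configure the matcher differently.

```python
from difflib import SequenceMatcher

old_entities = ["<h1>", "This ", "is ", "a ", "simple ", "header", "</h1>"]
new_entities = ["<h1>", "This ", "is ", "a ", "newer, ", "better ", "header", "</h1>"]


def describe_changes(old, new):
    """Return (code, old_slice, new_slice) for each opcode SequenceMatcher reports."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (code, old[i1:i2], new[j1:j2])
        for code, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


for code, old_part, new_part in describe_changes(old_entities, new_entities):
    print(code, old_part, new_part)
# equal ['<h1>', 'This ', 'is ', 'a '] ['<h1>', 'This ', 'is ', 'a ']
# replace ['simple '] ['newer, ', 'better ']
# equal ['header', '</h1>'] ['header', '</h1>']
```

A 'replace' code carries both sides of the change, which is why the combined diff shows the deleted span followed by the inserted spans.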
More complexities! How does wrap_text work?
- For each element, if the element is not an HTML tag, it wraps it in a <span> tag with a diff_insert or diff_delete class.
- If the element is an HTML tag, wrap_text will skip the element unless the element is in the settings.WHITELISTED_TAGS list. The reason for that is that we don't want to wrap the <li> opening tag itself, but the changes within that tag.
Things to note:
- HTML <!-- comments --> will be read as a tag and therefore skipped.
- all text that is changed should therefore be wrapped by appropriate span diff tags.
- the default whitelisted tags include the self-closing tags <img> and <input>, which will therefore be wrapped in span diff tags.
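The wrapping rules above can be sketched in a few lines. This is an illustration only, not the library's wrap_text: wrap_entity, is_tag, and tag_name are hypothetical names, and the WHITELISTED_TAGS tuple is an assumed stand-in for settings.WHITELISTED_TAGS based on the defaults mentioned above.

```python
# Assumed stand-in for settings.WHITELISTED_TAGS (defaults named in the notes above).
WHITELISTED_TAGS = ("img", "input")


def is_tag(entity):
    """Treat anything delimited by < and > as a tag, including <!-- comments -->."""
    return entity.startswith("<") and entity.endswith(">")


def tag_name(entity):
    """Return the lowercased tag name, or '' when there is none."""
    parts = entity.strip("<>/ ").split()
    return parts[0].lower() if parts else ""


def wrap_entity(entity, change):
    """Wrap entity in a diff span; change is 'insert' or 'delete'."""
    if is_tag(entity) and tag_name(entity) not in WHITELISTED_TAGS:
        return entity  # non-whitelisted tags and comments are left untouched
    return '<span class="diff_{}">{}</span>'.format(change, entity)


print(wrap_entity("world!", "delete"))
print(wrap_entity("<li>", "insert"))
print(wrap_entity('<img src="a.jpg"/>', "insert"))
```

Leaving structural tags unwrapped keeps the document tree valid, since a span placed around a lone opening <li> would not be closed in the right place.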
This repository is a fork of https://github.com/aaronsw/htmldiff.