Should emit only valid XHTML #2

hacst · 2018-01-06T01:31:09Z

To keep the tool simple, we just take the HTML code reddit gives us and pass it on to the EPUB creator library as is. However, EPUB only allows XHTML. Because of this, there are bound to be various HTML-specific things (e.g. non-breaking spaces) that leak into the generated EPUB. Most readers are quite lenient about this, but iBooks seems to take the XHTML part seriously.

Ideally we should only emit valid XHTML to make the EPUB as compatible as possible. However, this is probably only worth it if it is reasonably simple to implement.

As a workaround, we could search and replace some of the commonly broken things (e.g. nbsp). This is obviously very brittle, but might be a decent tradeoff if we can assume the HTML coming out of reddit is restricted to begin with.

Validity of an EPUB can easily be checked by opening it with the Calibre book editor and using Tools->Check Book (F7). There's also https://github.com/IDPF/epubcheck which seems to be quite useful.

EPUBs require valid XHTML, which isn't what we get from reddit (see #2). For now we solve the problem we actually encounter in the wild: HTML entities (e.g. &nbsp;, though reddit also forwards others like &pound;) slipping into the XHTML. This causes readers that take XHTML seriously, such as iBooks, to abort parsing. To solve this, we now replace named entities from HTML4 with their corresponding numbered variant. Another option would have been to declare the entities to make them known to XHTML, but we cannot easily inject that into the template. Obviously, just doing this doesn't guarantee valid XHTML by a long shot, so this is only a first step. Should we encounter other issues in the wild, we can consider taking more extensive measures.
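
A minimal sketch of what such a replacement could look like, assuming plain JavaScript. The function name `replaceNamedEntities` and the small table are hypothetical and cover only a handful of the HTML4 entities; this is not the code actually used in the tool. The five entities XML predefines are left alone because they are already valid in XHTML.

```javascript
// Hypothetical subset of the HTML4 named entities (the full list has roughly 250).
// A Map is used so names like "constructor" cannot hit Object.prototype.
const NAMED_TO_CODE_POINT = new Map([
  ["nbsp", 160],
  ["pound", 163],
  ["copy", 169],
  ["eacute", 233],
  ["mdash", 8212],
  ["hellip", 8230],
]);

// Entities predefined by XML itself; these need no replacement.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Replace known named entities with their numeric character reference.
function replaceNamedEntities(html) {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_PREDEFINED.has(name)) return match;
    const codePoint = NAMED_TO_CODE_POINT.get(name);
    // Unknown names are left untouched rather than guessed.
    return codePoint === undefined ? match : `&#${codePoint};`;
  });
}

console.log(replaceNamedEntities("a&nbsp;b &pound;5 &amp; more"));
// prints "a&#160;b &#163;5 &amp; more"
```

Entities missing from the table pass through unchanged, so a validator such as epubcheck would still flag them.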

hacst · 2018-01-06T03:47:49Z

W.r.t. converting HTML to XHTML, JavaScript seems to offer us a simple and portable way to do so, as explained in https://stackoverflow.com/a/12092919 . When tried, this did not properly cope with entities (e.g. &nbsp; turned into \n) and added a surrounding tag, but it is probably the way to go if we want actual XHTML.
