Should emit only valid XHTML #2

Open
hacst opened this issue Jan 6, 2018 · 1 comment

Comments

Owner

hacst commented Jan 6, 2018

To keep the tool simple we just take the HTML code reddit gives us and pass it on to the EPUB creator library as is. However, EPUB only allows XHTML. Because of this there are bound to be various HTML-specific things (e.g. non-breaking spaces) that leak into the generated EPUB. Most readers are quite lenient about this, but iBooks seems to take the XHTML part seriously.

Ideally we should only emit valid XHTML to make the EPUB as compatible as possible. However, this is probably only worth it if it is reasonably simple to implement.

As a workaround we could search and replace some of the commonly broken things (e.g. nbsp). This is obviously very brittle but might be a decent tradeoff if we can assume the HTML coming out of reddit is restricted to begin with.

The validity of an EPUB can easily be checked by opening it with the Calibre book editor and using Tools->Check Book (F7). There's also https://github.com/IDPF/epubcheck which seems to be quite useful.

hacst added a commit that referenced this issue Jan 6, 2018
EPUBs require valid XHTML which isn't what we get from reddit (see #2).
For now we solve the problem we actually encounter in the wild and that
is HTML entities (e.g. &nbsp;, though reddit also forwards others like
&pound;) slipping into the XHTML. This causes readers that take XHTML
seriously, like iBooks, to abort parsing.

To solve this we now replace named entities from HTML4 with their
corresponding numbered variant. Another option would have been to declare
the entities to make them known to XHTML but we cannot easily inject
that into the template.

Obviously just doing this doesn't guarantee valid XHTML by a long shot so
this is just a first step. Should we encounter other issues in the wild
we can consider taking more extensive measures.
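
The entity replacement described in the commit message could look roughly like the following. This is a minimal sketch and not the code from the commit: the function name `replaceNamedEntities` and the `HTML4_ENTITIES` table are made up for illustration, and the table lists only a handful of the HTML4 named entities. The five entities predefined by XML (`amp`, `lt`, `gt`, `quot`, `apos`) are left alone since they are already valid in XHTML.

```javascript
// Hypothetical, deliberately incomplete table: HTML4 entity name -> Unicode code point.
// A real implementation would need the complete HTML4 entity list.
const HTML4_ENTITIES = new Map([
  ['nbsp', 160],
  ['pound', 163],
  ['copy', 169],
  ['eacute', 233],
  ['ndash', 8211],
  ['mdash', 8212],
  ['hellip', 8230],
  ['euro', 8364],
]);

// Entities predefined by XML itself; these are valid in XHTML and stay untouched.
const XML_PREDEFINED = new Set(['amp', 'lt', 'gt', 'quot', 'apos']);

// Replace known named entities with their numeric character references.
function replaceNamedEntities(html) {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_PREDEFINED.has(name)) return match;
    const codePoint = HTML4_ENTITIES.get(name);
    // Unknown names are left as-is so a validator can still flag them.
    return codePoint === undefined ? match : `&#${codePoint};`;
  });
}

console.log(replaceNamedEntities('a&nbsp;b costs &pound;5 &amp; up'));
// prints "a&#160;b costs &#163;5 &amp; up"
```

As noted in the commit message, this only addresses entities and does not make the markup valid XHTML in general.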
Owner Author

hacst commented Jan 6, 2018

W.r.t. converting HTML to XHTML, JavaScript seems to offer us a simple and portable way to do so, as explained in https://stackoverflow.com/a/12092919 . This did not properly cope with entities (e.g. &nbsp; turned into \n) and adds a surrounding tag, but it is probably the way to go if we want actual XHTML.
