Consider a better support of ZIM files without books in HTML #95

Open
kelson42 opened this issue Nov 12, 2019 · 14 comments

@kelson42
Contributor

kelson42 commented Nov 12, 2019

I think we should consider better support for ZIM files without HTML. The reasons are:

  • It would save disk space (I estimate a reduction of about 30% of the total size)
  • It would make it easier to integrate third-party ebook sources that don't (easily) provide an HTML version of their content

Currently I see two big reasons to keep the HTML versions:
1 - The full-text search engine only applies to HTML
2 - The ability to view the content directly

These two points might be addressed by:
1 - Supporting full-text indexing of EPUBs (relatively easy), see openzim/libzim#289
2 - Providing readers for multiple platforms within the ZIM... maybe even a pure Web EPUB reader?

@kelson42
Contributor Author

kelson42 commented Nov 12, 2019

@eshellman This ticket might be of interest to you

@Popolechien

Sure, sounds good but what would the final output look like compared to what we have now?

@kelson42
Contributor Author

> Sure, sounds good but what would the final output look like compared to what we have now?

@Popolechien The same, but without the book in HTML directly usable from the browser. Instead, we would have an info page explaining how to read the EPUB file from browsers, mobile devices, computers, etc.

@Popolechien

@kelson42 Well, if the idea is to save some space, how about offering either Gutenberg epubs or Gutenberg HTML and hope that people would know the difference? Much like we have Wikipedia with or without images, in a way.

@kelson42
Contributor Author

kelson42 commented Nov 12, 2019

> @kelson42 Well, if the idea is to save some space, how about offering either Gutenberg epubs or Gutenberg HTML and hope that people would know the difference? Much like we have Wikipedia with or without images, in a way.

@Popolechien This could be done, and in fact can already be done, but it is not the point of this ticket, which is about providing a better UX without HTML. But maybe you just want to say "I don't think we need that: we should provide one with HTML and one with EPUB and people can only have one or the other and live with that."

@Popolechien

> we should provide one with HTML and one with EPUB and people can only have one or the other and live with that.

Yes.

@kelson42
Contributor Author

@Popolechien To me this would be a fallback solution. But I believe we might be able to solve the problem properly.

We might be able to solve (2) in an even better manner by using a pure JavaScript EPUB reader, so that for the end user it would be a similar experience to having the HTML in the ZIM file. We could for example use https://github.com/futurepress/epub.js/

@eshellman
Collaborator

eshellman commented Nov 14, 2019 via email

@stale

stale bot commented Aug 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Aug 22, 2020
@kelson42 kelson42 added this to the 1.2.0 milestone Dec 17, 2022
@stale stale bot removed the stale label Dec 17, 2022
@kelson42
Contributor Author

Once #136 is implemented, we should be able to implement this ticket. The scraper would download the EPUB and parse it to extract the keywords for the search engine. Epub.js should be able to make the EPUB directly readable in the ZIM (to be tested).
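
As a rough illustration of the "download the EPUB, parse it, extract keywords" step, here is a minimal sketch using only the Python standard library. The function names `epub_text` and `keywords` are hypothetical and not part of any existing scraper; the frequency count is only a stand-in for whatever keyword selection the search engine would really need, and manifest hrefs are assumed not to be percent-encoded.

```python
import io
import posixpath
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter
from html.parser import HTMLParser

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def epub_text(epub_bytes: bytes) -> str:
    """Return the text of all spine documents, in reading order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml points to the package document (OPF).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        # The manifest maps ids to files; the spine gives the reading order.
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.findall(".//opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        parts = []
        for itemref in opf.findall(".//opf:spine/opf:itemref", OPF_NS):
            href = manifest[itemref.get("idref")]
            parser = _TextExtractor()
            parser.feed(zf.read(posixpath.join(base, href)).decode("utf-8"))
            parts.append(" ".join(parser.chunks))
    # Collapse all runs of whitespace into single spaces.
    return " ".join(" ".join(parts).split())


def keywords(text: str, limit: int = 10) -> list[str]:
    """Naive keyword pick: the most frequent words of four letters or more."""
    words = re.findall(r"[^\W\d_]{4,}", text.lower())
    return [word for word, _ in Counter(words).most_common(limit)]
```

The extracted text could then be handed to the indexer by the scraper, independently of how the EPUB itself is stored in the ZIM.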

@kelson42 kelson42 removed the question label Dec 20, 2022
@kelson42 kelson42 changed the title from "Consider a better support of ZIM files without HTML" to "Consider a better support of ZIM files without books in HTML" Jan 7, 2023
@kelson42 kelson42 pinned this issue Jan 25, 2023
@rgaudin
Member

rgaudin commented Jan 26, 2023

The most difficult part here is the one that's not been mentioned: the UI. With our generic UI that

What do entries look like? An HTML shell that displays epub.js at 100% size? Should it include a link/button to download/open the EPUB in case you have an external EPUB reader?
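
To make the question concrete, here is one possible shape for such a shell, sketched as a Python function a scraper could use to generate it. Everything about the layout is an assumption: the `assets` folder, the file names `reader.css`, `reader.js`, `jszip.min.js` and `epub.min.js`, and the function name `reader_shell` are all hypothetical. Only `ePub()`, `renderTo()` and `display()` are actual epub.js calls. The reader start-up code lives in a separate file so the page itself carries no inline script.

```python
from html import escape

# Contents of a separate reader.js file, kept out of the HTML so the page
# has no inline script. ePub(), renderTo() and display() are epub.js calls.
READER_JS = """\
var viewer = document.getElementById("viewer");
var book = ePub(viewer.getAttribute("data-epub"));
var rendition = book.renderTo("viewer", { width: "100%", height: "100%" });
rendition.display();
"""


def reader_shell(title: str, epub_href: str, assets: str = "assets") -> str:
    """Build the HTML entry that wraps one EPUB (all paths are hypothetical)."""
    t = escape(title)
    href = escape(epub_href, quote=True)
    a = escape(assets, quote=True)
    return f"""<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>{t}</title>
  <link rel="stylesheet" href="{a}/reader.css">
</head>
<body>
  <p><a href="{href}" download>Download the EPUB</a> to open it in another reader.</p>
  <div id="viewer" data-epub="{href}"></div>
  <script src="{a}/jszip.min.js"></script>
  <script src="{a}/epub.min.js"></script>
  <script src="{a}/reader.js"></script>
</body>
</html>
"""
```

The download link covers the external-reader case; sizing the `viewer` element to fill the page would be left to the (hypothetical) stylesheet.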

I believe the search topic deserves its own ticket. openzim/libzim#289 seems like the wrong solution to the problem. We don't want libzim to index EPUBs: if libzim does it, search results would point to the .epub entry and not to our epub.js shell… If we want to index the shell, then we need libzim NOT to index the .epub entries, otherwise we'd double the index size.

We'd need a scraper-level EPUB parser (and HTML, and PDF). Actually, we could already (when also including HTML) build the indexdata on the cover article and disable libzim's own indexing of the HTML book, so that search points to the cover and not to the HTML itself.
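
The routing described here can be modelled in a few lines. This is a plain-Python model, not the libzim writer API: `Entry`, `plan_book_entries`, `search` and the entry paths are all hypothetical, and the substring match only stands in for the real full-text index. The point it illustrates is that the book text is attached to the cover entry only, so each book is indexed once and hits resolve to the cover.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    path: str
    mimetype: str
    # Text handed to the indexer; None means "do not index this entry".
    index_text: Optional[str] = None


def plan_book_entries(book_id: str, title: str, book_text: str,
                      with_html: bool = False) -> list[Entry]:
    """Attach the book's text to the cover only, so it is indexed once."""
    entries = [
        Entry(f"{book_id}_cover.html", "text/html",
              index_text=f"{title} {book_text}"),
        Entry(f"{book_id}.epub", "application/epub+zip"),
    ]
    if with_html:
        # The HTML book is stored but deliberately left out of the index.
        entries.append(Entry(f"{book_id}.html", "text/html"))
    return entries


def search(entries: list[Entry], term: str) -> list[str]:
    """Toy lookup standing in for the real full-text search."""
    needle = term.lower()
    return [e.path for e in entries
            if e.index_text is not None and needle in e.index_text.lower()]
```

For example, `search(plan_book_entries("84", "Frankenstein", "It was on a dreary night", with_html=True), "dreary")` returns only the cover path, never the `.epub` or `.html` entry.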

Now one issue would be that books are very long and EPUBs (and PDFs) are paginated. If you're searching for an expression, is it acceptable to just link to the book cover? A WP article is a single page, so although it's cumbersome, you can easily ^F and find that text again.

In epub.js there is no search-in-book feature (yet??), so if you were looking not for a book but for an extract, it's going to be useless… and I believe finding books is not what the full-text index is about (the home page search probably does that better).

@kelson42 kelson42 modified the milestones: 2.0.0, 3.0.0, 2.2.0 Feb 26, 2023
@Jaifroid

I risk sounding like a broken record, but please remember users with older browsers and OSes, as well as those with restrictive CSPs. HTML is a universal way to access content that is supported everywhere (at least, static HTML). While it's fine if we can include a system in the ZIM to convert EPUB or PDF content to accessible (and searchable) HTML, we would need to be sure that such readers run under old browsers and restrictive CSPs. Otherwise you risk making ZIMs even more inaccessible than they already are. Even a modern Chrome extension can't access the current dynamic UI due to its use of inline JS (#145), and that is only going to get worse with the stricter CSPs in manifest v3 extensions (see kiwix/kiwix-js#755).

So, I agree with the caution expressed by @rgaudin, but for slightly different reasons.
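
As a sketch of how the inline-JS concern could be checked at scrape time, the snippet below scans an HTML page for the constructs that a CSP without `'unsafe-inline'` blocks: inline `<script>` elements, `on*` event-handler attributes and `javascript:` URLs. The class and function names are hypothetical, and the check is deliberately crude (for instance, it also flags inline `<script>` blocks that only hold data).

```python
from html.parser import HTMLParser


class InlineJSFinder(HTMLParser):
    """Records constructs that a CSP without 'unsafe-inline' would block."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        names = [name for name, _ in attrs]
        # A <script> element without src carries its code inline.
        if tag == "script" and "src" not in names:
            self.findings.append("inline <script>")
        for name, value in attrs:
            if name.startswith("on"):
                # Event-handler attributes such as onclick or onload.
                self.findings.append(f"{name} handler on <{tag}>")
            elif name in ("href", "src") and (
                (value or "").strip().lower().startswith("javascript:")
            ):
                self.findings.append(f"javascript: URL in {name} of <{tag}>")


def inline_js(html: str) -> list[str]:
    """Return a list of inline-JS findings for one HTML document."""
    finder = InlineJSFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

An empty result for every generated page would be a necessary (though not sufficient) condition for the ZIM to work under such a policy.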

@Jaifroid

I've just checked, and epub.js doesn't work in IE11. Yes, IE11 is now history, but it's still a good proxy for old browser support...

@benoit74
Collaborator

For those who do not yet know about it, integrating an EPUB reader and a PDF reader has already been done for the kolibri scraper.

There is even a download button for those who prefer to use another reader.

The other questions regarding the resulting UI and the creation of multiple ZIMs (all, epub_only, html_only, pdf_only) are still relevant.
