Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert ebooks storage to the database #336

Closed
colagrosso opened this issue Feb 4, 2024 · 13 comments
Closed

Convert ebooks storage to the database #336

colagrosso opened this issue Feb 4, 2024 · 13 comments
Assignees

Comments

@colagrosso
Copy link
Collaborator

This is a large project that may require multiple issues, but we'll start here.

To help with the discussion, I have a draft PR at #335 where I hacked up a schema, script, methods, etc. for Ebooks and EbookTags. No hard feelings if we have to scrap that PR and start a different way.

Question:

  • Would you be willing to create a separate, long-running branch like we did for covers? I think that would save you review time because you could review the changes periodically, and there would be no chance of breaking something in master.

Some notes about draft PR #335:

  • For now the script just reuses the Ebook constructor. Eventually we will have to break up that constructor into separate methods. Some of the logic is needed for populating the DB tables, and some is needed for setting request-time properties.
  • I haven't modified Library::FilterEbooks(), /ebooks/index.php, or /ebooks/ebook.php yet. I thought I would populate more fields in the DB tables first.
  • We'll need some other tables for Contributors, Git commits, Collections, etc.

The more I look, the more work I see to do, but we'll chip away at it. One nice thing is that the ebook webpages are all read-only (unlike the artwork site).

@colagrosso colagrosso self-assigned this Feb 4, 2024
@colagrosso
Copy link
Collaborator Author

I forgot to mention above, but I ran the script in #335 over a hundred or so ebooks, and it picked up the title, URL, and tags just fine. You can skip rebuilding the books and just update the database via:

for BOOK in $(find /standardebooks.org/ebooks -maxdepth 1 -type d)
do
  tsp nice /standardebooks.org/web/scripts/deploy-ebook-to-www -v --no-build --no-images --no-epubcheck --no-recompose --no-feeds --no-bulk-downloads --update-ebook-database $BOOK
done

@acabal
Copy link
Member

acabal commented Feb 5, 2024

For this project, I don't think we need one very large branch. Since we can do this piecemeal and release these updates as we make them, while still leaving the existing filesystem-based system in place to fill in the blanks, we can do each piece of work as its own PR.

@acabal
Copy link
Member

acabal commented Feb 5, 2024

I think we will probably drop Ebook::__construct() entirely in favor of several ::From* methods, like Ebook::FromFilesystem($path), Ebook::Get($ebookId), etc. much like Artwork.

colagrosso added a commit to colagrosso/se-web that referenced this issue May 29, 2024
acabal pushed a commit that referenced this issue May 29, 2024
colagrosso added a commit that referenced this issue Jun 26, 2024
colagrosso added a commit that referenced this issue Jun 30, 2024
colagrosso added a commit that referenced this issue Jul 9, 2024
colagrosso added a commit that referenced this issue Aug 23, 2024
colagrosso added a commit that referenced this issue Sep 5, 2024
colagrosso added a commit that referenced this issue Sep 17, 2024
colagrosso added a commit that referenced this issue Sep 23, 2024
@colagrosso
Copy link
Collaborator Author

Alex, the db-rewrite branch is ready for you to test drive.

The db-rewrite branch will merge cleanly with master, and you can preview the changes here: https://github.com/standardebooks/web/compare/db-rewrite?expand=1

If you have a working site from the master branch, you'll need to add these tables:

config/sql/se/CollectionEbooks.sql  
config/sql/se/Collections.sql  
config/sql/se/Contributors.sql  
config/sql/se/EbookLocSubjects.sql  
config/sql/se/EbookSources.sql  
config/sql/se/EbookTags.sql  
config/sql/se/Ebooks.sql  
config/sql/se/GitCommits.sql  
config/sql/se/LocSubjects.sql  
config/sql/se/TocEntries.sql  

And you'll need to use these ALTER TABLE commands to modify the Tags table in place:

ALTER TABLE `Tags` ADD COLUMN `UrlName` varchar(255) NULL AFTER `Name`;
ALTER TABLE `Tags` ADD COLUMN `Type` enum('artwork', 'ebook') DEFAULT 'artwork' AFTER `UrlName`;
CREATE INDEX `index2` ON `Tags` (`Type`);
CREATE INDEX `index3` ON `Tags` (`UrlName`);

You can test that a single ebook gets added to the DB with:

/standardebooks.org/web/scripts/deploy-ebook-to-www --verbose david-lindsay_a-voyage-to-arcturus.git

If all the ebooks have already been built, you can bulk update/insert them in the DB with something like this:

for BOOK in $(find /standardebooks.org/ebooks -maxdepth 1 -type d)
do
  tsp nice /standardebooks.org/web/scripts/deploy-ebook-to-www --verbose --no-build --no-images --no-recompose --no-epubcheck --no-feeds --no-bulk-downloads "$BOOK"
done

which takes less than a minute on my machine to loop over the 1000+ books.

Two other comments about search and performance:

1. Search

There are some differences in how search will perform when the site is backed by the DB. Here are three I've noticed so far:

Full word vs. word fragments

Searching for "ovid" on the production site

https://standardebooks.org/ebooks?query=ovid

returns six results, but the DB search branch will return only these two:

(Ovid is the author) https://standardebooks.org/ebooks/ovid/metamorphoses/john-dryden_joseph-addison_laurence-eusden_arthur-maynwaring_samuel-croxall_nahum-tate_william-stonestreet_thomas-vernon_john-gay_alexander-pope_stephen-harvey_william-congreve_et-al

(Ovid is mentioned in the ToC (link))
https://standardebooks.org/ebooks/phillis-wheatley/poems-on-various-subjects-religious-and-moral

The four other matches on the production site come from matching "providence".

Similar story for searching for "grant" on the production site

https://standardebooks.org/ebooks?query=grant

which returns ten results, but the DB search branch will return only these two:

(grant in the ToC (link)) https://standardebooks.org/ebooks/ambrose-bierce/poetry

(grant in title and author) https://standardebooks.org/ebooks/ulysses-s-grant/personal-memoirs-of-ulysses-s-grant

The other matches on the production site come from ToC entries, tags, or LoC subjects with these words:

  • immigrants
  • vagrant

The way the `FULLTEXT` index is built, it's not possible to match providence when searching for ovid. Same with immigrants or vagrant when searching for grant. Moreover, I'd argue those aren't useful matches.

Stopwords

Stopwords searching for stopwords like the, of, in, etc. will be different. These return hundreds of results on the production site, but they will return zero when backed by DB search.

Author sorting with Unicode characters

On the production site when sorting by author name, Thomas à Kempis and Karel Čapek are at the end:

https://standardebooks.org/ebooks?page=22&per-page=48&sort=author-alpha

When MySQL sorts by author name, the book by Thomas à Kempis is now first, and the books by Karel Čapek are sorted between these two authors:

John W. Campbell
Karel Čapek
Thomas Carlyle

I believe this is a slight improvement.

2. Performance

Performance on my test droplet seemed equivalent to the production site, but I didn't run a full load test. Let me know if I should once we fix other feedback you have about correctness. I had to increase the RAM on the droplet to 1GB to run the build suite and MySQL at the same time, so it was on the second smallest droplet from DigitalOcean.

There are a fair number of queries required to render www/ebooks/index.php and www/ebooks/ebook.php. They are all fast, but it's a long list. To show an ebook's thumbnail, the latest Git commit is required for the image URL, and the full title with collaborators is needed for the image's alt text. Those are simple, indexed queries, so maybe performance will be OK. I know it's a risk to optimize prematurely. It's just that both of those two pages—the index/search results page plus the single book page with the carousel at the bottom—show a fair number of thumbnails.

@robinwhittleton
Copy link
Member

Sorting and text fragment improvements sound great. Is there any improvement on searching for Poe? At the moment you get pages of poetry and no E. A.

@colagrosso
Copy link
Collaborator Author

colagrosso commented Sep 23, 2024

Is there any improvement on searching for Poe? At the moment you get pages of poetry and no E. A.

Yes, thanks for asking! It now returns just Poe's work:

Screenshot_2024-09-23_08-05-47

Poe was one of the test cases I used for this commit: c3bc46a (PR #394)

@acabal
Copy link
Member

acabal commented Oct 2, 2024

Great Mike, I'm going to review this in the coming week.

The first question is in /scripts/update-ebook-database. At the end it checks if there are differences between the ebook based on the filesystem, and the ebook based on the DB representation.

But, we call Ebook::CreateOrUpdate() before calling Ebook::GetByIdentifier(). Doesn't this mean that the ebook will always match, since we just saved it to the DB, and then we retrieved the row we just saved?

Secondly, why do we care if there are differences? The script emits an error if there are differences - but when does this matter?

Thirdly, the script emits an error after already having made changes. This is surprising - I would expect that when an error occurs, we stop processing and no changes are saved, instead of saving changes and then letting the user know that something went wrong.

@acabal
Copy link
Member

acabal commented Oct 2, 2024

Can you also open a draft PR for this entire project, so I can add some inline comments?

@colagrosso
Copy link
Collaborator Author

Can you also open a draft PR for this entire project, so I can add some inline comments?

Yep, you got it: #401

I'll respond to your question about /scripts/update-ebook-database here in a moment if that's OK and not too confusing to have that discussion here and inline comments in the PR.

@colagrosso
Copy link
Collaborator Author

One preamble before I answer your specific questions about /scripts/update-ebook-database: We're moving from a site where the source of truth was the filesystem to one where it's the DB, and we want to make sure the user would see the same thing either way. We now have two ways to create Ebook objects: Ebook::FromFilesystem() and Ebook:GetByIdentifier(). If there is a bug, it would be nice to know before a user reports a problem.

But, we call Ebook::CreateOrUpdate() before calling Ebook::GetByIdentifier(). Doesn't this mean that the ebook will always match, since we just saved it to the DB, and then we retrieved the row we just saved?

If there are no bugs, the two Ebook objects will always match. If there was a bug in writing data to the DB or an any Accessor methods like GetGitCommits(), GetTags(), etc., there will be differences.

During development, there were several times that the check found errors due to mistakes I made, so I found it useful. For example, I would miss a field or write data in the wrong order.

Secondly, why do we care if there are differences? The script emits an error if there are differences - but when does this matter?

Knowing that there are no differences gives me the confidence to say the site can switch to the DB as the source of truth and users won't notice a difference.

Thirdly, the script emits an error after already having made changes. This is surprising - I would expect that when an error occurs, we stop processing and no changes are saved, instead of saving changes and then letting the user know that something went wrong.

That's a fair criticism. My approach was to answer the question: "Does reading an Ebook from the filesystem match reading an Ebook from the DB?" In order to answer that question, the Ebook has to be written to the DB. Maybe there's another approach.

I can tell by your questions that you're considering removing the check from the script. If so, that's fine with me. No hard feelings. Maybe you'd rather have a second script, e.g., compare-filesystem-and-database, that could be run periodically to check for differences. But it seems more useful to know right when a problem crops up, and I'm sure there will be bugs in the future.

@acabal
Copy link
Member

acabal commented Oct 2, 2024

OK, so it sounds like it's for debugging, which is fine. We can keep it for now as a transitional debugging tool but once we're fully on the DB we can remove it.

I think that if Ebook::CreateOrUpdate() fails, it should throw an exception. This method can fail many different ways, but we know all of the many failure cases (property not filled, failed getting commit, whatever) and the user should be alerted at the time of failure, not after it has already commited data to the DB. Checking after the fact with a complex diff function is not the robust way to handle unexpected input - by doing this we are admitting that our own function is not robust enough to trust.

It already does throw a ValidationException which is uncaught by this script. So a failure in filling out required fields should result in that exception being thrown and then we display it nicely for the user before aborting. I pointed out a few places in the code where items where not fully validated, so maybe fixing those will make this check not necessary. Remember when validating we must check that a property isset() and (typically) not empty because those are all failures of validation.

@colagrosso
Copy link
Collaborator Author

colagrosso commented Oct 5, 2024

So a failure in filling out required fields should result in that exception being thrown and then we display it nicely for the user before aborting.

I updated scripts/update-ebook-database to take a whack at this. I deliberately introduced some errors to Identifier and RepoFilesystemPath, and this is what the script printed:

(Edit: Added the class names when printing validation exceptions.)

$ nice /standardebooks.org/web/scripts/deploy-ebook-to-www --verbose --no-build --no-images --no-recompose --no-epubcheck --no-feeds --no-bulk-downloads virginia-woolf_to-the-lighthouse.git
Entering /standardebooks.org/ebooks/virginia-woolf_to-the-lighthouse.git
Updating/inserting ebook in database ... 
ValidationExceptions when updating the ebook database:
        Exceptions\EbookIdentifierRequiredException: Ebook Identifier required.
        Exceptions\InvalidEbookRepoFilesystemPathException: Invalid RepoFilesystemPath. Not readable: garbage-path
[ERROR] Found 2 errors

colagrosso added a commit that referenced this issue Oct 23, 2024
colagrosso added a commit that referenced this issue Oct 31, 2024
acabal pushed a commit that referenced this issue Nov 4, 2024
@acabal
Copy link
Member

acabal commented Nov 8, 2024

It's been live for a few days and everything looks good. Great work Mike!

@acabal acabal closed this as completed Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants