Convert ebooks storage to the database #336

colagrosso · 2024-02-04T05:11:09Z

This is a large project that may require multiple issues, but we'll start here.

To help with the discussion, I have a draft PR at #335 where I hacked up a schema, script, methods, etc. for Ebooks and EbookTags. No hard feelings if we have to scrap that PR and start a different way.

Question:

Would you be willing to create a separate, long-running branch like we did for covers? I think that would save you review time because you could review the changes periodically, and there would be no chance of breaking something in master.

Some notes about draft PR #335:

For now the script just reuses the Ebook constructor. Eventually we will have to break up that constructor into separate methods. Some of the logic is needed for populating the DB tables, and some is needed for setting request-time properties.
I haven't modified Library::FilterEbooks(), /ebooks/index.php, or /ebooks/ebook.php yet. I thought I would populate more fields in the DB tables first.
We'll need some other tables for Contributors, Git commits, Collections, etc.

The more I look, the more work I see to do, but we'll chip away at it. One nice thing is that the ebook webpages are all read-only (unlike the artwork site).

The text was updated successfully, but these errors were encountered:

colagrosso · 2024-02-04T06:14:43Z

I forgot to mention above, but I ran the script in #335 over a hundred or so ebooks, and it picked up the title, URL, and tags just fine. You can skip rebuilding the books and just update the database via:

for BOOK in $(find /standardebooks.org/ebooks -maxdepth 1 -type d)
do
  tsp nice /standardebooks.org/web/scripts/deploy-ebook-to-www -v --no-build --no-images --no-epubcheck --no-recompose --no-feeds --no-bulk-downloads --update-ebook-database $BOOK
done

acabal · 2024-02-05T19:54:15Z

For this project, I don't think we need one very large branch. Since we can do this piecemeal and release these updates as we make them, while still leaving the existing filesystem-based system in place to fill in the blanks, we can do each piece of work as its own PR.

acabal · 2024-02-05T20:01:11Z

I think we will probably drop Ebook::__construct() entirely in favor of several ::From* methods, like Ebook::FromFilesystem($path), Ebook::Get($ebookId), etc. much like Artwork.

colagrosso · 2024-09-23T00:38:06Z

Alex, the db-rewrite branch is ready for you to test drive.

The db-rewrite branch will merge cleanly with master, and you can preview the changes here: https://github.com/standardebooks/web/compare/db-rewrite?expand=1

If you have a working site from the master branch, you'll need to add these tables:

config/sql/se/CollectionEbooks.sql  
config/sql/se/Collections.sql  
config/sql/se/Contributors.sql  
config/sql/se/EbookLocSubjects.sql  
config/sql/se/EbookSources.sql  
config/sql/se/EbookTags.sql  
config/sql/se/Ebooks.sql  
config/sql/se/GitCommits.sql  
config/sql/se/LocSubjects.sql  
config/sql/se/TocEntries.sql

And you'll need to use these ALTER TABLE commands to modify the Tags table in place:

ALTER TABLE `Tags` ADD COLUMN `UrlName` varchar(255) NULL AFTER `Name`;
ALTER TABLE `Tags` ADD COLUMN `Type` enum('artwork', 'ebook') DEFAULT 'artwork' AFTER `UrlName`;
CREATE INDEX `index2` ON `Tags` (`Type`);
CREATE INDEX `index3` ON `Tags` (`UrlName`);

You can test that a single ebook gets added to the DB with:

/standardebooks.org/web/scripts/deploy-ebook-to-www --verbose david-lindsay_a-voyage-to-arcturus.git

If all the ebooks have already been built, you can bulk update/insert them in the DB with something like this:

for BOOK in $(find /standardebooks.org/ebooks -maxdepth 1 -type d)
do
  tsp nice /standardebooks.org/web/scripts/deploy-ebook-to-www --verbose --no-build --no-images --no-recompose --no-epubcheck --no-feeds --no-bulk-downloads "$BOOK"
done

which takes less than a minute on my machine to loop over the 1000+ books.

Two other comments about search and performance:

1. Search

There are some differences in how search will perform when the site is backed by the DB. Here are three I've noticed so far:

Full word vs. word fragments

Searching for "ovid" on the production site

https://standardebooks.org/ebooks?query=ovid

returns six results, but the DB search branch will return only these two:

(Ovid is the author) https://standardebooks.org/ebooks/ovid/metamorphoses/john-dryden_joseph-addison_laurence-eusden_arthur-maynwaring_samuel-croxall_nahum-tate_william-stonestreet_thomas-vernon_john-gay_alexander-pope_stephen-harvey_william-congreve_et-al

(Ovid is mentioned in the ToC (link))
https://standardebooks.org/ebooks/phillis-wheatley/poems-on-various-subjects-religious-and-moral

The four other matches on the production site come from matching "providence".

Similar story for searching for "grant" on the production site

https://standardebooks.org/ebooks?query=grant

which returns ten results, but the DB search branch will return only these two:

(grant in the ToC (link)) https://standardebooks.org/ebooks/ambrose-bierce/poetry

(grant in title and author) https://standardebooks.org/ebooks/ulysses-s-grant/personal-memoirs-of-ulysses-s-grant

The other matches on the production site come from ToC entries, tags, or LoC subjects with these words:

immigrants
vagrant

The way the `FULLTEXT` index is built, it's not possible to match providence when searching for ovid. Same with immigrants or vagrant when searching for grant. Moreover, I'd argue those aren't useful matches.

Stopwords

Stopwords searching for stopwords like the, of, in, etc. will be different. These return hundreds of results on the production site, but they will return zero when backed by DB search.

Author sorting with Unicode characters

On the production site when sorting by author name, Thomas à Kempis and Karel Čapek are at the end:

https://standardebooks.org/ebooks?page=22&per-page=48&sort=author-alpha

When MySQL sorts by author name, the book by Thomas à Kempis is now first, and the books by Karel Čapek are sorted between these two authors:

John W. Campbell
Karel Čapek
Thomas Carlyle

I believe this is a slight improvement.

2. Performance

Performance on my test droplet seemed equivalent to the production site, but I didn't run a full load test. Let me know if I should once we fix other feedback you have about correctness. I had to increase the RAM on the droplet to 1GB to run the build suite and MySQL at the same time, so it was on the second smallest droplet from DigitalOcean.

There are a fair number of queries required to render www/ebooks/index.php and www/ebooks/ebook.php. They are all fast, but it's a long list. To show an ebook's thumbnail, the latest Git commit is required for the image URL, and the full title with collaborators is needed for the image's alt text. Those are simple, indexed queries, so maybe performance will be OK. I know it's a risk to optimize prematurely. It's just that both of those two pages—the index/search results page plus the single book page with the carousel at the bottom—show a fair number of thumbnails.

robinwhittleton · 2024-09-23T05:05:41Z

Sorting and text fragment improvements sound great. Is there any improvement on searching for Poe? At the moment you get pages of poetry and no E. A.

colagrosso · 2024-09-23T14:10:11Z

Is there any improvement on searching for Poe? At the moment you get pages of poetry and no E. A.

Yes, thanks for asking! It now returns just Poe's work:

Poe was one of the test cases I used for this commit: c3bc46a (PR #394)

acabal · 2024-10-02T20:51:48Z

Great Mike, I'm going to review this in the coming week.

The first question is in /scripts/update-ebook-database. At the end it checks if there are differences between the ebook based on the filesystem, and the ebook based on the DB representation.

But, we call Ebook::CreateOrUpdate() before calling Ebook::GetByIdentifier(). Doesn't this mean that the ebook will always match, since we just saved it to the DB, and then we retrieved the row we just saved?

Secondly, why do we care if there are differences? The script emits an error if there are differences - but when does this matter?

Thirdly, the script emits an error after already having made changes. This is surprising - I would expect that when an error occurs, we stop processing and no changes are saved, instead of saving changes and then letting the user know that something went wrong.

acabal · 2024-10-02T20:54:38Z

Can you also open a draft PR for this entire project, so I can add some inline comments?

colagrosso · 2024-10-02T21:40:40Z

Can you also open a draft PR for this entire project, so I can add some inline comments?

Yep, you got it: #401

I'll respond to your question about /scripts/update-ebook-database here in a moment if that's OK and not too confusing to have that discussion here and inline comments in the PR.

colagrosso · 2024-10-02T22:32:57Z

One preamble before I answer your specific questions about /scripts/update-ebook-database: We're moving from a site where the source of truth was the filesystem to one where it's the DB, and we want to make sure the user would see the same thing either way. We now have two ways to create Ebook objects: Ebook::FromFilesystem() and Ebook:GetByIdentifier(). If there is a bug, it would be nice to know before a user reports a problem.

But, we call Ebook::CreateOrUpdate() before calling Ebook::GetByIdentifier(). Doesn't this mean that the ebook will always match, since we just saved it to the DB, and then we retrieved the row we just saved?

If there are no bugs, the two Ebook objects will always match. If there was a bug in writing data to the DB or an any Accessor methods like GetGitCommits(), GetTags(), etc., there will be differences.

During development, there were several times that the check found errors due to mistakes I made, so I found it useful. For example, I would miss a field or write data in the wrong order.

Secondly, why do we care if there are differences? The script emits an error if there are differences - but when does this matter?

Knowing that there are no differences gives me the confidence to say the site can switch to the DB as the source of truth and users won't notice a difference.

Thirdly, the script emits an error after already having made changes. This is surprising - I would expect that when an error occurs, we stop processing and no changes are saved, instead of saving changes and then letting the user know that something went wrong.

That's a fair criticism. My approach was to answer the question: "Does reading an Ebook from the filesystem match reading an Ebook from the DB?" In order to answer that question, the Ebook has to be written to the DB. Maybe there's another approach.

I can tell by your questions that you're considering removing the check from the script. If so, that's fine with me. No hard feelings. Maybe you'd rather have a second script, e.g., compare-filesystem-and-database, that could be run periodically to check for differences. But it seems more useful to know right when a problem crops up, and I'm sure there will be bugs in the future.

acabal · 2024-10-02T22:39:01Z

OK, so it sounds like it's for debugging, which is fine. We can keep it for now as a transitional debugging tool but once we're fully on the DB we can remove it.

I think that if Ebook::CreateOrUpdate() fails, it should throw an exception. This method can fail many different ways, but we know all of the many failure cases (property not filled, failed getting commit, whatever) and the user should be alerted at the time of failure, not after it has already commited data to the DB. Checking after the fact with a complex diff function is not the robust way to handle unexpected input - by doing this we are admitting that our own function is not robust enough to trust.

It already does throw a ValidationException which is uncaught by this script. So a failure in filling out required fields should result in that exception being thrown and then we display it nicely for the user before aborting. I pointed out a few places in the code where items where not fully validated, so maybe fixing those will make this check not necessary. Remember when validating we must check that a property isset() and (typically) not empty because those are all failures of validation.

colagrosso · 2024-10-05T07:17:37Z

So a failure in filling out required fields should result in that exception being thrown and then we display it nicely for the user before aborting.

I updated scripts/update-ebook-database to take a whack at this. I deliberately introduced some errors to Identifier and RepoFilesystemPath, and this is what the script printed:

(Edit: Added the class names when printing validation exceptions.)

$ nice /standardebooks.org/web/scripts/deploy-ebook-to-www --verbose --no-build --no-images --no-recompose --no-epubcheck --no-feeds --no-bulk-downloads virginia-woolf_to-the-lighthouse.git
Entering /standardebooks.org/ebooks/virginia-woolf_to-the-lighthouse.git
Updating/inserting ebook in database ... 
ValidationExceptions when updating the ebook database:
        Exceptions\EbookIdentifierRequiredException: Ebook Identifier required.
        Exceptions\InvalidEbookRepoFilesystemPathException: Invalid RepoFilesystemPath. Not readable: garbage-path
[ERROR] Found 2 errors

acabal · 2024-11-08T02:18:26Z

It's been live for a few days and everything looks good. Great work Mike!

colagrosso self-assigned this Feb 4, 2024

colagrosso mentioned this issue Feb 6, 2024

Initial Ebook DB schema, update script, and Ebook methods #335

Closed

colagrosso mentioned this issue May 24, 2024

Complete the DB write path #366

Merged

colagrosso added a commit to colagrosso/se-web that referenced this issue May 29, 2024

Correct errors found in standardebooks#336

3d08725

acabal pushed a commit that referenced this issue May 29, 2024

Correct errors found in #336

2061dc6

colagrosso added a commit that referenced this issue Jun 26, 2024

Correct errors found in #336

4a0db7a

colagrosso added a commit that referenced this issue Jun 30, 2024

Correct errors found in #336

91a377a

colagrosso added a commit that referenced this issue Jul 9, 2024

Correct errors found in #336

b40808c

colagrosso added a commit that referenced this issue Aug 23, 2024

Correct errors found in #336

d4ea916

colagrosso added a commit that referenced this issue Sep 5, 2024

Correct errors found in #336

78b61f0

colagrosso added a commit that referenced this issue Sep 17, 2024

Correct errors found in #336

15a0782

colagrosso added a commit that referenced this issue Sep 23, 2024

Correct errors found in #336

3643ff9

colagrosso mentioned this issue Oct 2, 2024

Merge db-rewrite into master #401

Merged

colagrosso added a commit that referenced this issue Oct 23, 2024

Correct errors found in #336

3e72195

colagrosso added a commit that referenced this issue Oct 31, 2024

Correct errors found in #336

e35a5ea

acabal pushed a commit that referenced this issue Nov 4, 2024

Correct errors found in #336

1589dda

acabal closed this as completed Nov 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert ebooks storage to the database #336

Convert ebooks storage to the database #336

colagrosso commented Feb 4, 2024

colagrosso commented Feb 4, 2024

acabal commented Feb 5, 2024

acabal commented Feb 5, 2024

colagrosso commented Sep 23, 2024

robinwhittleton commented Sep 23, 2024

colagrosso commented Sep 23, 2024 •

edited

Loading

acabal commented Oct 2, 2024

acabal commented Oct 2, 2024

colagrosso commented Oct 2, 2024

colagrosso commented Oct 2, 2024

acabal commented Oct 2, 2024 •

edited

Loading

colagrosso commented Oct 5, 2024 •

edited

Loading

acabal commented Nov 8, 2024

Convert ebooks storage to the database #336

Convert ebooks storage to the database #336

Comments

colagrosso commented Feb 4, 2024

colagrosso commented Feb 4, 2024

acabal commented Feb 5, 2024

acabal commented Feb 5, 2024

colagrosso commented Sep 23, 2024

1. Search

Full word vs. word fragments

Stopwords

Author sorting with Unicode characters

2. Performance

robinwhittleton commented Sep 23, 2024

colagrosso commented Sep 23, 2024 • edited Loading

acabal commented Oct 2, 2024

acabal commented Oct 2, 2024

colagrosso commented Oct 2, 2024

colagrosso commented Oct 2, 2024

acabal commented Oct 2, 2024 • edited Loading

colagrosso commented Oct 5, 2024 • edited Loading

acabal commented Nov 8, 2024

colagrosso commented Sep 23, 2024 •

edited

Loading

acabal commented Oct 2, 2024 •

edited

Loading

colagrosso commented Oct 5, 2024 •

edited

Loading