feat(server): library refresh go brrr #14456

etnoy · 2024-12-02T21:37:45Z

This PR significantly improves library scanning performance. Wherever suitable, we are doing jobs in batches, and many looped database interactions are replaced with SQL queries.

User testimonials
"@etnoy what on earth have you done. I tried your PR and it finished the scan for 1M assets in 37 seconds down from 728s on main. It takes 188s just to finish queuing on main" -- @mertalev

Benchmark 1
A library scan with 22k items where nothing has changed since the last scan used to take 1m 22s, now it's below 10 seconds, an improvement of 87 percent!

Benchmark 2
A clean library import with 19k items takes 1m40s in main and 7 seconds in this PR.
NOTE: this benchmark is only the library service scan and does not include the metadata extraction. Also, some fs calls have been migrated from the library service to the metadata service, although this should only have a minor impact on overall scan performance

Benchmark 3
Importing a library with >5M assets.

Time to 1M imported (without metadata extraction): 6m50s

Highlights:

File paths crawled on disk are compared in sql to discard already-imported files
Modified files are scanned in batches, then a single db call updates all of them
Missing files are identified in batches, then a single db call marks all of them as offline
Import paths and exclusion patterns are matched against library assets in a single sql query
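
To illustrate the batching idea in the highlights above, here is a minimal sketch, not the PR's actual code. The helper names (`chunk`, `filterNewPaths`, `findNewPaths`) are hypothetical, and an in-memory `Set` stands in for the set-based SQL comparison, so the example runs without a database.

```typescript
/** Split a list into consecutive batches of at most `size` items. */
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/**
 * Stand-in for one set-based query per batch (assumption: the real
 * comparison happens in SQL). Returns the paths that are not yet imported.
 */
function filterNewPaths(batch: readonly string[], imported: ReadonlySet<string>): string[] {
  return batch.filter((path) => !imported.has(path));
}

/** Crawl result -> new paths, with one lookup per batch instead of one per file. */
function findNewPaths(
  crawled: readonly string[],
  imported: ReadonlySet<string>,
  batchSize: number,
): string[] {
  const fresh: string[] = [];
  for (const batch of chunk(crawled, batchSize)) {
    fresh.push(...filterNewPaths(batch, imported));
  }
  return fresh;
}
```

The point of the design is that the number of round trips grows with the number of batches, not with the number of files.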

Bonus:

Greatly improved log messages related to library scans
More e2e tests for handling when offline files go back online, leading to one major bug fixed

TODO

More e2e tests for the in-db exclusion pattern matching
Test the library watcher

TODO PRs
We need to branch out the following PRs before finishing this one

Don't delete offline assets from disk when emptying trash: fix(server): don't delete offline files from disk when trash empties #14777
Better sidecar management in external libraries: feat(server): Handle sidecars in external libraries #14800
~~Allow a user to remove offline assets from trash~~

mertalev

Nice start! I think there are still a lot of untapped potential improvements here.

server/src/services/library.service.ts

mertalev · 2024-12-04T18:08:32Z

The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.
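
A rough sketch of why this simplifies the batched update: when every asset in the batch receives the same values, one parameterized statement can cover all of them. The table name `assets`, the helper name `buildMarkOfflineQuery`, and the query shape are assumptions for illustration; only the column names `isOffline` and `deletedAt` come from the discussion.

```typescript
interface BatchUpdate {
  text: string;
  values: unknown[];
}

/**
 * Build a single UPDATE for a whole batch. Because no value is unique to an
 * asset, the statement needs just two parameters: the shared timestamp and
 * the list of ids. (Hypothetical table name and query shape.)
 */
function buildMarkOfflineQuery(assetIds: readonly string[], now: Date): BatchUpdate {
  return {
    text: 'UPDATE assets SET "isOffline" = true, "deletedAt" = $1 WHERE id = ANY($2)',
    values: [now, [...assetIds]],
  };
}
```

If each row needed its own fileModifiedAt or originalFileName, the statement would instead need one value tuple per asset, which is the complexity the comment suggests avoiding.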

etnoy · 2024-12-08T22:20:59Z

Thanks for your comments @mertalev ! I'll first attempt to do the import path and exclusion pattern checks in SQL and then move to your suggestions

mertalev · 2024-12-10T21:05:51Z

server/src/repositories/asset.repository.ts

+      .where({ isOffline: false })
+      .andWhere(
+        new Brackets((qb) => {
+          qb.where('originalPath NOT SIMILAR TO :paths', {


Use LIKE instead of SIMILAR TO.

The exclusions and import paths are also specific to a particular library, right? So you need to specify the library in the query.

Also, can you generate SQL for this and confirm with EXPLAIN ANALYZE that it uses an index?

etnoy · 2024-12-18

Added the library parameter, thanks for spotting.

What is the rationale for using LIKE instead of SIMILAR TO?

mertalev · 2024-12-19

The Postgres devs rather dislike SIMILAR TO. Queries that can use LIKE should use LIKE; queries that need regex should use the regex operator.
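
As a hedged illustration of the LIKE approach, an import path can be turned into a prefix pattern once the LIKE wildcards in it are escaped. The function names below are hypothetical and this is not the query the PR ended up with; it only relies on backslash being the default LIKE escape character in Postgres.

```typescript
/** Escape the characters that LIKE treats specially: backslash, % and _. */
function escapeLike(literal: string): string {
  return literal.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

/**
 * Turn an import path into a prefix pattern that matches everything below
 * that directory. The trailing slash keeps "/mnt/photos" from also matching
 * "/mnt/photos-old/...".
 */
function importPathToLikePattern(importPath: string): string {
  const withSlash = importPath.endsWith('/') ? importPath : `${importPath}/`;
  return `${escapeLike(withSlash)}%`;
}
```

The resulting string would be passed as a bound parameter, for example in a condition like `"originalPath" LIKE $1`, rather than concatenated into the SQL text.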

server/src/repositories/asset.repository.ts

server/src/services/library.service.ts

etnoy · 2024-12-12T21:10:28Z

> The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.

Never thought of that, I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, this allows us to not do any file system calls except for the crawl. Metadata extraction will have to do the heavy lifting instead

mertalev · 2024-12-12T21:26:49Z

> > The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.
>
> Never thought of that, I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, this allows us to not do any file system calls except for the crawl. Metadata extraction will have to do the heavy lifting instead

Would that mean you queue them for metadata extraction even if they're unchanged? You can test it but I think it'd be more overhead than the stat calls.

Edit: also if you do this with the source set to upload, it would definitely be worse because it would queue a bunch of other things after metadata extraction.

etnoy · 2024-12-12T21:42:08Z

> > > The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.
> >
> > Never thought of that, I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, this allows us to not do any file system calls except for the crawl. Metadata extraction will have to do the heavy lifting instead
>
> Would that mean you queue them for metadata extraction even if they're unchanged? You can test it but I think it'd be more overhead than the stat calls.
>
> Edit: also if you do this with the source set to upload, it would definitely be worse because it would queue a bunch of other things after metadata extraction.

I was referring to new imports, files that are new to immich. I hoped to improve the ingest performance by removing the stat call. After testing, there are two issues:

assetRepository.create requires mtime, which we can only get from stat. We could work around that by setting it to new Date(), but ideally it should be undefined
We still check for the existence of a sidecar, and this complicates things

If we can mitigate the two issues above, I can rewrite the library import feature and do that in batches as well!

mertalev · 2024-12-12T21:55:06Z

I don't see why fileModifiedAt needs a non-null constraint in the DB. Might just be an oversight that didn't matter because it didn't affect our usage. I think you can change the asset entity and generate a migration to remove that constraint.

For sidecar files, maybe you could add .xmp to the glob filter and enable the option to make the files come in sorted order? That way you could make sure they're in the same batch.
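
A minimal sketch of that idea, under stated assumptions: sidecars are named by appending `.xmp` to the full asset file name (for example `photo.jpg.xmp`), and the hypothetical helper `batchWithSidecars` is allowed to exceed the batch size by one entry so that a sidecar is never split from its asset. Sidecars named `photo.xmp` are not handled here.

```typescript
/**
 * Sort the crawled paths, then cut them into batches without separating an
 * asset from a sidecar that directly follows it in sorted order.
 */
function batchWithSidecars(paths: readonly string[], batchSize: number): string[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer');
  }
  // Default sort compares UTF-16 code units, so "a.jpg" sorts before "a.jpg.xmp".
  const sorted = [...paths].sort();
  const batches: string[][] = [];
  let current: string[] = [];

  for (const path of sorted) {
    const previous = current[current.length - 1];
    const isSidecarOfPrevious = previous !== undefined && path === `${previous}.xmp`;

    // Start a new batch only when the current one is full and the next
    // path is not the sidecar of the asset that was just added.
    if (current.length >= batchSize && !isSidecarOfPrevious) {
      batches.push(current);
      current = [];
    }
    current.push(path);
  }

  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

This relies on the sidecar sorting immediately after its asset, which holds in the common case but not when another file name sorts in between.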

etnoy · 2024-12-12T22:20:59Z

> I don't see why fileModifiedAt needs a non-null constraint in the DB. Might just be an oversight that didn't matter because it didn't affect our usage. I think you can change the asset entity and generate a migration to remove that constraint.
>
> For sidecar files, maybe you could add .xmp to the glob filter and enable the option to make the files come in sorted order? That way you could make sure they're in the same batch.

I might just put new Date() in for the moment to keep the PR somewhat constrained.

Regarding sidecars, I have thought about that. The problem right now is that we're batching the crawled files in batches of 10k, so it might be hard to get that working properly. Maybe I'll just queue a sidecar discovery for every imported asset for now.

mertalev · 2024-12-12T22:56:16Z

I think queueing sidecar discovery would introduce a race condition where it could run before, during or after metadata extraction. Since the refresh logic is already so much better, maybe leave the import for another PR so we can think about it more.

zackpollard · 2024-12-17T15:12:44Z

> I think queueing sidecar discovery would introduce a race condition where it could run before, during or after metadata extraction. Since the refresh logic is already so much better, maybe leave the import for another PR so we can think about it more.

Would this matter, because won't sidecar discovery re-queue metadata extraction afterwards if it does discover a sidecar file? I don't recall 100% if this is the behavior

mertalev · 2024-12-17T15:22:44Z

I haven't looked closely at the sidecar discovery job either, but I think it's an issue either way since it's possible for jobs dependent on the metadata to behave differently. For example, a sidecar file that changes the orientation of the image won't be respected during thumbnail generation.

etnoy added the changelog:enhancement label Dec 2, 2024

github-actions bot added the 🗄️server label Dec 2, 2024

etnoy force-pushed the feat/inline-offline-check branch from 0eb1440 to 80aa615 Compare December 2, 2024 21:45

feat: run all offline checks in a single job

8ecde3b

etnoy force-pushed the feat/inline-offline-check branch from 80aa615 to 8ecde3b Compare December 2, 2024 21:46

mertalev reviewed Dec 4, 2024

etnoy force-pushed the feat/inline-offline-check branch 2 times, most recently from d394654 to 8b2a48c Compare December 9, 2024 21:34

Merge remote-tracking branch 'origin' into feat/inline-offline-check

02c5765

etnoy force-pushed the feat/inline-offline-check branch 3 times, most recently from 6d69307 to c26f6aa Compare December 10, 2024 16:41

etnoy added the changelog:bugfix label Dec 10, 2024

etnoy force-pushed the feat/inline-offline-check branch from c26f6aa to a3be620 Compare December 10, 2024 20:39

etnoy removed the changelog:bugfix label Dec 10, 2024

etnoy changed the title ~~feat(server): run all offline checks in a single job~~ feat(server): library refresh go brrr Dec 10, 2024

mertalev reviewed Dec 10, 2024

etnoy force-pushed the feat/inline-offline-check branch 5 times, most recently from 775b817 to 69b273d Compare December 12, 2024 20:59

etnoy force-pushed the feat/inline-offline-check branch from f980219 to 8944a32 Compare December 12, 2024 23:14

do it in sql, baby

96f2f65

etnoy force-pushed the feat/inline-offline-check branch from 8944a32 to 96f2f65 Compare December 12, 2024 23:16

etnoy added 3 commits December 13, 2024 01:48

wip batch imports

3d7b924

Merge remote-tracking branch 'origin' into feat/inline-offline-check

26ffde6

Merge remote-tracking branch 'origin' into feat/inline-offline-check

3deeaad

etnoy force-pushed the feat/inline-offline-check branch 2 times, most recently from bbf20ab to d800c70 Compare December 18, 2024 00:15

etnoy added 2 commits December 18, 2024 01:24

Merge branch 'feat/inline-offline-check' of https://github.com/immich…

745958e

…-app/immich into feat/inline-offline-check

Merge remote-tracking branch 'origin' into feat/inline-offline-check

845b3f7

etnoy force-pushed the feat/inline-offline-check branch from d800c70 to 845b3f7 Compare December 18, 2024 00:24

asset count instead of statistics

1df1b85

github-actions bot added the 🖥️web label Dec 18, 2024

etnoy mentioned this pull request Dec 19, 2024

feat(server): Handle sidecars in external libraries #14800

Merged
