-
-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(server): library refresh go brrr #14456
base: main
Are you sure you want to change the base?
Conversation
0eb1440
to
80aa615
Compare
80aa615
to
8ecde3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice start! I think there are still a lot of untapped potential improvements here.
The update to |
Thanks for your comments @mertalev ! I'll first attempt to do the import path and exclusion pattern checks in SQL and then move to your suggestions |
d394654
to
8b2a48c
Compare
6d69307
to
c26f6aa
Compare
c26f6aa
to
a3be620
Compare
.where({ isOffline: false }) | ||
.andWhere( | ||
new Brackets((qb) => { | ||
qb.where('originalPath NOT SIMILAR TO :paths', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use LIKE
instead of SIMILAR TO
.
The exclusions and import paths are also specific to a particular library, right? So you need to specify the library in the query.
Also, can you generate SQL for this and confirm with EXPLAIN ANALYZE
that it uses an index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the library parameter, thanks for spotting.
What is the rationale for using LIKE instead of SIMILAR TO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Postgres devs rather dislike it. Queries that can use LIKE should use LIKE, queries that need regex should use the regex operator.
775b817
to
69b273d
Compare
Never thought of that, I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, this allows us to not do any file system calls except for the crawl. Metadata extraction will have to do the heavy lifting instead |
Would that mean you queue them for metadata extraction even if they're unchanged? You can test it but I think it'd be more overhead than the stat calls. Edit: also if you do this with the source set to |
I was referring to new imports, files that are new to immich. I hoped to improve the ingest performance by removing the stat call. After testing, there are two issues:
If we can mitigate the two issues above, I can rewrite the library import feature and do that in batches as well! |
I don't see why fileModifiedAt needs a non-null constraint in the DB. Might just be an oversight that didn't matter because it didn't affect our usage. I think you can change the asset entity and generate a migration to remove that constraint. For sidecar files, maybe you could add |
I might just put new Date() in at the moment to keep the PR somewhat constrained. Regarding sidecars, I have thought about that, problem right now is that we're batching the crawled files in batches of 10k. It might be hard to do get that working alright. Maybe I'll just queue a sidecar discovery for every imported asset for now |
I think queueing sidecar discovery would introduce a race condition where it could run before, during or after metadata extraction. Since the refresh logic is already so much better, maybe leave the import for another PR so we can think about it more. |
f980219
to
8944a32
Compare
8944a32
to
96f2f65
Compare
Would this matter, because won't sidecar discovery re-queue metadata extraction afterwards if it does discover a sidecar file? I don't recall 100% if this is the behavior |
I haven't looked closely at the sidecar discovery job either, but I think it's an issue either way since it's possible for jobs dependent on the metadata to behave differently. For example, a sidecar file that changes the orientation of the image won't be respected during thumbnail generation. |
bbf20ab
to
d800c70
Compare
d800c70
to
845b3f7
Compare
This PR significantly improves library scanning performance. Wherever suitable, we are doing jobs in batches, and many looped database interactions are replaced with SQL queries.
User testimonials
"@etnoy what on earth have you done. I tried your PR and it finished the scan for 1M assets in 37 seconds down from 728s on main. It takes 188s just to finish queuing on main" -- @mertalev
Benchmark 1
A library scan with 22k items where nothing has changed since the last scan used to take 1m 22s, now it's below 10 seconds, an improvement of 87 percent!
Benchmark 2
A clean library import with 19k items takes 1m40s in main and 7 seconds in this PR.
NOTE: this benchmark is only the library service scan and does not include the metadata extraction. Also, some fs calls have been migrated from the library service to the metadata service, although this should only have a minor impact on overall scan performance
Benchmark 3
Importing a library with >5M assets.
Highlights:
Bonus:
TODO
TODO PRs
We need to branch out the following PRs before finishing this one
Allow a user to remove offline assets from trash