[Enhancement] Speed up server import process #1597

Open
TheSeventhCode opened this issue Nov 24, 2022 · 12 comments
Labels
enhancement: New feature or request
Project for volunteers: The team has no plans to work on it (e.g. lack of time) but an external contribution is accepted

Comments

@TheSeventhCode

To give some context, I have a photo library of around 350k photos and videos.
Luckily, Lychee can import files via symlinks and skip duplicates, which makes it well suited to keeping these photos in just one location. The problem is the repeated importing. At least every two weeks, multiple new photos and videos are added to that library. To show them in Lychee, an import from the server has to be run, which skips duplicates and creates symlinks.

The problem is that checking for duplicates takes too long. With the current implementation, it takes hours upon hours just to go through everything and see whether it is already present and unchanged.

I'm not sure what exactly is done during the duplicate skip, but if it's just checking whether the file already exists, it seems quite inefficient.
If it does more, can that “more” be disabled? I don't change content that is already present, so the only thing that has to be checked for is new file paths.

I hope there is some solution to that; otherwise, it is difficult to use this tool for libraries of this size.

@ildyria
Member

ildyria commented Nov 28, 2022

> I'm not sure what everything is done during the duplicate skip, but if it's just checking if the file already exists, it seems to be quite inefficient in that.

The way we check for duplicates is straightforward: compute the hash of the picture, then search the DB for any collision. There is not much that can be done in that regard. :(

A lighter way would be to check the file name and verify that it does not already exist in the database.
But this carries a risk in the case of generic names like _R5_1234.jpg.

Proposition:

  • Add an option to check only against file names instead of computing the hash before checking for collisions. This setting should be disabled by default, and a CLI warning should be shown to the user.
    Do note that this optimization only helps if computing the hash takes the most time; the DB query might be the choke point...
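
The two strategies discussed above can be sketched roughly as follows. This is only an illustration in Python, not Lychee's actual code (Lychee is written in PHP): the `photos` table layout, the `title` and `checksum` column names, and the choice of SHA-1 are assumptions made for the example.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha1()  # assumption: the real hash algorithm may differ
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate_by_checksum(db: sqlite3.Connection, path: Path) -> bool:
    """Accurate but slow: the whole file is read before the query runs."""
    row = db.execute(
        "SELECT 1 FROM photos WHERE checksum = ? LIMIT 1",
        (file_checksum(path),),
    ).fetchone()
    return row is not None


def is_duplicate_by_name(db: sqlite3.Connection, path: Path) -> bool:
    """Fast but risky: generic names such as _R5_1234.jpg can collide."""
    row = db.execute(
        "SELECT 1 FROM photos WHERE title = ? LIMIT 1",
        (path.name,),
    ).fetchone()
    return row is not None
```

The name-only check never opens the file, which is where the saving would come from; the price is that two different pictures with the same name are treated as one.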

@ildyria added the “enhancement” (New feature or request) label Nov 28, 2022
@TheSeventhCode
Author

That could work. Skipping the hash function might shorten the import quite a bit.
But how exactly is the DB built? I'm sure some things there could be sped up as well. Even if a DB query is made for every image, it shouldn't take too long if the complete file path or something like that is the primary key or otherwise indexed, should it?

@ildyria
Member

ildyria commented Nov 30, 2022

Each image in Lychee is associated with a row in the photos table in the database.
On import, the checksum of the image is stored in the database (so that we can check for duplicates).
And obviously the checksum column in that table is indexed to speed up the search.

There are multiple things that can take time:

  1. computing the hash of the image.
  2. executing the SQL query.

To give you an idea, here is an illustration of the relative time required to access data in different parts of your computer:

Cache L1 - There is a sandwich in front of you.
Cache L2 - Walk to the kitchen and make a sandwich
RAM - Drive to the store, purchase sandwich fixings, drive home and make sandwich
HDD - Drive to the store. Purchase seeds. Grow seeds..... .... ... Harvest lettuce, wheat, etc. Make sandwich.

Even if those two steps seem relatively fast as single events, processing the images one by one will still be slow in the end.

Unless we use a different strategy for such processing, the only gains to be made are by optimizing the sequential process.
If PHP could be multi-threaded, it would be significantly easier, as we could parallelize over the list of images.
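
To illustrate what parallelism over the list of images could look like, here is a minimal Python sketch. It is not a proposal for Lychee's PHP code base; the helper names `sha1_of` and `hash_all` are hypothetical, and SHA-1 is an assumption. In CPython, `hashlib` releases the GIL while hashing larger buffers, so plain threads are enough for the hashing to overlap.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha1_of(path: Path) -> str:
    """Hash one file in 1 MiB chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_all(paths: list[Path], workers: int = 4) -> dict[Path, str]:
    """Hash many files concurrently; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(sha1_of, paths)))
```

Whether this helps in practice depends on the bottleneck: if the disk is already saturated by a single reader, extra workers will mostly wait on I/O.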

@TheSeventhCode
Author

Yeah, if multi-threading were available, I'm sure quite a few things could be sped up. How much do you think disk speed affects the import time?
For example, going from a normal SATA SSD to something like an M.2 drive.

In general, the different factors would be:

  • Disk speed (where the images are)
  • SQL queries
  • Hash function

I wonder now what would have the most impact.
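
One way to find out would be to time the three factors separately for a sample of files. The following is a rough Python sketch of such a measurement, not a benchmark of Lychee itself: the `photos` table with a `checksum` column and the use of SHA-1 are assumptions, and `time_import_steps` is a hypothetical helper.

```python
import hashlib
import sqlite3
import time
from pathlib import Path


def time_import_steps(path: Path, db: sqlite3.Connection) -> dict[str, float]:
    """Return the seconds spent reading, hashing, and querying for one file."""
    start = time.perf_counter()
    data = path.read_bytes()  # disk speed
    after_read = time.perf_counter()
    checksum = hashlib.sha1(data).hexdigest()  # hash function
    after_hash = time.perf_counter()
    db.execute(
        "SELECT 1 FROM photos WHERE checksum = ? LIMIT 1", (checksum,)
    ).fetchone()  # SQL query
    after_query = time.perf_counter()
    return {
        "read": after_read - start,
        "hash": after_hash - after_read,
        "query": after_query - after_hash,
    }
```

Keep in mind that the operating system caches recently read files, so a second run over the same files will make the disk look faster than it really is.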

@Babyforce

Babyforce commented Apr 7, 2023

Hello, I just wanted to add my 2 cents about this issue. I have a similar problem where syncing the files takes forever, not because of the number of files (which I am aware would increase the sync time no matter what) but because of the big files I host. I have quite a lot of files that are above 200MB each, and the server's CPU is just an Intel Celeron. Computing the hash of those big files takes an enormous amount of time, so much that I would consider it a waste of time and energy (in these hard times when power costs much more).

I would think just checking the names instead of the checksum would save a huge amount of time and energy (at the cost of Lychee no longer detecting real file changes, but my files are not subject to such changes, so I could afford that).

If accuracy is an issue, maybe saving the exact number of bytes per file in the database and comparing it to the real files would be more efficient and a bit more accurate? I would think it is extremely unlikely that a modified file would have the exact same number of bytes as the previous one (but maybe I'm wrong).
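
The size-comparison idea could look roughly like the Python sketch below. Everything here is hypothetical: Lychee has no `imported_files` table that I know of, and the names `needs_import` and `remember` are made up for the illustration. The point is only that `os.stat` reads metadata without touching the file contents.

```python
import os
import sqlite3

# Hypothetical bookkeeping table, not part of Lychee's schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS imported_files (
    path TEXT PRIMARY KEY,
    size_bytes INTEGER NOT NULL
)
"""


def needs_import(db: sqlite3.Connection, path: str) -> bool:
    """True if the path is unknown or its size differs from the stored one."""
    row = db.execute(
        "SELECT size_bytes FROM imported_files WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row[0] != os.stat(path).st_size


def remember(db: sqlite3.Connection, path: str) -> None:
    """Record the current size of a file after it has been imported."""
    db.execute(
        "INSERT OR REPLACE INTO imported_files (path, size_bytes) VALUES (?, ?)",
        (path, os.stat(path).st_size),
    )
```

An edit that leaves the byte count unchanged would go unnoticed, so this trades some accuracy for speed, as described above.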

I started working on a dirty bash script for this purpose (that uses the API), and sadly I do not know PHP well enough to even try adding this functionality myself. Not having such a feature is really blocking for me, as I cannot start using Lychee at all unless I decide to add albums by hand, which would be really tedious... Everything is hosted on a SATA SSD, so it should be fast enough.

@bushibot

Yeah, I'm having trouble importing large folders (a thousand or two images). Every time it hangs, or the session times out or whatever, it has to start over. It seems like it should be able to track what was last processed and only restart for new items, or for items added since the last run date.
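
The "items added since last run date" idea can be sketched in a few lines of Python. This is an illustration only, not an existing Lychee feature; `files_modified_since` is a hypothetical helper, and the caller is assumed to store the timestamp of the previous run somewhere.

```python
import os
from pathlib import Path
from typing import Iterator


def files_modified_since(root: Path, last_run: float) -> Iterator[Path]:
    """Yield files under root whose mtime is newer than last_run (epoch seconds)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            candidate = Path(dirpath) / name
            # Only metadata is read here; file contents are never opened.
            if candidate.stat().st_mtime > last_run:
                yield candidate
```

One caveat: files copied into the folder with their original timestamps preserved would look old and be skipped, so this cannot fully replace a real duplicate check.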

@ildyria
Member

ildyria commented Sep 25, 2023

@bushibot Your use case would be better served by using the command line: https://github.com/LycheeOrg/Lychee/blob/master/app/Console/Commands/Sync.php

@bushibot

bushibot commented Sep 25, 2023

> @bushibot Your use case would be better suited by using command line: https://github.com/LycheeOrg/Lychee/blob/master/app/Console/Commands/Sync.php

I’d like to understand more about how to do that. It doesn’t seem quite as straightforward as opening up the console and typing sync.php. Getting data in has already turned into a multi-day project and it’s only getting worse… Running the CLI in the background might help it be more robust 😝

@d7415
Contributor

d7415 commented Sep 25, 2023

> I’d like to understand more about how to do that?

This should hopefully cover it:
https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums

@bushibot

> > I’d like to understand more about how to do that?
>
> This should hopefully cover it: https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums

Cool, not sure how I missed that, but thanks.
That said, how would I use it with symlinks rather than copies? I see it in the script, I'm just not sure how to flag for it.

@bushibot

> > I’d like to understand more about how to do that?
>
> This should hopefully cover it: https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums

Or not so easy. I'm running on the Unraid Docker image. I opened a console to test but got back:

root@35539f6d2b63:/photodump# php artisan lychee:sync /Kittens
Could not open input file: artisan

@ildyria added the “Project for volunteers” (The team has no plans to work on it (e.g. lack of time) but an external contribution is accepted) label Dec 27, 2023
@x1ntt
Contributor

x1ntt commented Mar 14, 2024

> > > I’d like to understand more about how to do that?
>
> > This should hopefully cover it: https://lycheeorg.github.io/docs/faq_general.html#can-i-set-up-lychee-to-watch-a-folder-for-new-images-and-automatically-add-them-to-albums
>
> Or not so easy. I'm running on unraid docker image. I opened a console to test but got back
>
> root@35539f6d2b63:/photodump# php artisan lychee:sync /Kittens
> Could not open input file: artisan

To solve this problem, run the command from the directory that contains the 'artisan' file and pass the absolute path of the directory to be imported, like this:

root@038a06b0c059:/var/www/html/Lychee# php artisan lychee:sync /uploads/import/

6 participants