scanner uses a lot of memory #144
Hello! I'm surprised by your scanning times 😮. I have a library with around 29,000 tracks and it takes 8 minutes on my computer. But I know the scanner has this issue of opening media files twice (once for metadata and again to check for embedded cover art), which most likely hurts its performance.
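Supysonic reads tags through Mutagen, so this is not the project's actual code — just a stdlib-only sketch of the "open once, derive both facts" idea, parsing only enough raw ID3v2 to get the tag size and check for an embedded cover (an `APIC` frame) in a single read:

```python
def scan_id3(path):
    """Open the file once and pull out both facts the scanner needs:
    the ID3v2 tag size and whether an APIC (cover art) frame exists."""
    with open(path, "rb") as f:
        header = f.read(10)
        if len(header) < 10 or header[:3] != b"ID3":
            return None, False
        # ID3v2 tag size is a 28-bit "synchsafe" integer (7 bits per byte)
        size = (header[6] << 21) | (header[7] << 14) | (header[8] << 7) | header[9]
        tag = f.read(size)
    # a real parser would walk the frames; a substring search keeps the sketch short
    return size, b"APIC" in tag
```

The point is only that both pieces of information come out of one `open()` and at most two `read()` calls, instead of re-opening the file per question.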
Go ahead if you have ideas on how to improve the whole thing. I'm not closed to symlinks, but they could lead to potential issues if they create loops on the file system, or point to folders that are already tracked by Supysonic.
Performance might differ because my music database is on spinning rust; are you using an SSD?
Yeah, I think it might be worth flushing the good stuff once in a while. As things stand now, it looks like all the entries are committed to the database at once, at the very end, and it takes about a minute to do so (gross estimate). I don't have the feeling it can recover from an interruption either, which is a bummer when you're running a process that takes half an hour... ;)
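Committing in batches is the usual fix for both the memory spike and the all-or-nothing recovery. Supysonic uses Pony ORM, so the sketch below (stdlib `sqlite3`, hypothetical `track` table and function names) only illustrates the pattern, not the project's code:

```python
import sqlite3

def insert_tracks(db_path, tracks, batch_size=500):
    """Insert scanned tracks, committing every `batch_size` rows so memory
    stays bounded and an interrupted scan keeps most of its progress."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS track (path TEXT PRIMARY KEY, title TEXT)")
    pending = 0
    for path, title in tracks:
        con.execute("INSERT OR REPLACE INTO track VALUES (?, ?)", (path, title))
        pending += 1
        if pending >= batch_size:
            con.commit()   # flush the batch; progress survives a crash here
            pending = 0
    con.commit()           # flush the tail
    con.close()
```

Choosing `batch_size` is a trade-off: larger batches amortize transaction overhead, smaller ones lose less work on interruption.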
So I think you're correct in that concatenating entries will cause Python to crash because it tries to decode the bytes into whatever the environment says it should be. And yes, a similar error will occur on insert as well. The thing with pathnames in UNIX is that they are not UTF-8. They can be encoded in any shape or form. On my filesystem, I have files encoded as ASCII, UTF-8, UTF-16 and ISO-8859-1. I know, it's bad, and I should clean that up, but it's not actually breaking any standard whatsoever. The problem is that people still use various weird encodings for their filenames, and because we exchange those files around, this is bound to cause problems again.

So my position on filenames is they should always be treated as binary, until they are displayed to the user, at which point an educated guess should be made to try and show something meaningful. There are various ways to do that, including surrogate escape techniques, but in any case, broken display shouldn't hurt scanning.

I should also mention that I think Python 2 support is now overrated. ;) I think it's now perfectly acceptable for a "leaf" package like Supysonic to switch to Python 3 completely, and even require Python 3.5, for example. Py3 is well distributed and established in all major distributions, and keeping Python 2 compatibility shims will just make your work harder for no good reason. ;)

I'll see if I can come up with a better scanner. And preferably one that is verifiably better - do you have any advice on how I could hook performance tests into the test suite?

Oh, and regarding symlinks I'd say: garbage in, garbage out. If people create loops in their filesystem, they get to see Supysonic loop forever. Maybe they like stuff like that, I don't know why people would create such atrocities. ;) But from a security perspective, I think it's sufficient to have a flag (off by default) that lets users say "I trust this filesystem, there are no loops" and enables symlink following.
It would certainly be sufficient for my use case anyways. :) Thanks so much for your feedback, it's very valuable!
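Beyond the trust flag, a scanner can also defend itself against loops directly. A common guard is to remember each directory's `(st_dev, st_ino)` pair and refuse to enter the same directory twice — a hypothetical helper, not Supysonic code:

```python
import os

def walk_follow_symlinks(root):
    """Yield file paths under `root`, following symlinks but refusing to
    enter any directory twice, which is exactly what a symlink loop does."""
    seen = set()
    stack = [root]
    while stack:
        d = stack.pop()
        st = os.stat(d)                  # follows the symlink to the real dir
        key = (st.st_dev, st.st_ino)
        if key in seen:                  # loop, or same dir reachable twice
            continue
        seen.add(key)
        with os.scandir(d) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=True):
                    yield entry.path
```

This trades a bit of memory (one `(dev, inode)` tuple per directory) for guaranteed termination, so the flag could even default to on.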
Oh, and about encoding, there are things like chardet that can be used to make an educated guess about the character encoding when UTF-8 fails. It's slow, but the idea is that it's a fallback when all else fails.
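That fallback chain might look like the sketch below (hypothetical helper; chardet is optional and only imported after strict UTF-8 has already failed, and Latin-1 is the terminal fallback because every byte value is valid Latin-1):

```python
def guess_text(raw: bytes) -> str:
    """Display-only decoding: strict UTF-8 first, then an optional
    detector like chardet, then Latin-1, which never fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        import chardet  # optional and slow: only reached when UTF-8 fails
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return raw.decode(guess, "replace")
    except ImportError:
        pass
    return raw.decode("latin-1")  # every byte sequence is valid Latin-1
```

Since this is for display only, a wrong guess just renders oddly; it never aborts the scan.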
Yes.
So you're suggesting we store paths as binary? There are times where filenames are indeed presented to the user:
I'm patiently waiting for Python 2's EOL to remove all that compatibility layer. Doing it earlier crossed my mind, but it seems some people still use Py2.7 for Supysonic (e.g. #137).
err sorry no.
That's so simple I didn't even think about it :D
Yes, actually. Or at least something 8-bit (e.g. ISO-8859-1) that doesn't need to be encoded/decoded.
Yeah that's fine. You can do the decode stuff then, but it's important to keep a handle on the binary in the back.
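"Decode for display, keep a handle on the binary" is exactly what Python's `surrogateescape` error handler (the mechanism behind `os.fsdecode`/`os.fsencode`) provides — undecodable bytes become lone surrogates, so the original bytes stay recoverable:

```python
def display_name(raw: bytes) -> str:
    """Best-effort text for the UI. Bytes that aren't valid UTF-8 become
    lone surrogates instead of raising, and the decode is reversible."""
    return raw.decode("utf-8", "surrogateescape")

# A Latin-1 filename that is not valid UTF-8:
raw = "caf\xe9.mp3".encode("latin-1")   # b'caf\xe9.mp3'
shown = display_name(raw)               # 'caf\udce9.mp3'
# the round-trip is lossless, so filesystem access still works:
assert shown.encode("utf-8", "surrogateescape") == raw
```

The surrogate may render as a placeholder glyph in the UI, but the path used for `open()` is byte-for-byte the original.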
I honestly have very little patience for this nowadays. :) For libs I understand, because sometimes there's this one library that hasn't been ported, so the entire project still needs to work with Python 2. But Supysonic is not a library, really... It's an end-user program, so I don't get why someone would run it under Python 2. (Well, that's actually a lie: I had to run it as Python 2 on Debian stretch + Apache + WSGI because the WSGI bindings are py2-only. But then I could have set things up with fcgi as well and it should just have worked...)
Hehe.. Glad to be of service. ;)
gah, didn't find that one when i looked i guess. weird. it was pretty late. :p
Should prevent 'database is locked' issues Also helps reducing memory usage (#144)
Oh btw, the full scan no longer uses a single database session (as Pony calls it) for the whole process. Please tell me if that helps with the original issue.
Hi!
I've finally found the time to give supysonic a try, tired of dealing with the other Java monstrosities that populate the Subsonic ecosystem. ;) So, first: thanks for providing such an alternative, it's a breath of fresh air.
I have found that the initial scan of a large-ish music database (30 000+ songs) takes a long time (30 minutes with sqlite, 60 minutes with mysql) and a lot of memory (700MB). If I read the code correctly, entries are all recorded in memory before being flushed out to disk all at once. Is that correct?
I think it would be more efficient to write those entries directly to the database instead. It would be faster and less memory hungry.
As an aside, is there any reason why `os.walk` is not used to walk the file hierarchy? I understand it used to be much slower than `listdir` in previous versions, but since we're stat'ing all files anyway, the performance impact shouldn't be that bad. And besides, newer versions (Python 3.5+) use `os.scandir`, so it's actually much faster now. All I can see from the git history is 954c75b, which says something about file permissions.

If you're open to the idea, I would like to reimplement parts of the scanner so that it's more efficient and flexible. I would like, for example, to follow symlinks (#48) and ignore certain directories as well.