More output options #56
Comments
Hey @Chaphasilor and @MCOfficer
Yes, I of course ran into issues like this. The biggest JSON is 6.23GB, which still works on my local machine with 48GB RAM, but yes, it's bad 😂 JSON has some positive things, but the size part and the RAM part aren't among them. One thing that is still good is the parent/child structure, which you can't really do nicely in any line-based format like txt/csv etc.
I thought about this issue before and wanted to completely rewrite it all to use SQLite; then RAM also doesn't matter while still scanning. When I see the queue reach 100,000 I mostly stop scanning 🤣 The SQLite part is too much effort for NOW, but I will try, hopefully just around new year, no promises.
The Reddit part is already logged in History.log. Errors are not in a separate log, but maybe that could already be done by adding your own nlog.config. I'm not sure about that, because I changed that lately for the single file releases.
The TXT part I added (ugly) and will be released in some minutes :)
I'm already using CSV for other tools, so I could also easily add that. Will do when I have some time. |
See https://github.com/KoalaBear84/OpenDirectoryDownloader/releases/tag/v1.9.2.6 for intermediate TXT saving. |
Yeah, we are aware of that. I thought about either explicitly listing the 'path' for each file (maybe using ids) or adding special lines that indicate the 'path' of each following file (url). Could be pseudo-nested, or explicit.
I'm not familiar with SQLite, isn't it a database? If it's just a file format that is easy to parse, that would be nice, but I believe a database would make ODD more complicated and harder to use/wrap.
I'll check it out tomorrow. Does it only contain the reddit markdown stuff? The reason I'm asking is because parsing stdout to find and extract that table is more of a workaround than a solution ^^
You're awesome! <3
Don't rush this. CSV was simply one format that came to mind for having easily-parsable files with items that contain meta-info. There might be better formats than this. |
Here's an idea making use of JSONLines. It's not pretty, but I don't believe one can actually represent nested structures in a streaming manner:
{ "type": "root", "url": "https://OD/", "subdirectories": ["public"] }
{ "type": "file", "url": "https://OD/test.txt", "name": "test.txt", "size": 351 }
{ "type": "directory", "url": "https://OD/public", "name": "public", "subdirectories": [] }
No matter how you put it, it will be pretty hard to rebuild a nested structure from this data format, but that's what JSON is for. |
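As a rough illustration of why a line-based format helps with memory, here is a minimal Python sketch that consumes records shaped like the three example lines one at a time. The helper name `iter_records` is made up for this example, and the record fields are simply the ones from the proposal above, not an agreed format.

```python
import io
import json

def iter_records(lines):
    """Yield one parsed record per non-empty line.

    Only a single line is parsed at a time, so memory use does not
    grow with the size of the file.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for an open file; a real scan result would be read from disk.
sample = io.StringIO(
    '{ "type": "root", "url": "https://OD/", "subdirectories": ["public"] }\n'
    '{ "type": "file", "url": "https://OD/test.txt", "name": "test.txt", "size": 351 }\n'
    '{ "type": "directory", "url": "https://OD/public", "name": "public", "subdirectories": [] }\n'
)

total_size = 0
file_urls = []
for record in iter_records(sample):
    if record["type"] == "file":
        file_urls.append(record["url"])
        total_size += record["size"]
```

Collecting URLs or summing sizes works fine this way; rebuilding the nesting is the part that needs extra information in the records.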
I believe IDs might help us out: Just rebuild only the structure of the OD with JSON, as compact as possible. Include just the dir names along with a random ID. Without all the files and meta-info. Put it as the first line. And then below that, for each ID, add the necessary meta info. If I'm not missing something obvious, this should make it possible to rebuild the nested structure? |
I just put a 'proof of concept' together here: I'm using a jsonlines-based format, where the first line contains the general info about the scan, the second line contains the directory structure and the following lines contain meta-info about the directories and files. The tool can take the regular JSON output and parse it into the new file format, for testing purposes. Only works with small files (<50kB), due to the limitations discussed above. It can also take the new file format and parse it into the old format, proving that the new format perfectly preserves all info and the OD structure, without many of the previous drawbacks. Would love to hear your thoughts on this @KoalaBear84 @MCOfficer :D Edit: The file format is just an example. We could use that one, but if there are even better ways to do it, I'm all for it! |
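A sketch of how such an ID-based file could be turned back into a nested structure, in Python. The key names (`id`, `directoryId`, `subdirectories`, `files`) and the sample lines are assumptions made up for this example; the actual proof of concept may name and arrange things differently.

```python
import json

# Hypothetical sample: line 1 = scan info, line 2 = directory structure
# with IDs, remaining lines = per-file meta-info pointing at a directory ID.
lines = [
    '{"type": "scan", "root": "https://OD/"}',
    '{"id": "a1", "name": "", "subdirectories": '
    '[{"id": "b2", "name": "public", "subdirectories": []}]}',
    '{"directoryId": "a1", "name": "test.txt", "size": 351}',
    '{"directoryId": "b2", "name": "readme.md", "size": 12}',
]

def rebuild(lines):
    """Rebuild the nested directory tree from the ID-based line format."""
    records = (json.loads(line) for line in lines)
    scan_info = next(records)  # first line: general scan info
    tree = next(records)       # second line: directory structure

    # Index every directory by its ID so files can find their parent.
    by_id = {}
    stack = [tree]
    while stack:
        directory = stack.pop()
        directory["files"] = []
        by_id[directory["id"]] = directory
        stack.extend(directory["subdirectories"])

    # Remaining lines: attach each file to the directory it references.
    for record in records:
        parent = by_id[record.pop("directoryId")]
        parent["files"].append(record)

    return scan_info, tree

scan_info, tree = rebuild(lines)
```

A consumer that does not need the full tree could keep only the directory index in memory and handle each file line as it arrives.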
The file format would work - i lack the experience to design something better, tbh. One thing i found counterintuitive is that each file has an ID, which is actually the ID of its parent directory. Should either be named accordingly, or be moved to a per-file basis. |
Yeah, those keys could (should) be renamed. I'll think about a better naming scheme tonight! |
Okay, I've renamed the parameter to I also fixed a bug that caused the directory names to get lost in translation; now the only differences between the original file and the reconstructed file are the IDs and some slight reordering. From where I'm standing, the only thing left to do is implementing this in C#/.NET. If @KoalaBear84 could point me in the general direction, I'd be willing to contribute a PR to offer the new format alongside the currently available ones. On a different note: |
True. I'll take a look at it another time. Too much going on right now, with the homeschooling part as an extra added bonus 😂
I also want to rewrite it to a SQLite version. Then it doesn't matter at all how big a directory is. Right now the whole directory structure is built up in memory exactly like it is in the JSON. But it's not particularly great for processing, as we all have experienced. Who would have imagined ODs with 10K directories and 10M files 🤣
And because SQLite has an implementation in nearly every language, it is portable. |
Does that mean SQLite can dump its internal DB so we can import it somewhere else? Or how does SQLite help with exporting and importing scan results? |
SQLite is just a database, you can use a client for every programming language and read it, and import it wherever you want. |
Ah okay, I took a quick look at it but didn't think it through xD Makes sense 👌🏻 |
It feels wrong to use a database as data exchange format, but i can't seem to find any arguments against it. weird. |
No, this isn't any promise, as I see it's even more work than I thought. I have to rewrite a lot of code and handle all the parent/subdirectory things with database keys/IDs.
But I've made a start. What I expected was right: it does get slower, because for every URL it wants to process, it checks whether that URL is already in the database. Besides that, it needs to insert every directory and URL on disk, which also takes time.
For now it looks like an open directory with 950 directories and 15,228 files goes from 9 seconds to 16 seconds scanning/processing time. But... that is with all queueing still in memory, and all of that has to be rewritten to use SQLite too.
So.. I started it as a test, but 95% is yet to be done, and this already took 4 hours to check 😮 |
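To make the "parent/subdirectory with database keys" idea concrete, here is a small Python sketch using the standard `sqlite3` module. The table and column names and the `add_directory` helper are invented for illustration; they are not ODD's actual schema, and ODD itself is C#.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a file path would put it on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE directories (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES directories(id),
        url       TEXT NOT NULL UNIQUE
    );
    CREATE TABLE files (
        id           INTEGER PRIMARY KEY,
        directory_id INTEGER NOT NULL REFERENCES directories(id),
        url          TEXT NOT NULL UNIQUE,
        size         INTEGER
    );
""")

def add_directory(conn, url, parent_id=None):
    """Insert a directory unless its URL is already stored.

    Returns (directory id, True if newly inserted). The lookup before the
    insert is the per-URL check that makes scanning slower.
    """
    row = conn.execute(
        "SELECT id FROM directories WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO directories (parent_id, url) VALUES (?, ?)",
        (parent_id, url),
    )
    return cur.lastrowid, True

root_id, root_new = add_directory(conn, "https://OD/")
public_id, public_new = add_directory(conn, "https://OD/public/", root_id)
again_id, again_new = add_directory(conn, "https://OD/public/", root_id)
```

The `UNIQUE` constraint on `url` gives SQLite an index to answer the "already in it?" lookup without scanning the whole table.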
I assume you are using one SQLite DB for every scanned OD. In that case, you could maintain a HashSet (or whatever C#'s equivalent is) of all visited URLs, which is significantly faster to check against.
This idea may be more complicated and i lack the experience to judge it, but: This may also make the HashSet unnecessary, as in-memory DBs are typically blazingly fast. I'm not sure how much faster they are when reading though, because reading can be cached quite efficiently. |
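A minimal sketch of the visited-set suggestion, written in Python, where `set` plays the role of C#'s `HashSet`. The `urls` table and the `record_url` function are hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT NOT NULL)")

# In-memory set of every URL seen so far; membership checks are O(1)
# on average and never touch the database.
visited = set()

def record_url(url):
    """Store the URL if it is new; return False if it was already seen."""
    if url in visited:
        return False
    visited.add(url)
    conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    return True

results = [
    record_url(u)
    for u in ["https://OD/a", "https://OD/b", "https://OD/a"]
]
stored = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
```

The trade-off is that the set keeps every URL string in memory, so it buys lookup speed at the cost of some of the RAM the database was meant to save.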
Yes, HashSet is the same in C#, I have used it before. The funny thing is that I didn't use it in the current implementation. I also want to make a 'resume' function, so that you can pause the current scan, because you need to restart the server/computer, and continue the previous scan later. HashSet is probably a good choice for this problem.
Indeed, I have used "Data Source=:memory:" as well. That was my first test, which inserted 10,000 URLs in 450ms. Then I changed to using disk, which takes 120 SECONDS for 10,000 URLs, but that was before performance optimizations 😇 I think that it will be fast enough, especially when the OD becomes very big and we have no more memory issues.
Also, writing the URLs file will not depend on memory anymore and will be a lot faster when the OD has a lot of URLs. We can just query the database, and everything will stream from database to file.
Refactored some more now, rewrote everything to the native SQLite thing. Hopefully more news 'soon'.
Ahh, looks like the 5 SQLite library DLLs needed are only 260 kB, I expected MBs 😃 |
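The "stream from database to file" part could look roughly like this in Python. The `files` table and the `write_urls` helper are assumptions for the sake of the example, and an `io.StringIO` stands in for the real URLs file.

```python
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (url TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO files (url) VALUES (?)",
    [("https://OD/b.txt",), ("https://OD/a.txt",)],
)

def write_urls(conn, out):
    """Write one URL per line to `out`, fetching rows lazily from the cursor.

    Rows are pulled from SQLite as the loop advances, so the full URL
    list is never held in a Python list.
    """
    count = 0
    for (url,) in conn.execute("SELECT url FROM files ORDER BY url"):
        out.write(url + "\n")
        count += 1
    return count

buffer = io.StringIO()  # stand-in for open("urls.txt", "w")
written = write_urls(conn, buffer)
```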
I believe our goal was to reduce memory usage, yes? 😉 Maybe a combination of HashSet and SQLite really is the way to go, combining speed with efficiency... But I guess @KoalaBear84 knows best 😇 |
Hmm. For the performance optimization I use a "write-ahead log", this works great, but 'pauses' every 500 or 1000 records/inserts. I was thinking, I might want to have some sort of queue for inserting the files, and process directories on the fly and do the files in a separate thread, this way we maybe can have both. Also a note for myself 😃 |
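One possible shape for the "queue plus separate insert thread" idea, sketched in Python with `sqlite3` in write-ahead-log mode. The batch size, the `files` table, and the `writer` function are illustrative choices for this example, not what ODD does.

```python
import os
import queue
import sqlite3
import tempfile
import threading

BATCH_SIZE = 500   # illustrative; inserts are committed in batches of this size
STOP = object()    # sentinel telling the writer thread to finish

def writer(db_path, work):
    """Drain the queue and insert URLs in batches, one commit per batch."""
    # The connection is created inside the thread that uses it.
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # write-ahead log (on-disk DBs only)
    conn.execute("CREATE TABLE IF NOT EXISTS files (url TEXT NOT NULL)")
    batch = []
    while True:
        item = work.get()
        if item is not STOP:
            batch.append((item,))
        if batch and (item is STOP or len(batch) >= BATCH_SIZE):
            with conn:  # commits the batch as a single transaction
                conn.executemany("INSERT INTO files (url) VALUES (?)", batch)
            batch.clear()
        if item is STOP:
            break
    conn.close()

db_path = os.path.join(tempfile.mkdtemp(), "scan.db")
work = queue.Queue()
thread = threading.Thread(target=writer, args=(db_path, work))
thread.start()

# The scanning side only enqueues and never waits for the disk.
for i in range(1200):
    work.put(f"https://OD/file{i}.txt")
work.put(STOP)
thread.join()

check = sqlite3.connect(db_path)
row_count = check.execute("SELECT COUNT(*) FROM files").fetchone()[0]
check.close()
```

With this split, any pause caused by a commit is absorbed by the writer thread while directory processing keeps filling the queue.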
Also linking this issue to #20 😜 |
Coming back to this issue, did I understand it correctly that when using disk-based SQLite, the memory usage would be "near"-zero, no matter how large the OD? |
Yes, a resume feature would be awesome! |
Well.. I sort of gave up on resume. Currently the whole processing of URLs depends on all data being in memory, because it looks 'up' 4 levels of directories to see if we are in a recursive / endless loop.. It's very hard to rewrite everything, and that costs a lot of time which I don't have / want to spend. 🙃 |
Is your feature request related to a problem? Please describe.
We are currently running into the problem that we have very large (3GB+) JSON files generated by ODD, but can't process them because we don't have enough RAM to parse the JSON.
I personally love JSON, but it seems like the format is not well-suited for the task (it's not streamable).
Now, you might ask: why don't you guys just use the .txt file? The problem is that this is only created after the scan is finished, including file size estimations. After scanning a large OD for ~6h yesterday, I had a couple million links, with over 10M links left in queue for file size estimation. The actual URLs were already there, but the only way to save them was through hitting J for saving as JSON.
Describe the solution you'd like
There are multiple features that would be useful for very large ODs:
.txt-file: this should be no problem at all and is simply a missing option/command at this point
think jsonlines, csv, whatever: it might also be a good idea to restructure the meta info of the scan and files in order to remove duplicate info and make the output files smaller and easier to work with
@MCOfficer and I would be glad to discuss the new file structure further, if you're so inclined :)