Parsing bug with multiple workers #74
Comments
Yeah, I've seen this too and also thought it was a bug. I think it's an artifact of the file being small. If you can think of a smarter method for this, I'm all for it.
Hmm, let me take a look. Can you please reference the file and line of the function that does this? It would save me some time.
Sorry, I looked briefly and couldn't find it; I may be wrong. I don't believe, though, that this affects a file larger than a few pages. Please let me know if you can discover anything. The file-reader is here, and dumpster-dive is using percentages, so it could be a rounding error too.
From my understanding, it picks a specific line in the file, at 25% let's say (for the 2nd of 4 workers), so it is very possible that it will not land exactly at the start of a page. I did find this occurring in a large wikidump.
Ah, ok. Shoot. I didn't think it was happening, because duplicate pages throw errors on mongo-writes, and I didn't see any.
I don't think it's a matter of duplicates, but rather of a page being split between two workers, each worker not getting all the information it needs and just skipping it (moving on to the next page).
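The failure mode described above can be sketched in a few lines. This is illustrative JavaScript only, not dumpster-dive's actual file-reader: it shows why splitting a dump by byte percentage can drop a page when the split point lands inside a `<page>…</page>` block.

```javascript
// Illustrative only; not dumpster-dive's actual file-reader. The split point
// can land inside a <page>…</page> block: the first worker stops mid-page,
// and the second worker skips forward to the next "<page>" it can see.
const xml = [
  '<page><title>A</title><text>aaaa</text></page>',
  '<page><title>Bodmin</title><text>some long article text here</text></page>',
  '<page><title>C</title><text>cccc</text></page>',
].join('\n');

// Naive 50% byte offset for worker #2 of 2: it lands inside the Bodmin page.
const mid = Math.floor(xml.length / 2);

// Worker 1 reads from the start up to the raw midpoint; worker 2 starts at
// the next "<page>" at-or-after the midpoint.
const w1 = xml.slice(0, mid);
const w2 = xml.slice(xml.indexOf('<page>', mid));

// Count only pages that are complete (have both open and close tags).
const complete = (s) => (s.match(/<page>[\s\S]*?<\/page>/g) || []).length;

console.log(complete(xml));               // 3 pages in the file
console.log(complete(w1) + complete(w2)); // 2: the Bodmin page is lost
```

Neither worker parses the straddling page completely, which matches the symptom: no duplicate-key errors on mongo-writes, just a silently missing or truncated page.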
Hey,
I've found a bug occurring when using multiple workers.
Take for example the tinywiki dataset. When I run the following code:
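(The snippet itself did not survive in this copy of the thread. The sketch below is a hypothetical reconstruction: it assumes dumpster-dive exports a `dumpster(options, callback)` function and accepts `db` and `workers` options; check the project's README for the real API.)

```javascript
// Hypothetical reconstruction; the reporter's original code is not preserved
// here. The entry point, option names, and db name are assumptions.
let dumpster = null;
try {
  // Guarded require so the sketch stays inert when the package is absent.
  dumpster = require('dumpster-dive');
} catch (e) {}

const file = process.argv[2]; // e.g. ./tests/tinywiki-latest-pages-articles.xml

if (dumpster) {
  // Change workers from 1 to 4 to compare the two runs described below.
  dumpster({ file: file, db: 'tinywiki', workers: 1 }, () => console.log('done!'));
}
```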
where I pass the script the path to the tinywiki XML file through argv[2], which is ./tests/tinywiki-latest-pages-articles.xml. When I run it with 1 worker, I get the following print:
In contrast to what I get when I run it with 4 workers (look at what happens to the Bodmin and Big Page text lengths):
I haven't looked at how the work is divided among the workers, but my guess is that the file is getting chopped in the middle of pages, maybe making their text unreadable by the parser?
Thanks!