minhash dedup causes local machine to hang. #222
Comments
What stage is hanging?
It hung at stage 2. FYI, stage 1 took 6 hours to complete.
I am not super sure about the version we currently have on PyPI, but you should be able to set
Thank you, I'll try that.
Currently my goal is to deduplicate ~750GB of text (around 750 jsonl files, each about 1GB). My machine has 1TB of RAM and 256 CPU cores. I used the following config to run MinHash deduplication, but my machine then hung for more than 24 hours. I couldn't even Ctrl+C the process, so I had to reboot the server.
I did limit the workers and tasks to be lower than the number of my CPU cores, so I'm pretty clueless about what is causing my server to hang. Please suggest a better config to run MinHash smoothly.
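As a rough sanity check on the numbers above, the sketch below computes how much RAM each concurrent worker could use, and how many input files each task would receive. It is only back-of-the-envelope arithmetic, not a diagnosis of the hang and not part of any library: the function names, the 128-worker and 256-task values, and the 104GB reserve are hypothetical, since the actual config is not shown here.

```python
def per_worker_budget_gb(total_ram_gb: float, workers: int, reserve_gb: float = 0.0) -> float:
    """Return the RAM (in GB) available per concurrent worker.

    `reserve_gb` is memory set aside for the OS and page cache (an assumption).
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    if reserve_gb >= total_ram_gb:
        raise ValueError("reserve_gb must be smaller than total_ram_gb")
    return (total_ram_gb - reserve_gb) / workers


def files_per_task(num_files: int, tasks: int) -> int:
    """Return the largest number of input files a single task would handle."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    # Ceiling division without floating point.
    return -(-num_files // tasks)


if __name__ == "__main__":
    # Machine and dataset sizes from the post; worker/task counts are hypothetical.
    budget = per_worker_budget_gb(total_ram_gb=1000, workers=128, reserve_gb=104)
    print(f"RAM per worker: {budget:.1f} GB")
    print(f"Files per task: {files_per_task(num_files=750, tasks=256)}")
```

Keeping the worker count below the core count bounds CPU contention but not memory, so a per-worker figure like this may be worth comparing against the observed peak memory of a single worker.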