Hey Gabriel,
Cool project! I am interested in working with it. Could you expand the documentation on performing distributed training with it? Training on multiple GPUs on a single node would be a great start.
Mihir
Thanks for the interest, Mihir! Sorry I've been a bit busy with deadlines, but I'll aim to explain the setup better this week!
I'll warn that, unfortunately, the current distributed setup still needs a bit of babysitting. I set it up very manually, with three pieces: a Redis server serving as a job queue; Celery workers that I spawn by hand (celery -A worker worker --concurrency=1 -n redis-server-ip-address:port); and the main process, bootstrap.py, which loads the queue with jobs (it does the conjecturing), collects the results, and does the fine-tuning. To run bootstrap.py, set the REDIS environment variable to redis-server:port, like in the worker, and run it with DISTRIBUTED=1.
One known problem: Redis would eventually fill up and not evict keys, despite my attempts at using LRU and limiting memory, so it would stop taking new jobs and I'd have to restart it. That's likely due to me not knowing how to set it up properly. bootstrap.py checkpoints everything, though, so you can pick up from the last iteration if you point it to a checkpoint.
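For the single-node, multi-GPU case asked about above, the manual steps could be scripted roughly as follows. This is only a sketch under stated assumptions, not the project's own launcher: the address localhost:6379 is a placeholder (6379 is Redis's default port), the helper names are made up for illustration, and pinning one worker per GPU through CUDA_VISIBLE_DEVICES assumes the workers use the GPU, which the thread does not say. It also assumes bootstrap.py needs no further arguments. The script only prints the command lines, so nothing is launched.

```python
import os
import shlex


def worker_command(redis_address):
    """Celery worker invocation as quoted in the thread (one job at a time)."""
    return ["celery", "-A", "worker", "worker", "--concurrency=1", "-n", redis_address]


def bootstrap_env(redis_address, base_env=None):
    """Environment for bootstrap.py: REDIS and DISTRIBUTED=1, as described in the thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["REDIS"] = redis_address
    env["DISTRIBUTED"] = "1"
    return env


def shell_line(command, extra_env):
    """Render a command plus environment overrides as one copy-pasteable shell line."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in extra_env.items())
    return f"{prefix} {shlex.join(command)}".strip()


if __name__ == "__main__":
    address = "localhost:6379"  # placeholder: replace with your redis-server:port
    n_gpus = 2                  # assumption: one worker per GPU on a single node

    for gpu in range(n_gpus):
        # CUDA_VISIBLE_DEVICES pinning is an assumption, not something the thread describes.
        print(shell_line(worker_command(address), {"CUDA_VISIBLE_DEVICES": str(gpu)}))

    # Main process: loads the queue, collects results, fine-tunes.
    print(shell_line(["python", "bootstrap.py"], {"REDIS": address, "DISTRIBUTED": "1"}))
```

To actually launch a process instead of printing, the same pieces can be passed to subprocess.Popen, for example subprocess.Popen(["python", "bootstrap.py"], env=bootstrap_env(address)). The Redis server still has to be started separately before any of this runs.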
In the long run, I intend to revamp the whole distributed setup with Ray. It does what I did by hand in a much cleaner way, and it looks like it won't be a lot of work to switch. But it will take me a month or so before I get back to this!