
Expand documentation - distributed training #2

Open
maharajamihir opened this issue Oct 23, 2024 · 2 comments

Comments

@maharajamihir
Contributor

Hey Gabriel,

Cool project! I'm interested in working with it. Could you expand the documentation on performing distributed training with it? Training on multiple GPUs on a single node would be a great start.

Mihir

@gpoesia
Owner

gpoesia commented Oct 30, 2024

Thanks for the interest, Mihir! Sorry I've been a bit busy with deadlines, but I'll aim to explain the setup better this week!

I'll warn that unfortunately the current distributed setup still needs a bit of babysitting. I set it up very manually: a Redis server serves as a job queue; Celery workers are spawned by hand (celery -A worker worker --concurrency=1 -n redis-server-ip-address:port); and the main process, bootstrap.py, loads the queue with jobs (it does the conjecturing), collects results, and does the fine-tuning. For bootstrap.py, set the REDIS environment variable to redis-server:port like in the worker, and run it with DISTRIBUTED=1.

It's likely due to me not knowing how to set it up properly, but Redis would eventually fill up and not evict keys (despite my attempts at using LRU and limiting memory), and so stop taking new jobs, so I'd have to restart it. Since bootstrap.py checkpoints everything, though, you can pick up from the last iteration by pointing it to a checkpoint.
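To make the launch sequence concrete, a minimal sketch of the setup described above could look like the lines below. The Redis port, the memory cap, and the python invocation of bootstrap.py are placeholders or assumptions beyond this comment; the celery command is quoted from it as-is.

# Start the Redis job queue. The memory cap and LRU policy reflect the eviction
# attempts mentioned above; the exact values are placeholders to tune.
redis-server --port 6379 --maxmemory 8gb --maxmemory-policy allkeys-lru &

# On each GPU node, spawn a Celery worker by hand, pointing it at the Redis server
# (replace redis-server-ip-address:port with your actual address).
REDIS=redis-server-ip-address:port celery -A worker worker --concurrency=1 -n redis-server-ip-address:port

# Main process: fills the queue with conjecturing jobs, collects results,
# fine-tunes, and checkpoints every iteration.
REDIS=redis-server-ip-address:port DISTRIBUTED=1 python bootstrap.py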

In the longer run, I'm intending to revamp the whole distributed setup with Ray -- it does what I did by hand in a much cleaner way, and looks like it won't be a lot of work to switch. But it will take me a month or so before I switch back to this!

@emergenz
Contributor

Nice. This seems to work for me for single-node setups. I added a PR with docs and launchers in #5.
