
Expand documentation - distributed training #2

Open
maharajamihir opened this issue Oct 23, 2024 · 2 comments

Comments

@maharajamihir
Contributor

Hey Gabriel,

Cool project! I'm interested in working with it. Could you expand the documentation on performing distributed training with it? Training on multiple GPUs on a single node would be a great start.

Mihir

@gpoesia
Owner

gpoesia commented Oct 30, 2024

Thanks for the interest, Mihir! Sorry I've been a bit busy with deadlines, but I'll aim to explain the setup better this week!

I'll warn that unfortunately the current distributed setup still needs a bit of babysitting. I set it up very manually: a Redis server serves as a job queue; Celery workers are spawned by hand (celery -A worker worker --concurrency=1 -n redis-server-ip-address:port); and the main process, bootstrap.py, loads the queue with jobs (it does the conjecturing), collects results, and does the fine-tuning. For bootstrap.py, set the REDIS environment variable to redis-server:port like in the worker, and run it with DISTRIBUTED=1.

It's likely due to me not knowing how to set it up properly, but Redis would eventually fill up and not evict keys (despite my attempts at using LRU and limiting memory), and so stop taking new jobs, so I'd have to restart it. Since bootstrap.py checkpoints everything, though, you can pick up from the last iteration by pointing it to a checkpoint.
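To make the launch sequence concrete, a minimal sketch of the setup described above could look like the lines below. The Redis port, the memory cap, and the python invocation of bootstrap.py are placeholders or assumptions beyond this comment; the celery command is quoted from it as-is.

# Start the Redis job queue. The memory cap and LRU policy reflect the eviction
# attempts mentioned above; the exact values are placeholders to tune.
redis-server --port 6379 --maxmemory 8gb --maxmemory-policy allkeys-lru &

# On each GPU node, spawn a Celery worker by hand, pointing it at the Redis server
# (replace redis-server-ip-address:port with your actual address).
REDIS=redis-server-ip-address:port celery -A worker worker --concurrency=1 -n redis-server-ip-address:port

# Main process: fills the queue with conjecturing jobs, collects results,
# fine-tunes, and checkpoints every iteration.
REDIS=redis-server-ip-address:port DISTRIBUTED=1 python bootstrap.py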

In the longer run, I'm intending to revamp the whole distributed setup with Ray -- it does what I did by hand in a much cleaner way, and looks like it won't be a lot of work to switch. But it will take me a month or so before I switch back to this!

@emergenz
Contributor

Nice. This seems to work for me for single-node setups. I added a PR with docs and launchers in #5.
