production server with multiple instances #1184
Update on July 3, 2024
Essentially, we have only one remaining issue:
The remaining issue is in #1181. To set up NFS, I followed this guide and did everything except ufw (firewall), which we might set up later if needed.
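For future reference, here is a rough sketch of the kind of setup that guide walks through; the export path, client subnet, and server IP are placeholders, not our actual values:

```bash
# On the instance that stores the data (NFS server); export path and client
# subnet are placeholders.
sudo apt install -y nfs-kernel-server
echo "/srv/rodan-data 192.168.0.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a
sudo systemctl restart nfs-kernel-server

# On the GPU worker instance (NFS client): mount the shared directory.
sudo apt install -y nfs-common
sudo mkdir -p /srv/rodan-data
sudo mount <nfs-server-ip>:/srv/rodan-data /srv/rodan-data
# Add a matching /etc/fstab entry so the mount survives reboots, e.g.
# <nfs-server-ip>:/srv/rodan-data  /srv/rodan-data  nfs  defaults  0  0
```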
Good practice to detach and delete a worker instance (all on the worker instance):
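Roughly, the sequence looks like this (node names are placeholders; note that removing the node record is done from the manager):

```bash
# On the worker instance: leave the swarm so no services remain scheduled on a
# node that is about to be deleted.
docker swarm leave

# On the manager: the node now shows as "Down"; remove it from the node list.
docker node ls
docker node rm <worker-node-name>

# Finally, detach any attached volumes and delete the instance itself from the
# Arbutus (OpenStack) dashboard.
```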
Update on July 10, 2024
rodan2.simssa.ca is the production server with a single vGPU instance. Anything except PACO training can run. GPU jobs are fast, but all other jobs are slow. There is also a popup login window issue on the homepage, so if you need to create a new account, please do so and visit the activation link on your phone.
The second server is the one with two instances (more information below).
Currently, nothing can be done on this server. There is no popup login window on the homepage, and account creation is normal. Anything except PACO training should be able to run on this server now. It should function the same as rodan2.simssa.ca, but everything is significantly faster.

I tested today with two small vGPU instances and confirmed that it is possible to do something like "distributed computing" with Docker Swarm, so that we put containers on different instances. Since we do not know when we will get a free spot on Arbutus, I suggest we deploy production Rodan on two smaller instances instead of a single larger one.
Option 1: a single g1-16gb-c8-40gb, which has 8 vCPUs and 40GB RAM with a vGPU of 16GB RAM.
Option 2: two smaller instances, e.g. g1-8gb-c4-22gb and p16-16gb, so that all containers except GPU-celery run with 16 vCPUs and 16GB RAM, and the GPU-celery container has 22GB RAM and 4 vCPUs to share with iipsrv. Although this vGPU has 8GB instead of 16GB of RAM, it still seems to perform better than the current staging GPU.

With a single g1-8gb-c4-22gb for everything, GPU jobs are fast but non-GPU jobs are really slow because we don't have enough vCPUs.

Option 1 needs 8 vCPUs and 40GB RAM, while option 2 needs 20 vCPUs and 38GB RAM (16 + 4 vCPUs and 16GB + 22GB of RAM). Since we are mainly running short of RAM after the extension, option 2 actually gives us more vCPUs to improve the performance of the other non-GPU containers. We can also use the remaining 2GB of RAM to have a separate instance for data storage.
Based on my experiments, we only need to deploy (and later update) the stack on the instance with the manager node, and Docker Swarm will handle the rest, as long as we correctly join the network and label the worker node.
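Roughly, that setup looks like the following sketch; the label, stack name, compose file path, and addresses are placeholders for illustration, not our exact configuration:

```bash
# On the manager (non-GPU) instance: initialize the swarm.
docker swarm init --advertise-addr <manager-private-ip>
# The command above prints a "docker swarm join --token ..." line; run that
# line on the GPU worker instance to join it to the swarm.

# Back on the manager: label the GPU worker so services can be pinned to it.
docker node ls
docker node update --label-add gpu=true <gpu-worker-node-name>

# Deploy (and later update) the whole stack from the manager only; Swarm
# schedules the GPU-celery service onto the worker if its compose entry has a
# placement constraint such as: node.labels.gpu == true
docker stack deploy -c docker-compose.yml rodan
```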
The only trouble I have encountered so far is that the worker instance for the GPU container needs to access data stored on the non-GPU instance. But I believe this is possible; one common practice is using NFS, which I will try this week and report back.
Update on Jun 27, 2024
We already have one g1-16gb-c8-40gb vGPU instance, but with the wrong OS (after many trials, it turns out Ubuntu 22.04 has broken DNS resolution in Docker Swarm, so we always get a redis timeout error). Since it will be extremely difficult to launch a new vGPU instance of this flavor and it was purely luck that enabled us to launch this one, I was hoping to make use of this server. However, it cannot work as a worker node, nor can it be rebuilt with the correct OS. As a result, we will just delete this instance because it turns out to be almost useless for us.
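For the record, a quick way to see the problem is to check service-name resolution from inside a container on the affected node; the container and service names here are just examples:

```bash
# On the affected instance, exec into a running container of the stack and try
# to resolve another service of the stack (e.g. redis) by name.
docker exec -it <rodan-main-container-id> getent hosts redis
# On a healthy overlay network this returns the service's virtual IP; on the
# broken Ubuntu 22.04 instance the lookup fails or hangs, which matches the
# redis timeout errors described above.
```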
Current plan:
In case we need a second server for classifying etc. while staging is training, I will revert the current production so that rodan2 can be used, do everything from scratch with new instances, and port it somewhere else.
Notes for picking the manager instance (without GPU):
We want more vCPUs so that all the other non-GPU jobs run smoothly. According to Arbutus, instance flavors are named by type, vCPU count, and RAM size (for example, c8-30gb-288 is a "c" flavor with 8 vCPUs and 30GB RAM).
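If it helps, the available flavors and their sizes can be listed with the OpenStack CLI (assuming the CLI is configured against our Arbutus project):

```bash
# List all flavors visible to the project to compare vCPU/RAM combinations.
openstack flavor list
# Inspect one candidate in detail (flavor name taken from the discussion below).
openstack flavor show p16-32gb
```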
However, "c" instances are expensive in RAM. If we go with one g1-8gb-c4-22gb for the GPU worker instance, we have around 40GB left, and among the "c" flavors we can only afford c8-30gb-288. However, we can still get p16 with RAM options of 16, 24, and 32 GB. I will try p16-32gb first, because we do not want to leave extra resources unused and risk Compute Canada downgrading us for the next year.

We now have rodan2 prod back for all tasks except PACO training with GPU. Distributed prod with one "p" instance and one vGPU instance still needs more testing before NFS can be deployed. There is a "broken pipe" for rodan-main with the "p" instance now.
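To narrow down the broken pipe, the first thing to check from the manager is the service state and logs; the service name below assumes the stack is deployed as rodan and is only an example:

```bash
# Check whether the rodan-main replicas are running or stuck restarting.
docker service ls
docker service ps rodan_rodan-main --no-trunc
# Tail the service logs for the actual broken-pipe traceback.
docker service logs --tail 100 rodan_rodan-main
```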