[Root] Ryzen 3600, [Worker1] Ryzen 3600, [Network] 1 Gigabit #41
-
I was really surprised that 8 Raspberry Pi 4B's were faster than my two Ryzen 3600's.
-
Hello @DifferentialityDevelopment!
This was a bug; I fixed it here.
This is "expected". The application utilizes the CPU fully, so if you set the thread count higher than the number of logical processors, the threads start competing for CPU time and throughput drops.
AVX is not fully supported yet (only the Q40 × Q80 matmul), but maybe this problem requires some investigation.
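On the thread-count point: a minimal sketch (not the project's actual code, just an illustration assuming a CPU-bound worker pool) of why capping the requested thread count at the number of logical processors avoids the oversubscription that tanked the tokens/s reported below.

```cpp
// Sketch only: clamp a user-supplied thread count to the logical core count.
// Oversubscribing a CPU-bound worker pool forces the OS to time-slice the threads,
// which is why tokens/s dropped once the thread count exceeded the 12 logical processors.
#include <algorithm>
#include <cstdio>
#include <thread>

unsigned pickThreadCount(unsigned requested) {
    unsigned logical = std::thread::hardware_concurrency(); // 12 on a Ryzen 3600
    if (logical == 0) logical = 1;                          // value may be unavailable
    return std::min(requested, logical);
}

int main() {
    std::printf("using %u threads\n", pickThreadCount(24)); // prints 12 on a 3600
    return 0;
}
```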
-
@DifferentialityDevelopment what kind of buffer have you used in this test?
-
With the latest commit I went up to 2.6 tokens per second from 2.15 🔥

🔶 G 380 ms I 207 ms T 172 ms S 1917438 kB R 442 kB Hello
-
Maybe the bottleneck is not the switch, but the root node instead?
-
If I run it with sudo nice -n 20 iperf3 -c 192.168.1.3 -p 9990:

Connecting to host 192.168.1.3, port 9990
[ ID] Interval Transfer Bitrate Retr
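To sanity-check whether the gigabit link itself can explain the T (transfer) time in the log above, a rough back-of-the-envelope helper may be useful. This is only a sketch; the 442 kB example value is taken from the log line on the assumption that R is per-token received data, which is a guess, and real transfers add TCP/switch latency and synchronization waits on top of the theoretical floor.

```cpp
// Rough estimate of the minimum time needed to move a payload over a link.
// Purely illustrative: ignores protocol overhead, latency, and synchronization.
#include <cstdio>

double transferMs(double kilobytes, double linkMbps) {
    double megabits = kilobytes * 8.0 / 1000.0;  // kB -> Mbit
    return megabits / linkMbps * 1000.0;         // seconds -> milliseconds
}

int main() {
    // Example: 442 kB over an ideal 1 Gbit/s link.
    std::printf("%.2f ms\n", transferMs(442.0, 1000.0)); // ~3.5 ms in theory
    return 0;
}
```

If the theoretical floor is that far below the observed T value, the remaining time is more likely spent in synchronization and on the root node than in raw switch bandwidth, which fits the question raised above.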
-
Distributed Llama Version: 0.3.1
Model: Llama 3 8B Q40 (huggingface)
Switch: Tenda Gigabit Switch (Generic)
A couple of things I discovered during my test. Similar to how the number of workers must be a power of 2 (i.e. 2, 4, 8, etc.), the thread count is subject to the same kind of condition: if you specify an odd number of threads it will crash with:
Assertion `size % nThreads == 0' failed.
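That assertion makes sense if each thread is given an equal, contiguous slice of a tensor dimension. A minimal sketch of such a splitting scheme (not the project's actual code) shows why a thread count that doesn't divide the size has to be rejected:

```cpp
// Sketch of an even work split across threads; the real matmul code is more involved,
// but the divisibility requirement is the same: every thread must get a slice of
// identical length, so `size % nThreads` has to be 0.
#include <cassert>
#include <cstdio>

void splitRows(int size, int nThreads) {
    assert(size % nThreads == 0);            // same condition as the failed assertion
    int slice = size / nThreads;
    for (int t = 0; t < nThreads; t++) {
        std::printf("thread %d handles rows [%d, %d)\n", t, t * slice, (t + 1) * slice);
    }
}

int main() {
    splitRows(4096, 4);   // fine: 4096 / 4 = 1024 rows per thread
    // splitRows(4096, 3); // would abort: 4096 is not divisible by 3
    return 0;
}
```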
Another thing I noticed that's counter-intuitive: while increasing the thread count gave a noticeable improvement in tokens/s, the speed would tank if the number of threads exceeded the number of logical processors available to the machine.
It maxed out the CPU usage but the tokens/s tanked really hard, which I found very weird; I let it run anyway just to post the results.
For reference, both machines have a Ryzen 3600 with 6 cores and 12 logical processors.
As a first step I got it to run locally, and I was surprised to find that the inference speed was so similar to the Raspberry Pi 4B's, despite the hardware being so much faster.