[Root] Ryzen 3600, [Worker1] Ryzen 3600, [Network] 1 Gigabit #41
-
I was really surprised that 8 Raspberry Pi 4B's were faster than my two Ryzen 3600's.
-
Hello @DifferentialityDevelopment!
This was a bug; I fixed it here.
This is "expected". The application utilizes the CPU fully, so if you set the thread count higher than the number of logical processors, the threads start competing for CPU time and throughput drops.
AVX is not fully supported yet (only the Q40 × Q80 matmul), but maybe this problem requires some investigation.
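On the thread-count point: a minimal sketch (not the project's actual code, just an illustration assuming a CPU-bound worker pool) of why capping the requested thread count at the number of logical processors avoids the oversubscription that tanked the tokens/s reported below.

```cpp
// Sketch only: clamp a user-supplied thread count to the logical core count.
// Oversubscribing a CPU-bound worker pool forces the OS to time-slice the threads,
// which is why tokens/s dropped once the thread count exceeded the 12 logical processors.
#include <algorithm>
#include <cstdio>
#include <thread>

unsigned pickThreadCount(unsigned requested) {
    unsigned logical = std::thread::hardware_concurrency(); // 12 on a Ryzen 3600
    if (logical == 0) logical = 1;                          // value may be unavailable
    return std::min(requested, logical);
}

int main() {
    std::printf("using %u threads\n", pickThreadCount(24)); // prints 12 on a 3600
    return 0;
}
```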
-
@DifferentialityDevelopment what kind of buffer have you used in this test?
-
With the latest commit I went up to 2.6 tokens per second from 2.15 🔥

🔶 G 380 ms I 207 ms T 172 ms S 1917438 kB R 442 kB Hello
-
Maybe the bottleneck is not the switch, but the root node instead?
-
If I run it with sudo nice -n 20 iperf3 -c 192.168.1.3 -p 9990:

Connecting to host 192.168.1.3, port 9990
[ ID] Interval Transfer Bitrate Retr
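To sanity-check whether the gigabit link itself can explain the T (transfer) time in the log above, a rough back-of-the-envelope helper may be useful. This is only a sketch; the 442 kB example value is taken from the log line on the assumption that R is per-token received data, which is a guess, and real transfers add TCP/switch latency and synchronization waits on top of the theoretical floor.

```cpp
// Rough estimate of the minimum time needed to move a payload over a link.
// Purely illustrative: ignores protocol overhead, latency, and synchronization.
#include <cstdio>

double transferMs(double kilobytes, double linkMbps) {
    double megabits = kilobytes * 8.0 / 1000.0;  // kB -> Mbit
    return megabits / linkMbps * 1000.0;         // seconds -> milliseconds
}

int main() {
    // Example: 442 kB over an ideal 1 Gbit/s link.
    std::printf("%.2f ms\n", transferMs(442.0, 1000.0)); // ~3.5 ms in theory
    return 0;
}
```

If the theoretical floor is that far below the observed T value, the remaining time is more likely spent in synchronization and on the root node than in raw switch bandwidth, which fits the question raised above.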
-
Distributed Llama Version: 0.3.1
Model: Llama 3 8B Q40 (huggingface)
Switch: Tenda Gigabit Switch (Generic)
A couple of things I discovered during my test. Similar to how the number of workers must be a power of 2 (i.e. 2, 4, 8, etc.), the thread count is subject to the same kind of condition: if you specify an odd number of threads it will crash with:
Assertion `size % nThreads == 0' failed.
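That assertion makes sense if each thread is given an equal, contiguous slice of a tensor dimension. A minimal sketch of such a splitting scheme (not the project's actual code) shows why a thread count that doesn't divide the size has to be rejected:

```cpp
// Sketch of an even work split across threads; the real matmul code is more involved,
// but the divisibility requirement is the same: every thread must get a slice of
// identical length, so `size % nThreads` has to be 0.
#include <cassert>
#include <cstdio>

void splitRows(int size, int nThreads) {
    assert(size % nThreads == 0);            // same condition as the failed assertion
    int slice = size / nThreads;
    for (int t = 0; t < nThreads; t++) {
        std::printf("thread %d handles rows [%d, %d)\n", t, t * slice, (t + 1) * slice);
    }
}

int main() {
    splitRows(4096, 4);   // fine: 4096 / 4 = 1024 rows per thread
    // splitRows(4096, 3); // would abort: 4096 is not divisible by 3
    return 0;
}
```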
Another thing I noticed that's counter-intuitive: while increasing the thread count gave a noticeable improvement in tokens/s, the speed would tank if the number of threads exceeded the number of logical processors available to the machine.
It maxed out the CPU usage but the tokens/s tanked really hard, which I found very weird; I let it run anyway just to post the results.
For reference, both machines have a Ryzen 3600 with 6 cores and 12 logical processors.
As a first step I got it to run locally, and I was surprised to find that the inference speed was so similar to the Raspberry Pi 4B's, despite the hardware being so much faster.