In this test, I wanted to measure the minimum RAM requirements for running Llama 3.1 405B (Q40) on 4 devices. All nodes were started with --nthreads 4, even though the root VM had more cores (8). So in total this test used only 16 cores (AMD EPYC 7B13).
It seems 4 x 64 GB of RAM should be enough, but the root node needs about 4 GB or more of disk swap to initialize the cluster. I couldn't run the cluster with 64 GB RAM and 0 MB swap.
Root (n2d-highmem-16, 8 cores, 128 GB RAM):
b4rtaz@instance-x:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7B13
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 0
BogoMIPS: 4899.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat p
se36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1g
b rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apici
d tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2
apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_
legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext
invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi
1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_n
i xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip
vaes vpclmulqdq rdpid fsrm
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 4 MiB (8 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitizatio
n
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditiona
l; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
Workers (3 x n2d-highmem-8, 4 cores, 64 GB RAM):
b4rtaz@instance-20240731-194352:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7B13
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
BogoMIPS: 4899.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
ca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall
nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep
_good nopl nonstop_tsc cpuid extd_apicid tsc_known_fre
q pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2ap
ic movbe popcnt aes xsave avx f16c rdrand hypervisor l
ahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dn
owprefetch osvw topoext invpcid_single ssbd ibrs ibpb
stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2
erms invpcid rdseed adx smap clflushopt clwb sha_ni xs
aveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_
save umip vaes vpclmulqdq rdpid fsrm
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prct
l
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointe
r sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STI
BP conditional; RSB filling; PBRSB-eIBRS Not affected;
BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
Logs
./dllama inference \
--model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
--tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
--buffer-float-type q80 \
--prompt "Hello world" \
--steps 64 \
--nthreads 4 \
--workers 10.186.0.5:9999 10.186.0.6:9999 10.186.0.9:9999 --max-seq-len 2048
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 16384
💡 hiddenDim: 53248
💡 nLayers: 126
💡 nHeads: 128
💡 nKvHeads: 16
💡 vocabSize: 128256
💡 origSeqLen: 131072
💡 seqLen: 2048
💡 nSlices: 4
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128009
📄 chatEosId: 128009
...
⏩ Loaded 232332352 kB
🔶 G 4172 ms I 3854 ms T 318 ms S 3650100 kB R 12852 kB Hello
🔶 G 4229 ms I 3847 ms T 382 ms S 12852 kB R 12852 kB world
🔶 G 4285 ms I 3845 ms T 439 ms S 12852 kB R 12852 kB !
🔶 G 4289 ms I 3852 ms T 436 ms S 12852 kB R 12852 kB This
🔶 G 4285 ms I 3852 ms T 432 ms S 12852 kB R 12852 kB is
🔶 G 4268 ms I 3853 ms T 414 ms S 12852 kB R 12852 kB a
🔶 G 4276 ms I 3850 ms T 425 ms S 12852 kB R 12852 kB test
🔶 G 4282 ms I 3842 ms T 439 ms S 12852 kB R 12852 kB page
...
🔶 G 4285 ms I 3876 ms T 408 ms S 12852 kB R 12852 kB Final
Generated tokens: 64
Avg tokens / second: 0.23
Avg generation time: 4289.97 ms
Avg inference time: 3853.70 ms
Avg transfer time: 435.47 ms
b4rtaz@instance-20240731-201735:~/distributed-llama$ ./dllama worker --port 9999 --nthreads 4
Listening on 0.0.0.0:9999...
💡 sliceIndex: 3
💡 nSlices: 4
...
⏩ Received 442368 kB for block 119 (57632 kB/s)
⏩ Received 442368 kB for block 120 (57742 kB/s)
⏩ Received 442368 kB for block 121 (54302 kB/s)
⏩ Received 442368 kB for block 122 (53709 kB/s)
⏩ Received 442368 kB for block 123 (53242 kB/s)
⏩ Received 442368 kB for block 124 (53997 kB/s)
⏩ Received 442368 kB for block 125 (55683 kB/s)
🚁 Socket is in non-blocking mode
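As a rough cross-check of the numbers in these logs, the short Python sketch below recomputes the throughput from the average generation time and estimates how much data a single worker received. It assumes that each of the 126 blocks is 442368 kB, like the blocks shown (119 to 125); the earlier blocks are elided in the log, so treat the total as an estimate. The variable names are mine and are not part of dllama.

```python
# Values copied from the logs above.
avg_generation_ms = 4289.97
avg_transfer_ms = 435.47
n_layers = 126        # nLayers reported by the root node
block_kb = 442368     # size of each block shown in the worker log

# Throughput is the inverse of the average time per token.
tokens_per_second = 1000.0 / avg_generation_ms

# Share of each token's generation time spent on network transfer.
transfer_share = avg_transfer_ms / avg_generation_ms

# Assumption: all blocks have the same size as blocks 119-125.
worker_received_kb = n_layers * block_kb

print(f"{tokens_per_second:.2f} tokens/s")
print(f"{transfer_share:.1%} of generation time is transfer")
print(f"{worker_received_kb} kB received per worker")
```

Under that assumption a worker receives 55738368 kB, which is roughly 56 to 57 GB depending on whether kB means 1000 or 1024 bytes. That fits the observation that 64 GB per node is close to the minimum.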
Costs
n2d-highmem-8 is $0.37/h in us-central1. This test achieved 0.23 t/s, so for 4 x n2d-highmem-8 it gives: ((1000000/0.23) / (60 * 60)) * (0.37 * 4) = $1787.44 / 1M tokens.
It would be amazing to compare the cost of the cloud with a cheap home setup with similar hardware.
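To make that comparison easier to repeat, here is a minimal Python sketch of the cost formula above. The function name and parameters are hypothetical, chosen only for illustration. It assumes a steady token rate and a flat hourly price per node.

```python
def cost_per_million_tokens(tokens_per_second, price_per_hour, n_instances):
    """Return the USD cost of generating 1M tokens at a steady rate."""
    # Hours needed to generate one million tokens.
    hours = (1_000_000 / tokens_per_second) / (60 * 60)
    # Every instance is billed for the whole duration.
    return hours * price_per_hour * n_instances


# Numbers from this test: 0.23 t/s on 4 nodes at $0.37/h each.
cloud_cost = cost_per_million_tokens(0.23, 0.37, 4)
print(f"${cloud_cost:.2f} / 1M tokens")  # $1787.44 / 1M tokens
```

For a home setup, price_per_hour could be replaced with an estimated hourly cost per machine (electricity plus amortized hardware). I have not measured those values here.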