n2d-highmem-16 + 3 x n2d-highmem-8, Google Cloud [Llama 3.1 405B Q40] #113

b4rtaz · 2024-07-31T20:49:38Z

b4rtaz
Jul 31, 2024
Maintainer

Metric	n2d-highmem-16 + 3 x n2d-highmem-8
Tokens / s	0.23
Avg generation time	4289.97 ms
Avg inference time	3853.70 ms
Avg transfer time	435.47 ms
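
As a quick sanity check on the table, the throughput can be recomputed from the averages. This is a minimal sketch that assumes "Avg generation time" is the wall-clock time per generated token; the variable names are mine, not part of dllama.

```python
# Averages copied from the table above, in milliseconds.
avg_generation_ms = 4289.97
avg_inference_ms = 3853.70
avg_transfer_ms = 435.47

# Assumption: generation time is the per-token wall-clock time,
# so throughput is simply its reciprocal.
tokens_per_second = 1000.0 / avg_generation_ms

# Fraction of each token's time spent on network transfer.
transfer_share = avg_transfer_ms / avg_generation_ms

print(f"{tokens_per_second:.2f} tokens/s")
print(f"{transfer_share:.1%} of generation time is transfer")
```

Under that assumption the result matches the reported 0.23 tokens/s, and transfer accounts for roughly a tenth of the time per token, so compute dominates in this setup.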

In this test, I wanted to measure the minimum RAM requirements for 4 devices. All nodes were started with --nthreads 4, even though the root VM had more cores (8), so the test used only 16 cores in total (AMD EPYC 7B13).

It seems 4 x 64 GB of RAM should be enough, but the root node must have ~4 GB or more of disk swap to initialize the cluster. I couldn't run the cluster with 64 GB RAM and 0 MB swap on the root.
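
A rough estimate of the per-node weight size can be derived from the numbers in the logs further down. This is only a sketch: it assumes every node holds an equal slice of each of the 126 blocks and that everything else stays on the root, which is my guess and not dllama's documented memory accounting. It also leaves the "kB" unit exactly as the tool prints it.

```python
# Values copied from the logs in this post.
loaded_kb = 232_332_352        # "Loaded 232332352 kB" on the root
block_slice_kb = 442_368       # "Received 442368 kB for block N" on a worker
n_layers = 126                 # nLayers
n_slices = 4                   # nSlices

# Block weights held by one node.
blocks_per_node_kb = block_slice_kb * n_layers

# Assumption: whatever is not part of the equal block slices
# (embeddings, output layer, etc.) stays on the root node.
remainder_kb = loaded_kb - blocks_per_node_kb * n_slices
root_estimate_kb = blocks_per_node_kb + remainder_kb

print(f"per worker: {blocks_per_node_kb} kB")
print(f"root (estimate): {root_estimate_kb} kB")
```

If the assumption holds, the root carries noticeably more than a worker, which may be consistent with the root needing some swap on a 64 GB machine while the workers do not.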

Root (n2d-highmem-16, 8 cores, 128 GB RAM):

b4rtaz@instance-x:~$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 7B13
    CPU family:           25
    Model:                1
    Thread(s) per core:   2
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             0
    BogoMIPS:             4899.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat p
                          se36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1g
                          b rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apici
                          d tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2
                          apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_
                          legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext 
                          invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi
                          1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_n
                          i xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip
                           vaes vpclmulqdq rdpid fsrm
Virtualization features:  
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):      
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     4 MiB (8 instances)
  L3:                     32 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-15
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Mitigation; safe RET
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitizatio
                          n
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditiona
                          l; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Workers (3 x n2d-highmem-8, 4 cores, 64 GB RAM):

b4rtaz@instance-20240731-194352:~$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 7B13
    CPU family:           25
    Model:                1
    Thread(s) per core:   2
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             0
    BogoMIPS:             4899.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall
                           nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep
                          _good nopl nonstop_tsc cpuid extd_apicid tsc_known_fre
                          q pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2ap
                          ic movbe popcnt aes xsave avx f16c rdrand hypervisor l
                          ahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dn
                          owprefetch osvw topoext invpcid_single ssbd ibrs ibpb 
                          stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 
                          erms invpcid rdseed adx smap clflushopt clwb sha_ni xs
                          aveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_
                          save umip vaes vpclmulqdq rdpid fsrm
Virtualization features:  
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     2 MiB (4 instances)
  L3:                     32 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Mitigation; safe RET
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prct
                          l
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointe
                          r sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STI
                          BP conditional; RSB filling; PBRSB-eIBRS Not affected;
                           BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Logs

./dllama inference \
   --model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
   --tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
   --buffer-float-type q80 \
   --prompt "Hello world" \
   --steps 64 \
   --nthreads 4 \
   --workers 10.186.0.5:9999 10.186.0.6:9999 10.186.0.9:9999 --max-seq-len 2048
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 16384
💡 hiddenDim: 53248
💡 nLayers: 126
💡 nHeads: 128
💡 nKvHeads: 16
💡 vocabSize: 128256
💡 origSeqLen: 131072
💡 seqLen: 2048
💡 nSlices: 4
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128009
📄 chatEosId: 128009
...
⏩ Loaded 232332352 kB
🔶 G 4172 ms I 3854 ms T  318 ms S 3650100 kB R  12852 kB Hello
🔶 G 4229 ms I 3847 ms T  382 ms S  12852 kB R  12852 kB  world
🔶 G 4285 ms I 3845 ms T  439 ms S  12852 kB R  12852 kB !
🔶 G 4289 ms I 3852 ms T  436 ms S  12852 kB R  12852 kB  This
🔶 G 4285 ms I 3852 ms T  432 ms S  12852 kB R  12852 kB  is
🔶 G 4268 ms I 3853 ms T  414 ms S  12852 kB R  12852 kB  a
🔶 G 4276 ms I 3850 ms T  425 ms S  12852 kB R  12852 kB  test
🔶 G 4282 ms I 3842 ms T  439 ms S  12852 kB R  12852 kB  page
...
🔶 G 4285 ms I 3876 ms T  408 ms S  12852 kB R  12852 kB  Final
Generated tokens:    64
Avg tokens / second: 0.23
Avg generation time: 4289.97 ms
Avg inference time:  3853.70 ms
Avg transfer time:   435.47 ms

b4rtaz@instance-20240731-201735:~/distributed-llama$ ./dllama worker --port 9999 --nthreads 4
Listening on 0.0.0.0:9999...
💡 sliceIndex: 3
💡 nSlices: 4
...
⏩ Received 442368 kB for block 119 (57632 kB/s)
⏩ Received 442368 kB for block 120 (57742 kB/s)
⏩ Received 442368 kB for block 121 (54302 kB/s)
⏩ Received 442368 kB for block 122 (53709 kB/s)
⏩ Received 442368 kB for block 123 (53242 kB/s)
⏩ Received 442368 kB for block 124 (53997 kB/s)
⏩ Received 442368 kB for block 125 (55683 kB/s)
🚁 Socket is in non-blocking mode

Costs

n2d-highmem-8 costs $0.37/h in us-central1. This test achieved 0.23 t/s, so pricing all four nodes at that rate gives: ((1000000/0.23) / (60 * 60)) * (0.37 * 4) = $1787.44 / 1M tokens.

It would be great to compare the cost of the cloud with a cheap home setup on similar hardware.
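
The same calculation as a small helper, in case someone wants to plug in other prices or speeds. The function name and parameters are hypothetical; the inputs are the ones from this test, with all four nodes priced at the n2d-highmem-8 rate.

```python
def cost_per_million_tokens(tokens_per_second, hourly_price_usd, n_nodes):
    """Return the USD cost of generating 1M tokens on n_nodes identical VMs."""
    seconds = 1_000_000 / tokens_per_second
    hours = seconds / (60 * 60)
    return hours * hourly_price_usd * n_nodes

# 0.23 t/s, $0.37/h per node, 4 nodes
cost = cost_per_million_tokens(0.23, 0.37, 4)
print(f"${cost:.2f} / 1M tokens")
```

This ignores storage, network egress, and model loading time, so treat it as a lower bound for the compute part only.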
