
feat: mesh topology, distributed all layers. #136

Merged · 30 commits · Nov 21, 2024
Conversation

@b4rtaz (Owner) commented Nov 13, 2024

Changes:

  • Tensor parallelism is now implemented for all layers! 🚀
  • Updated node synchronization: Previously, Distributed Llama used a master-slave architecture. This PR introduces a mesh topology for improved scalability and performance.
  • Support for Grok1 has been temporarily dropped.
  • Removed support for the --kv-cache-storage argument.
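The topology change can be pictured as a link-count difference. A minimal sketch (illustrative only; star_links and mesh_links are hypothetical helpers, not part of the Distributed Llama codebase): in the old master-slave (star) layout every worker synchronizes through the root node, while in a mesh every pair of nodes can exchange partial results directly.

```python
# Illustrative sketch only, not the actual C++ implementation:
# count the communication links in each topology for n nodes.

def star_links(n):
    # master-slave (star): every worker talks only to the root node 0
    return [(0, i) for i in range(1, n)]

def mesh_links(n):
    # mesh: every pair of nodes can exchange partial results directly
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

print(len(star_links(4)))  # 3 links, root is the bottleneck
print(len(mesh_links(4)))  # 6 links, no single hub
```

With a star, each synchronization step serializes through the root; with a mesh, nodes exchange their slices in parallel, which is where the latency win in the measurements comes from.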

TODO:

  • Mixtral does not work yet.

Measurements

2 x Raspberry Pi 5 8GB + TP-Link LS1008G Switch

0.10.6

🔶 G  536 ms I  473 ms T   58 ms S    272 kB R    272 kB  the
🔶 G  510 ms I  445 ms T   64 ms S    272 kB R    272 kB  world
🔶 G  503 ms I  456 ms T   45 ms S    272 kB R    272 kB  better
🔶 G  517 ms I  445 ms T   71 ms S    272 kB R    272 kB .
🔶 G  573 ms I  521 ms T   51 ms S    272 kB R    272 kB  
🔶 G  560 ms I  446 ms T  111 ms S    272 kB R    272 kB ##
🔶 G  520 ms I  466 ms T   49 ms S    272 kB R    272 kB  Education
Generated tokens:    64
Avg tokens / second: 2.07
Avg generation time: 483.55 ms
Avg inference time:  419.66 ms
Avg transfer time:   61.56 ms
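For readers skimming the raw logs: each line appears to report per-token stats. The column meanings below are my reading, not documented in the PR itself (G = total generation time, I = inference time, T = transfer time, S/R = kilobytes sent/received per token). A small sketch parsing one line under that assumed format:

```python
import re

# Assumed column meanings (not documented in the PR itself):
# G = total generation time, I = inference time, T = transfer time,
# S/R = kilobytes sent/received per token.
LINE = "🔶 G  536 ms I  473 ms T   58 ms S    272 kB R    272 kB  the"

m = re.search(
    r"G\s+(\d+) ms I\s+(\d+) ms T\s+(\d+) ms S\s+(\d+) kB R\s+(\d+) kB\s+(.*)",
    LINE,
)
g, i, t, s, r = (int(x) for x in m.groups()[:5])
token = m.group(6)
print(g, i, t, s, r, repr(token))  # 536 473 58 272 272 'the'
```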

This PR

./dllama inference --model models/llama3_8b_q40/dllama_model_llama3_8b_q40.m --tokenizer models/llama3_8b_q40/dllama_tokenizer_llama3_8b_q40.t --buffer-float-type q80 --nthreads 4 --max-seq-len 1000 --prompt "Hello world" --steps 64 --workers 10.0.0.2:9999
🔶 G  358 ms I  277 ms T   79 ms S    555 kB R    538 kB  it
🔶 G  336 ms I  269 ms T   66 ms S    555 kB R    538 kB .
🔶 G  342 ms I  276 ms T   63 ms S    555 kB R    538 kB  

🔶 G  346 ms I  263 ms T   81 ms S    555 kB R    538 kB I
🔶 G  340 ms I  265 ms T   73 ms S    555 kB R    538 kB  want
🔶 G  327 ms I  280 ms T   45 ms S    555 kB R    538 kB  to
🔶 G  279 ms I  257 ms T   20 ms S    555 kB R    538 kB  be
🔶 G  307 ms I  277 ms T   29 ms S    555 kB R    538 kB  the
🔶 G  314 ms I  275 ms T   37 ms S    555 kB R    538 kB  best
Generated tokens:    64
Avg tokens / second: 3.36
Avg generation time: 297.98 ms
Avg inference time:  245.09 ms
Avg transfer time:   49.55 ms

62% faster. 🤯
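The headline figure comes straight from the average tokens-per-second numbers above (tokens/second is just 1000 divided by the average generation time in ms):

```python
old_tps = 2.07   # 0.10.6, avg generation time 483.55 ms per token
new_tps = 3.36   # this PR, avg generation time 297.98 ms per token

# sanity check: tokens/second ≈ 1000 / avg generation time (ms)
assert abs(1000 / 483.55 - old_tps) < 0.01
assert abs(1000 / 297.98 - new_tps) < 0.01

increase = (new_tps - old_tps) / old_tps * 100
print(f"{increase:.1f}% faster")  # prints "62.3% faster"
```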

@b4rtaz closed this Nov 15, 2024
@b4rtaz reopened this Nov 17, 2024
@b4rtaz marked this pull request as ready for review Nov 18, 2024, 21:22
@b4rtaz changed the title from "feat: graph network." to "feat: graph network, distributed all layers." Nov 18, 2024
@b4rtaz changed the title from "feat: graph network, distributed all layers." to "feat: mesh topology, distributed all layers." Nov 18, 2024
@b4rtaz (Owner, Author) commented Nov 21, 2024

Llama 3.2 1B Q40, 4 x Raspberry Pi 5 8GB

./dllama inference --steps 64 --prompt "Hello world" --model models/llama3_2_1b_instruct_q40/dllama_model_llama3_2_1b_instruct_q40.m --tokenizer models/llama3_2_1b_instruct_q40/dllama_tokenizer_llama3_2_1b_instruct_q40.t --buffer-float-type q80 --nthreads 4 --temperature 0 --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998

0.10.6

🔶 G  105 ms I   85 ms T   19 ms S    204 kB R    204 kB .
Generated tokens:    64
Avg tokens / second: 9.90
Avg generation time: 101.03 ms
Avg inference time:  84.17 ms
Avg transfer time:   16.73 ms

0.11.0 (this PR)

🔶 G   54 ms I   20 ms T   34 ms S    228 kB R    579 kB  constantly
🔶 G   48 ms I   19 ms T   29 ms S    228 kB R    579 kB  learning
Generated tokens:    64
Avg tokens / second: 21.42
Avg generation time: 46.69 ms
Avg inference time:  18.81 ms
Avg transfer time:   27.59 ms

🤯 116.4% increase

Llama 3.2 3B Q40, 4 x Raspberry Pi 5 8GB

./dllama inference --steps 64 --prompt "Hello world" --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t --buffer-float-type q80 --nthreads 4 --max-seq-len 8192 --workers 10.0.0.1:9998 10.0.0.2:9998 10.0.0.4:9998

0.10.6

🔶 G  223 ms I  207 ms T   15 ms S    535 kB R    535 kB <|start_header_id|>
Generated tokens:    64
Avg tokens / second: 3.47
Avg generation time: 287.98 ms
Avg inference time:  262.70 ms
Avg transfer time:   23.91 ms

0.11.0 (this PR)

🔶 G  118 ms I   90 ms T   26 ms S    571 kB R    911 kB ###
🔶 G  114 ms I   99 ms T   12 ms S    571 kB R    911 kB  Running
Generated tokens:    64
Avg tokens / second: 9.01
Avg generation time: 110.97 ms
Avg inference time:  79.19 ms
Avg transfer time:   29.59 ms

🤯 159.7% increase

Llama 3 8B Q40, 4 x Raspberry Pi 5 8GB

./dllama inference --steps 64 --prompt "Hello world" --model models/llama3_8b_q40/dllama_model_llama3_8b_q40.m --tokenizer models/llama3_8b_q40/dllama_tokenizer_llama3_8b_q40.t --buffer-float-type q80 --nthreads 4 --temperature 0 --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998

0.10.6

🔶 G  376 ms I  314 ms T   62 ms S    816 kB R    816 kB  interested
🔶 G  425 ms I  328 ms T   96 ms S    816 kB R    816 kB  in
Generated tokens:    64
Avg tokens / second: 2.83
Avg generation time: 353.73 ms
Avg inference time:  279.80 ms
Avg transfer time:   72.16 ms

0.11.0 (this PR)

🔶 G  229 ms I  122 ms T  106 ms S    864 kB R   1191 kB  in
Generated tokens:    64
Avg tokens / second: 4.67
Avg generation time: 214.31 ms
Avg inference time:  117.12 ms
Avg transfer time:   96.03 ms

🤯 65.0% increase

@b4rtaz (Owner, Author) commented Nov 21, 2024

Llama 3.2 3B Q40, 2 x Raspberry Pi 5 8GB

./dllama inference --steps 64 --prompt "Hello world" --model models/llama3_2_3b_instruct_q40/dllama_model_llama3_2_3b_instruct_q40.m --tokenizer models/llama3_2_3b_instruct_q40/dllama_tokenizer_llama3_2_3b_instruct_q40.t --buffer-float-type q80 --nthreads 4 --max-seq-len 8192 --workers 10.0.0.1:9998

0.10.6

🔶 G  345 ms I  328 ms T   16 ms S    178 kB R    178 kB  model
🔶 G  335 ms I  318 ms T   14 ms S    178 kB R    178 kB  to
Generated tokens:    64
Avg tokens / second: 3.24
Avg generation time: 309.11 ms
Avg inference time:  297.36 ms
Avg transfer time:   10.03 ms

0.11.0 (this PR)

🔶 G  159 ms I  147 ms T   11 ms S    190 kB R    429 kB License
Generated tokens:    64
Avg tokens / second: 6.80
Avg generation time: 147.11 ms
Avg inference time:  135.39 ms
Avg transfer time:   9.55 ms

109.9% increase

Llama 3.2 1B Q40, 2 x Raspberry Pi 5 8GB

./dllama inference --steps 64 --prompt "Hello world" --model models/llama3_2_1b_instruct_q40/dllama_model_llama3_2_1b_instruct_q40.m --tokenizer models/llama3_2_1b_instruct_q40/dllama_tokenizer_llama3_2_1b_instruct_q40.t --buffer-float-type q80 --nthreads 4 --temperature 0 --workers 10.0.0.3:9998

0.10.6

🔶 G  134 ms I   97 ms T   36 ms S     68 kB R     68 kB  constantly
🔶 G  104 ms I   97 ms T    7 ms S     68 kB R     68 kB  learning
Generated tokens:    64
Avg tokens / second: 8.44
Avg generation time: 118.45 ms
Avg inference time:  97.36 ms
Avg transfer time:   20.88 ms

0.11.0 (this PR)

🔶 G   48 ms I   36 ms T   12 ms S     76 kB R    318 kB  also
🔶 G   81 ms I   36 ms T   44 ms S     76 kB R    318 kB  generate
Generated tokens:    64
Avg tokens / second: 15.31
Avg generation time: 65.31 ms
Avg inference time:  36.58 ms
Avg transfer time:   28.61 ms

81.4% increase

Llama 3 8B Q40, 2 x Raspberry Pi 5 8GB

./dllama inference --steps 64 --prompt "Hello world" --model models/llama3_8b_q40/dllama_model_llama3_8b_q40.m --tokenizer models/llama3_8b_q40/dllama_tokenizer_llama3_8b_q40.t --buffer-float-type q80 --nthreads 4 --temperature 0 --workers 10.0.0.3:9998

0.10.6

🔶 G  589 ms I  512 ms T   77 ms S    272 kB R    272 kB  interested
🔶 G  569 ms I  470 ms T   99 ms S    272 kB R    272 kB  in
Generated tokens:    64
Avg tokens / second: 2.02
Avg generation time: 495.02 ms
Avg inference time:  412.02 ms
Avg transfer time:   81.03 ms

0.11.0 (this PR)

🔶 G  325 ms I  319 ms T    5 ms S    288 kB R    522 kB  in
Generated tokens:    64
Avg tokens / second: 3.44
Avg generation time: 290.48 ms
Avg inference time:  259.00 ms
Avg transfer time:   30.39 ms

70.3% increase

@b4rtaz b4rtaz merged commit 8b1cf89 into main Nov 21, 2024
3 checks passed