llama inference with tensor parallelism #1403

stefpi · 2026-01-10T23:24:04Z

Related to ml-explore/mlx#2973.

This makes llama.py use tensor parallelism when it is run with mlx.launch on 2 or more devices.
It also adds a class_predicate to the quantize call, because quantization wasn’t working properly with quantized MLX models from Hugging Face (HF).
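For readers new to the idea, here is a minimal, framework-free sketch of how tensor parallelism can split a Llama-style feed-forward block across devices. It is not the code in this PR: the helper names (`shard_columns`, `shard_rows`, `tensor_parallel_ffn`) are hypothetical, the "devices" are simulated with a loop, and weights are stored as (in, out) lists for readability, which may differ from the layout a real framework uses. The gate and up projections are split by output column, the down projection is split by input row, and the per-device partial outputs are summed, which stands in for an all-reduce.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    """SiLU activation, applied to a single float."""
    return v / (1.0 + math.exp(-v))


def ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    hidden = [[silu(g) * u for g, u in zip(g_row, u_row)]
              for g_row, u_row in zip(gate, up)]
    return matmul(hidden, w_down)


def shard_columns(w, rank, world_size):
    """Keep only this rank's slice of the output columns (hypothetical helper)."""
    step = len(w[0]) // world_size
    return [row[rank * step:(rank + 1) * step] for row in w]


def shard_rows(w, rank, world_size):
    """Keep only this rank's slice of the input rows (hypothetical helper)."""
    step = len(w) // world_size
    return w[rank * step:(rank + 1) * step]


def tensor_parallel_ffn(x, w_gate, w_up, w_down, world_size):
    """Simulate world_size devices, each holding one shard of every weight."""
    if len(w_down) % world_size != 0:
        raise ValueError("hidden size must be divisible by world_size")
    partials = [
        ffn(
            x,
            shard_columns(w_gate, rank, world_size),
            shard_columns(w_up, rank, world_size),
            shard_rows(w_down, rank, world_size),
        )
        for rank in range(world_size)
    ]
    # Element-wise sum of the partial outputs; on real hardware this
    # step would be an all-reduce across devices.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


if __name__ == "__main__":
    rng = random.Random(0)
    rand = lambda r, c: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    x, w_gate, w_up, w_down = rand(2, 4), rand(4, 8), rand(4, 8), rand(8, 4)
    full = ffn(x, w_gate, w_up, w_down)
    sharded = tensor_parallel_ffn(x, w_gate, w_up, w_down, world_size=2)
    print(all(math.isclose(a, b, abs_tol=1e-9)
              for fr, sr in zip(full, sharded) for a, b in zip(fr, sr)))
```

Because the activation is element-wise, each simulated device can apply it to its own slice of the hidden dimension without communication. Only the final sum needs data from every device.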

stefpi added 4 commits January 9, 2026 15:36

llama tp

6499e50

ffn fix

7e4a1b8

add support for TP in llama inference

9051e9d

cleanup

55f0e5c

stefpi mentioned this pull request Jan 10, 2026

[Docs] Simple example of using MLX distributed ml-explore/mlx#2973 (Open, 6 tasks)

stefpi added 2 commits January 11, 2026 13:22

pre-commit formatting

fb16039

import shard_linear

704bab7

stefpi marked this pull request as ready for review January 15, 2026 18:09

stefpi changed the title ~~[WIP] llama inference with tensor parallelism~~ llama inference with tensor parallelism Jan 15, 2026
