This project is a playground built on MLX, a great framework for tinkering with small LLMs. It's all about learning by doing, breaking things, and (sometimes) making them work again.
The philosophy is simple: keeping things as straightforward as possible, trying to minimize abstractions and dependencies, while staying true to the awesome projects that introduced or inspired the showcased methods (proper credits and licenses below).
FAMLX focuses purely on local inference - no training, adapters are disabled and quantization is untested. The generate.py file constitutes an appropriate entry point as it's where the whole process of turning a prompt into words begins. From there you should be able to track everything happening under the hood.
uv is the quickest way to start. If you haven't done it already, download uv here
git clone https://github.com/pappitti/mlx-llms.git
uv sync
To generate text with an LLM use:
uv run generate.py --prompt "Tell me a story about a dog that loves ice"
The default model is defined in constants.py. To specify a model:
uv run generate.py --model mlx-community/Qwen2.5-0.5B-bf16 --prompt "Tell me a story about a dog that loves ice"
This project leverages the work of amazing open-source projects listed below. Special thanks to them for making such valuable resources available. The different sources explain the different coding styles in different files.
I removed a lot of features that I did not need for this project but I tried not to delete anything from the original files when I simplified functions or objects. I commented out unnecessary parts with '###' so my comments can be more easily distinguished from those in the original project.
License : it is my understanding that the Apple License (MIT) is the most restrictive and therefore applies to this project. See more details about the individual parts below.
- mlx-lm : mlx-example-repo
- root files
- models/*
see license
- transformers: HuggingFace transformers
- tokenization/* :substantially all files were copied from the HF transformers library. The objective was to use PretrrainedTokenizers from transformers but not the rest since it's covered by mlx-examples. Pytorch and Tensorflow utils were removed as well as all upload utils. I tried to consolidate files when it made sense to me (initial source in the transformers library is clearly identified in the files)
see license
- entropix : entropix
More specifically entropix-local mlx but also heavily inspired by SmolLM x Entropix
- entropix/* : Most of the code was lifted from the entropix-local repo (research branch) and adapted to this project. See notes for more information on what part of the project were implemented here.
see license
VRAM will be your constraint : as a rule of thumb, 2GB of VRAM for each 1B parameters in bf/fp16, so choose your model accordingly. You can use quantized models to reduce VRAM requirements.
Tokenizers and models architectures are available in this repo for the models or model families described below. Any model that matches a PreTrainedTokenizer and a model in this repo should work : for example, SmolLM2 works with a GTP2 tokenizer and Lllama model.
Note: for base model, you may want to ignore chat templates --ignore-chat-template
and increase repetition penalty, in particular for smaller models (unless you are prompting for a math question, in which case repeating tokens is necessary). If you use the entropix sampler, repetition penalty at 1 is better.
- mlx-community/Qwen2.5-0.5B-bf16
ignoring chat template, forcing eos_token and higher repetition penalty:
uv run generate.py --model mlx-community/Qwen2.5-0.5B-bf16 --ignore-chat-template --eos-token "<|endoftext|>" --repetition-penalty 1.5 --prompt "This is the story of a dog that loves ice. On a cold Winter day"
uv run generate.py --model mlx-community/Qwen2.5-0.5B-bf16 --ignore-chat-template --eos-token "<|endoftext|>" --repetition-penalty 1.5 --prompt "Which is the larger number between 9.11 and 9.8? Thinking step-by-step,"
- mlx-community/Qwen2.5-Coder-32B-Instruct-bf16
uv run generate.py --model mlx-community/Qwen2.5-Coder-32B-Instruct-bf16 --eos-token "<|endoftext|>" --prompt "Please a functional implementation of transformers, i.e not the usual object-oriented implementation"
- mlx-community/Llama-3.2-1B-Instruct-bf16 (currently set as default model in constants.py)
uv run generate.py --prompt "Tell me a story about a dog that loves ice"
uv run generate.py --prompt "what is larger between 9.11 and 9.8? think step by step"
- mlx-community/Meta-Llama-3.1-70B-bf16 70B in bf16 can run on a MacStudio with M2 Ultra and 192GB of RAM but will likely be very slow (see note on large models in the MLX section below)
uv run generate.py --model mlx-community/Meta-Llama-3.1-70B-bf16 --ignore-chat-template --prompt "To address the strengths and weaknesses of the business model of hairdressers, we must primarily focus on the route to market and cost base :"
- mlx-community/gemma-2-2b-it-fp16
uv run generate.py --model mlx-community/gemma-2-2b-it-fp16 --prompt "Please write a the story about a dog that loves ice"
- mlx-community/gemma-2-2b-fp16
uv run generate.py --model mlx-community/gemma-2-2b-fp16 --ignore-chat-template --prompt "This is the story of a dog that loves ice. On a cold Winter day"
- mlx-community/gemma-2-27b-fp16
uv run generate.py --model mlx-community/gemma-2-27b-fp16 --ignore-chat-template --prompt "To address the strengths and weaknesses of the business model of hairdressers, we must primarily focus on the route to market and cost base :"
- mlx-community/SmolLM2-360M-Instruct
suggesting to increase repetition penalty for smaller models
uv run generate.py --model mlx-community/SmolLM2-360M-Instruct --repetition-penalty 1.5 --prompt "Please write a the story about a dog that loves ice"
uv run generate.py --model mlx-community/SmolLM2-360M-Instruct --repetition-penalty 1.5 --prompt "what is larger between 9.11 and 9.8? think step by step" --sampler "entropix"
EXPERIMENTAL : A new argument was added to select the entropix sampler instead of the default sampler. use --sampler "entropix"
to use this sampler.
Note that this sampler is only the implementation of the original entropix project and does not reflect the current research branch. Adaptive sampling thresholds for Llama3.2-1B and SMOLLM2-360M are pulled from the the entropix-local and entropix-smollm respectively. See entropix/config.py for more details.
Further work on the entropix methods would require to add jax to the project, which contradicts the overarching philosophy of FAMLX. Let's see what the original entropix project yields and if it's so compelling that a change in philosophy is mandated.
from the mlx-example-repo
Use -h
to see a list of available options for a command, e.g.:
mlx_lm.generate -h
Chat and streaming still WIP
This requires macOS 15.0 or higher to work.
Models which are large relative to the total RAM available on the machine can be slow. mlx-lm will attempt to make them faster by wiring the memory occupied by the model and cache. This requires macOS 15 or higher to work.
If you see the following warning message:
[WARNING] Generating with a model that requires ...
then the model will likely be slow on the given machine. If the model fits in RAM then it can often be sped up by increasing the system wired memory limit. To increase the limit, set the following sysctl:
sudo sysctl iogpu.wired_limit_mb=N The value N should be larger than the size of the model in megabytes but smaller than the memory size of the machine.