
Add support for LMStudio + MLX for maximal speed and efficiency on Apple Silicon #1495

Open
AriaShishegaran opened this issue Oct 25, 2024 · 3 comments

Comments

@AriaShishegaran

Is your feature request related to a problem? Please describe.
With the recent major improvements to LMStudio, including headless mode, it is now a powerful alternative to Ollama with many appealing features, including native support for MLX on Apple Silicon devices, which offers huge inference-time improvements for local models.

Based on my tests, I achieved a 300-500% speed improvement compared to running the same model (Llama3.2-3B) through Ollama.

@NolanTrem
Collaborator

I've been thinking about this as well… It seems that this is possible with LiteLLM as is, but it would be great if they had a full integration.

BerriAI/litellm#3755

Part of the reason we wouldn't look to support this natively in R2R (similar to how we no longer support Ollama directly and instead route through LiteLLM) is that we don't consider these integrations a core part of our infrastructure. Rather than maintaining integrations ourselves, we would look to contribute to LiteLLM and our other dependencies.
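
For anyone who wants to try it in the meantime, something along these lines should work, since LM Studio exposes an OpenAI-compatible server (by default at http://localhost:1234/v1) and LiteLLM's openai/ prefix can be pointed at any such endpoint. This is just a rough sketch; the model name is whatever LM Studio reports for the loaded model, and the API key is an arbitrary placeholder since LM Studio doesn't check it:

import litellm

# LM Studio's local server speaks the OpenAI API, so we route through
# LiteLLM's openai/ provider and override the base URL.
response = litellm.completion(
    model="openai/llama-3.2-3b-instruct",   # model name as loaded in LM Studio
    messages=[{"role": "user", "content": "Say hello from Apple Silicon."}],
    api_base="http://localhost:1234/v1",    # LM Studio's default server address
    api_key="lm-studio",                    # placeholder; LM Studio ignores it
)
print(response.choices[0].message.content)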

I'll play around with this over the weekend and will add some information into the docs. Let us know if you're able to get it working!

@AriaShishegaran
Author

@NolanTrem The approach makes total sense: if there were 10 other providers, it would make sense to have a higher-level router handle them and just use that.
As you said, at least having solid documentation for this would also be very helpful.
I'll try to make it work, and if I'm successful I'll share my insights here.

@NolanTrem
Collaborator

Had a chance to play around with LMStudio and was extremely impressed by its performance compared to Ollama. There were a few changes I had to make to our LiteLLM provider file to get embeddings to work (which just involved dropping unsupported parameters). I'll look to make this a permanent change, as I'd be inclined to switch over to LMStudio for most of my testing going forward.
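
To give a sense of the kind of change (this is an illustrative sketch, not the actual R2R patch, and the set of parameters is hypothetical), it amounts to stripping kwargs that LM Studio's OpenAI-compatible embeddings endpoint rejects before the LiteLLM call goes out; LiteLLM's drop_params setting may also cover part of this:

import litellm

# Optionally let LiteLLM drop provider-unsupported params globally:
# litellm.drop_params = True

# Hypothetical set of kwargs the local embeddings endpoint won't accept.
UNSUPPORTED_EMBEDDING_PARAMS = {"dimensions", "encoding_format"}

def embed_texts(texts, **kwargs):
    # Drop anything the local server would reject, then call through LiteLLM.
    cleaned = {k: v for k, v in kwargs.items() if k not in UNSUPPORTED_EMBEDDING_PARAMS}
    return litellm.embedding(
        model="openai/text-embedding-nomic-embed-text-v1.5",
        input=texts,
        api_base="http://localhost:1234/v1",  # LM Studio's default server address
        api_key="lm-studio",                  # placeholder; LM Studio ignores it
        **cleaned,
    )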

Here's the config that I ended up running with:

[agent]
system_instruction_name = "rag_agent"
tool_names = ["search"]

  [agent.generation_config]
  model = "openai/llama-3.2-3b-instruct"

[completion]
provider = "litellm"
concurrent_request_limit = 1

  [completion.generation_config]
  model = "openai/llama-3.2-3b-instruct"
  base_dimension = 768
  temperature = 0.1
  top_p = 1
  max_tokens_to_sample = 1_024
  stream = false
  add_generation_kwargs = { }

[embedding]
provider = "litellm"
base_model = "openai/text-embedding-nomic-embed-text-v1.5"
base_dimension = 768
batch_size = 128
add_title_as_prefix = true
concurrent_request_limit = 2

[orchestration]
provider = "simple"
