🔽Examples below • 📙 Visit my other repo to learn more about Vision Language Models
- For Windows and Linux
cd custom_nodes
git clone https://github.com/gokayfem/ComfyUI_VLM_nodes.git
If you get errors related to llama-cpp-python or if it is not using GPU.
I recommend installing it with the right arguments provided in this link llama-cpp-python
Utilizes llama-cpp-python
for integration of LLaVa models. You can load and use any VLM with LLaVa models in GGUF format with this nodes.
You need to download the model similar to ggml-model-q4_k.gguf
and it's clip projector similar to mmproj-model-f16.gguf
from this repositories (in the files and versions).
python=>3.9
is necessary.
Put all of the files inside models/LLavacheckpoints
Note that every model's clip projector is different!
Getting structured outputs can be quite challenging through prompt engineering alone.
I've added the Structured Output node to VLM Nodes.
Now, you can obtain your answers reliably.
You can extract entities, numbers, classify prompts with given classes, and generate one specific prompt. These are just a few examples.
You can add additional descriptions to fields and choose the attributes you want it to return.
Utilizes VLMs, LLMs and AudioLDM-2 to make music from images.
Use SaveAudioNode to save the music inside output
folder.
It will automatically download the necessary files into models/LLavacheckpoints/files_for_audioldm2
output.mp4
Utilizes Chat Musician, an open-source LLM that integrates intrinsic musical abilities.
ChatMusician Demo Page
You can try prompts from this demo page.
Download the GGUF file
ChatMusician GGUF Files
ChatMusician.Q5_K_M.gguf or ChatMusician.Q5_K_S.gguf recommended
BIG BIG BIG Warning: It does NOT work perfectly, if you got errors accept the error queue prompt again with the same settings!!
chatmusician.mp4
Utilizes AutoGPTQ
for integration of InternLM-XComposer2-VL Model. It will automatically download the necessary files into models/LLavacheckpoints/files_for_internlm
.
This is one of the best models for visual perception.
Important Note : This model is heavy.
Get Keyword node: It can take LLava outputs and extract keywords from them.
LLava PromptGenerator node: It can create prompts given descriptions or keywords using (input prompt could be Get Keyword or LLava output directly).
Suggester node: It can generate 5 different prompts based on the original prompt using consistent in the options or random prompts using random in the options.
- Works best with LLava 1.5 and 1.6.
Play with the temperature
for creative or consistent results. Higher the temperature more creative are the results.
If you want to dive deep into LLM Settings
Outputs are JSON looking texts, you can see them as a text using JsonToText Node.
You can see any string output with ViewText Node
You can set any string input using SimpleText Node
Utilizes llama-cpp-agents
for getting structured outputs.
LLM PromptGenerator node:
Qwen 1.8B Stable Diffusion Prompt
IF prompt MKR
This LLM's works best for now for prompt generation.
LLMSampler node: You can chat with any LLM in gguf format, you can use LLava models as an LLM also.
API PromptGenerator node: You can use ChatGPT and DeepSeek API's to create prompts. https://platform.deepseek.com/ gives 10m free tokens.
- ChatGPT-4
- ChatGPT-3.5
- DeepSeek You can use them for simple chat also there is an option in the node.
UForm-Gen2 is an extremely fast small generative vision-language model primarily designed for Image Captioning and Visual Question Answering.
UForm-Gen2 Qwen
It will automatically download the necessary files into models/LLavacheckpoints/files_for_uform_gen2_qwen
Kosmos-2: Grounding Multimodal Large Language Models to the World.
Kosmos-2
It will automatically download the necessary files into models/LLavacheckpoints/files_for_kosmos2
This node is designed to work with the Moondream model, a powerful small vision language model built by @vikhyatk using SigLIP, Phi-1.5, and the LLaVa training dataset. The model boasts 1.6 billion parameters and is made available for research purposes only; commercial use is not allowed.
moondream2 is a small vision language model designed to run efficiently on edge devices.
It will automatically download the necessary files into models/LLavacheckpoints/files_for__moondream
and models/LLavacheckpoints/files_for_moondream2
@fpgamine's JoyTag is a state of the art AI vision model for tagging images, with a focus on sex positivity and inclusivity.
It uses the Danbooru tagging schema, but works across a wide range of images, from hand drawn to photographic.
It will automatically download the necessary files into models/LLavacheckpoints/files_for_joytagger
Utilizes the latest Qwen2-VL series of models, which are state-of-the-art vision language models supporting various resolutions, ratios, and languages. The models excel at:
- Understanding images of various resolutions & ratios
- Complex visual reasoning and decision making
- Multilingual support (English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, etc.)
Available models include 2B, 7B, and 72B parameter versions, with standard, AWQ, and GPTQ quantized variants. It will automatically download the necessary files into models/LLavacheckpoints/files_for_qwen2vl
.
Important Note: Larger models (7B, 72B) require significant VRAM. Choose quantized versions (AWQ, GPTQ) for reduced memory usage.