-
I'm concerned about the quality implications of a "two step" approach. Vision-language models like Jina CLIP are specifically trained to create embeddings in a shared semantic space where images and text align directly. The two-step process (image --> text description --> text embedding) introduces a significant information bottleneck: the text description loses visual details that CLIP would capture directly from pixels, and a text-only embedding model was never trained on vision-text alignment. This will result in notably lower quality embeddings compared to local CLIP.

A multimodal embedding model is really what's necessary here for this to work well. Ollama doesn't currently support it (see the open feature request here), so it's unlikely that we'd want to implement anything until multimodal embedding models are at least supported there. Additionally, we just opened an official feature request in the queue here: #21228
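To make the bottleneck concrete, here is a minimal sketch contrasting the two paths. It assumes the sentence-transformers CLIP checkpoint `clip-ViT-B-32` and a local `snapshot.jpg`; the description string stands in for hypothetical genai output. Using CLIP's own text encoder for the description actually understates the problem, since the proposal would run the description through a separate text-only embedding model on top of the lossy captioning step.

```python
# Minimal sketch: direct multimodal embedding vs. embedding a text description.
# Assumes sentence-transformers with the clip-ViT-B-32 checkpoint and a local
# snapshot.jpg; the description below stands in for hypothetical genai output.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")  # images and text share one embedding space

image = Image.open("snapshot.jpg")
query = "a person carrying a package to the front door"

# Direct path: embed the pixels and the search query in the same space.
image_emb = clip.encode(image)
query_emb = clip.encode(query)
direct_score = util.cos_sim(image_emb, query_emb)

# Two-step path: embed only a text description of the image. Visual details the
# description omits (clothing color, the package itself, lighting) are lost for good.
description = "a person is standing outside a house"  # hypothetical genai output
desc_emb = clip.encode(description)
two_step_score = util.cos_sim(desc_emb, query_emb)

print(float(direct_score), float(two_step_score))
```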
-
Thanks for creating the discussion. It's worth pointing out that 0.17 is already frozen for new features, so this would go into 0.18, which has many months of work ahead of it; we will most likely approach this slowly as we consider the best approach.
-
I'd like to add support for remote embedding providers. One option that seems relatively simple to add is using the genai providers to do embeddings. The main issue with this approach is that none of the existing providers support multi-modal embedding APIs/models. Another mechanism is to support clip-as-service, which uses the API maintained by Jina and seems to support the same models Frigate already uses.
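As a rough, hypothetical sketch of the abstraction (none of these class names, endpoint paths, or payload shapes exist in Frigate or clip-as-service today), something like this would let the rest of the code stay agnostic about which remote backend is configured:

```python
# Hypothetical remote embedding provider abstraction; class names, endpoint paths,
# and payload shapes are illustrative placeholders, not an existing API.
import base64
from abc import ABC, abstractmethod

import requests


class RemoteEmbeddingProvider(ABC):
    """Surface Frigate could code against regardless of which backend is configured."""

    @abstractmethod
    def embed_text(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def embed_images(self, images: list[bytes]) -> list[list[float]]: ...


class HttpClipProvider(RemoteEmbeddingProvider):
    """Talks to a CLIP-style service that embeds both modalities into one space."""

    def __init__(self, base_url: str, timeout: float = 30.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def embed_text(self, texts: list[str]) -> list[list[float]]:
        # The route and JSON shape are placeholders for whatever the chosen
        # service actually exposes.
        resp = requests.post(
            f"{self.base_url}/embed/text", json={"inputs": texts}, timeout=self.timeout
        )
        resp.raise_for_status()
        return resp.json()["embeddings"]

    def embed_images(self, images: list[bytes]) -> list[list[float]]:
        payload = {"inputs": [base64.b64encode(i).decode() for i in images]}
        resp = requests.post(
            f"{self.base_url}/embed/image", json=payload, timeout=self.timeout
        )
        resp.raise_for_status()
        return resp.json()["embeddings"]
```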
One way to support embeddings with genai (with the currently supported API schemas/models) is to use the image models to produce a text description of each image, and then embed that text instead of the image. This will likely work with some tweaking of the prompts to make the descriptions useful for this purpose.
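A sketch of that two-step flow, assuming an Ollama instance with a vision model (e.g. llava) for descriptions and a text embedding model (e.g. nomic-embed-text). The model names, prompt, and address are examples only, and older Ollama versions expose `/api/embeddings` with a `prompt` field instead of `/api/embed`.

```python
# Two-step flow sketch: describe the image with a vision model, then embed the
# description with a text embedding model. Model names, prompt, and address are
# examples; endpoint shapes follow Ollama's documented API but may vary by version.
import base64

import requests

OLLAMA_URL = "http://ollama:11434"  # example address


def describe_image(image_bytes: bytes) -> str:
    """Step 1: ask a vision-capable genai model for a search-oriented description."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "llava",
            "prompt": "Describe the people, vehicles, and objects in this image "
                      "in one detailed sentence suitable for search.",
            "images": [base64.b64encode(image_bytes).decode()],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]


def embed_text(text: str) -> list[float]:
    """Step 2: embed the description with a text-only embedding model."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embed",
        json={"model": "nomic-embed-text", "input": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"][0]


with open("snapshot.jpg", "rb") as f:
    vector = embed_text(describe_image(f.read()))
```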
Another thing to consider is whether we need support for multiple genai providers, especially as we start leveraging features that may be present in some providers and not in others.
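One hypothetical way to express that is a per-provider capability table, so features are only enabled where the configured provider supports them. The entries below are illustrative, not a definitive survey of what each provider offers.

```python
# Hypothetical per-provider capability flags; names and entries are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCapabilities:
    text_embeddings: bool = False
    image_embeddings: bool = False   # true multimodal embedding support
    vision_description: bool = False  # can caption images for the two-step path


CAPABILITIES = {
    "ollama": ProviderCapabilities(text_embeddings=True, vision_description=True),
    "clip_service": ProviderCapabilities(text_embeddings=True, image_embeddings=True),
}


def supports_direct_image_search(provider: str) -> bool:
    return CAPABILITIES.get(provider, ProviderCapabilities()).image_embeddings
```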
As the feature set evolves, we may also need to detect prompt/model changes that indicate a reindex is required. For example, if Ollama adds support for direct multi-modal embeddings, Frigate should add support in a way that lets users opt in to the new feature or keep the old mechanism to avoid forcing a reindex.
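One possible approach, sketched below, is to persist a fingerprint of every setting that affects the stored vectors and compare it on startup; the key names and storage path are hypothetical.

```python
# Sketch: detect that a reindex is needed by persisting a fingerprint of the
# settings that affect embeddings. Key names and the storage path are hypothetical.
import hashlib
import json
from pathlib import Path

FINGERPRINT_FILE = Path("/config/.embeddings_fingerprint")  # example location


def embedding_fingerprint(provider: str, model: str, prompt: str, direct_multimodal: bool) -> str:
    """Hash every setting whose change would make existing vectors incomparable."""
    payload = json.dumps(
        {
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "direct_multimodal": direct_multimodal,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def reindex_required(current: str) -> bool:
    """True when the stored fingerprint is missing or differs from the current one."""
    if not FINGERPRINT_FILE.exists():
        return True
    return FINGERPRINT_FILE.read_text().strip() != current


def save_fingerprint(current: str) -> None:
    FINGERPRINT_FILE.write_text(current)
```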
I guess another point to discuss is whether this feature is necessary at all. My opinion is that as models grow larger and require more specialized hardware, it's very useful to have a dedicated system for LLMs and other large AI models, with the GPUs and other hardware needed to support them, set up once and leveraged by other systems via an API. Additionally, since most embedding use cases are a non-critical function of a security system, it also makes sense to segregate this processing to maintain stability of the core system as load goes up.