Can't find any configuration for OpenAI Embeddings #875

Open
HasnainKhanNiazi opened this issue Jun 19, 2024 · 4 comments

Comments

@HasnainKhanNiazi

Hey, I have been playing around with Marqo and, after running several experiments, I have a few questions.

  1. I can use any model listed in the models reference (https://docs.marqo.ai/2.8/Guides/Models-Reference/list_of_models/) to generate embeddings and create an index. But how can I use OpenAI text-embedding-3-large, or any other model that is not available on Hugging Face?
  2. I used [hf/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) to generate embeddings and create an index, but search isn't working as expected. For example, if I search for "bike", the first 3-4 retrieved documents are not even related to bikes.
  3. I have also tried filter_string, but with a filter the result is an empty list.

This is how I am using the Marqo index:

import marqo

# `data` is presumably a pandas DataFrame with "title", "markdown" and "attributes" columns.
docs = []

for index, row in data.iterrows():
    local_dict = {}
    local_dict["title"] = row["title"]
    local_dict["description"] = row["markdown"]
    local_dict["attributes"] = row["attributes"]
    docs.append(local_dict)

mq = marqo.Client(url='http://localhost:8882')
# Drop the previous index, then recreate it with the multilingual E5 model.
results = mq.index("my-first-index").delete()
mq.create_index("my-first-index", model='hf/multilingual-e5-large')

mq.index("my-first-index").add_documents(docs,
    tensor_fields=["title", "description", "attributes"], client_batch_size=64
)

results_with_filters = mq.index("my-first-index").search(
    q="Bike", filter_string="price:[0 To 1000]"
)

results_without_filters = mq.index("my-first-index").search(
    q="Bike"
)

Neither of the queries above works as expected. Any help or guidance would be appreciated. Thanks.
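One thing worth checking before debugging the filter itself: the documents built above only carry `title`, `description` and `attributes`, so a filter on `price` has no field to match against. The helper below is a hypothetical, Marqo-independent sanity check (the function name and the sample documents are made up for illustration) that reports which filter fields are absent from the documents about to be indexed.

```python
from typing import Dict, Iterable


def missing_filter_fields(docs: Iterable[dict], filter_fields: Iterable[str]) -> Dict[str, int]:
    """Count, per filter field, how many documents lack that field."""
    doc_list = list(docs)
    missing = {}
    for field in filter_fields:
        count = sum(1 for doc in doc_list if field not in doc)
        if count:
            missing[field] = count
    return missing


# Made-up sample documents with the same keys as the loop above produces.
docs = [
    {"title": "City bike", "description": "A light commuter bike", "attributes": "color: red"},
    {"title": "Mountain bike", "description": "Full suspension", "attributes": "color: black"},
]

print(missing_filter_fields(docs, ["price"]))  # {'price': 2}
```

If the check reports the field as missing everywhere, an empty result for the filtered search is the expected outcome rather than a search bug.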

@tomhamer
Contributor

Hey Hasnain, thanks for reaching out! Have you tried the regular e5/large embeddings? These perform significantly better in English. If you need multi-lingual embeddings, OpenAI doesn't support those at the moment. In any case, we don't currently support OpenAI embeddings in Marqo.

Another option to get better performance would be to sign up for Marqtune so you can fine-tune your embeddings for your use case.

@wanliAlex
Collaborator

Another option is to generate your embeddings outside Marqo and use the custom embeddings feature when indexing documents and searching. See the Marqo docs on how to index documents with custom vectors and on how to search with your custom vectors.
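A minimal sketch of that workflow, assuming the `custom_vector` mapping type and the `context` search parameter behave as described in the Marqo docs, and that the index was created with vector dimensions matching the external model. The field name `description_vector`, the helper names and `toy_embed` are all hypothetical: `toy_embed` exists only so the sketch runs offline, and in practice `embed` would wrap whatever external model you use (for example an OpenAI embeddings call).

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Embedder = Callable[[str], Vector]

VECTOR_FIELD = "description_vector"  # hypothetical field name


def toy_embed(text: str, dims: int = 4) -> Vector:
    """Offline stand-in for a real embedding model; NOT semantically meaningful."""
    vec = [0.0] * dims
    for i, ch in enumerate(text.lower()):
        vec[i % dims] += ord(ch) / 1000.0
    return vec


def build_custom_vector_docs(rows: Sequence[Dict[str, str]], embed: Embedder) -> List[dict]:
    """Attach an externally computed vector to each row's description."""
    return [
        {
            "title": row["title"],
            VECTOR_FIELD: {
                "content": row["description"],
                "vector": embed(row["description"]),
            },
        }
        for row in rows
    ]


def build_query_context(query: str, embed: Embedder) -> dict:
    """Wrap an externally computed query vector in a search context payload."""
    return {"tensor": [{"vector": embed(query), "weight": 1.0}]}


def index_and_search(mq, index_name: str, rows, query: str, embed: Embedder):
    """Assumed Marqo calls; `mq` is expected to be a marqo.Client.

    The `mappings` and `context` arguments follow the custom-vector docs as
    I understand them; verify them against your Marqo version.
    """
    mq.index(index_name).add_documents(
        build_custom_vector_docs(rows, embed),
        tensor_fields=[VECTOR_FIELD],
        mappings={VECTOR_FIELD: {"type": "custom_vector"}},
    )
    return mq.index(index_name).search(q=None, context=build_query_context(query, embed))
```

Keeping the payload builders separate from the client calls makes it easy to swap the embedding model without touching the indexing code, as long as the same `embed` is used for documents and queries.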

@HasnainKhanNiazi
Author

Thanks @tomhamer @wanliAlex for the suggestions. My documents are multi-lingual (German, Italian, English). I will check out the custom embeddings section as well.

One follow-up question about filter_string: to the best of my understanding, filter_string only works on values that are added as separate fields. For example:

If I add price as its own field like this, the filtering works fine:

mq.index("my-first-index").add_documents([
    {
        "Title": "The Travels of Marco Polo",
        "Description": "A 13th-century travelogue describing Polo's travels"
    }, 
    {
        "Title": "Extravehicular Mobility Unit (EMU)",
        "Description": "The EMU is a spacesuit that provides environmental protection, "
                       "mobility, life support, and communications for astronauts;  'price': '100'",
        "_id": "article_591",
        "price": "100",
    }],
    tensor_fields=["Description"]
)

But if the price is only written somewhere in the description, filter_string does not work.

The main problem in my case is that if I keep adding a new field for each attribute, I will end up with around 2000 fields, which is far too many. That is why I am looking for a way to do matching or fuzzy matching within the description.
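If the values only appear inside the description text, one possible workaround is to pull just the few attributes you actually filter on into structured fields before indexing, rather than creating a field per attribute. A minimal sketch, assuming prices appear as the word `price` followed by a number; the regex and helper names are illustrative, not part of Marqo, and the uppercase `TO` range syntax is taken from my reading of the Marqo filter documentation.

```python
import re
from typing import Optional

# Matches "price" followed by punctuation/whitespace and then a number,
# e.g. "'price': '100'" or "Price: 99.50".
PRICE_PATTERN = re.compile(r"price\W*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_price(description: str) -> Optional[float]:
    """Return the first price mentioned in the text, or None if there is none."""
    match = PRICE_PATTERN.search(description)
    return float(match.group(1)) if match else None


def with_price_field(doc: dict, text_field: str = "Description") -> dict:
    """Return a copy of `doc` with a numeric `price` field when one can be extracted."""
    enriched = dict(doc)
    price = extract_price(doc.get(text_field, ""))
    if price is not None and "price" not in enriched:
        enriched["price"] = price
    return enriched


def price_range_filter(low: float, high: float) -> str:
    """Build a range filter string (assumed Marqo syntax with uppercase TO)."""
    return f"price:[{low} TO {high}]"
```

The enriched documents can then be passed to `add_documents` as before, with `Description` kept as the only tensor field so that `price` stays a plain filterable attribute.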

@HasnainKhanNiazi
Author

@tomhamer

If you need multi-lingual embeddings, OpenAI doesn't support those at the moment.

What do you mean by this line? OpenAI text-embedding-3-large is multi-lingual, and for simple vector search it gives me better results than any open-source model I have compared it with. However, I also want to do keyword-style matching, as with filter_string.
