Why does RAG agent example from Getting Started seem to stop retrieving any context from documents? #995
-
I followed the getting-started guide at https://llama-stack.readthedocs.io/en/latest/getting_started/index.html. When I ran the RAG agent example (https://llama-stack.readthedocs.io/en/latest/getting_started/index.html#your-first-rag-agent), I saw the following behavior.
I found that a workaround for this is removing faiss_store.db.
After that, the script works again, for a number of executions.
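In case it helps others, the workaround looks like this (a sketch: it assumes you run it from the directory holding faiss_store.db and that the llama-stack server is stopped; moving the file instead of deleting it keeps a copy whose checksum you can still inspect):

```shell
# Placeholder file stands in for the real index so this snippet is self-contained;
# against a real deployment, skip the touch and operate on the actual faiss_store.db.
touch faiss_store.db
# Move the index aside rather than deleting it outright, keeping a backup to inspect.
mv faiss_store.db faiss_store.db.bak
```

On the next run, the example recreates a fresh index and retrieval works again until the duplicates build back up.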
Then eventually, after three reasonable answers, it is back to the context-less greetings.
It feels like maybe inference touches the kvstore somehow? (I see the db file's checksum changes, but I'm not sure whether that is just SQLite runtime metadata of no importance to the embeddings.) Does the output make sense? Shouldn't the answers be independent of the number of times the agent code is executed?
Replies: 2 comments
-
Thanks for raising this. It happens because we keep updating the same faiss index (vector db) with the same document chunks. That means eventually, when we ask for context, we get the same chunk again and again, which can completely throw off the model response. #998 fixes this.
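To illustrate the failure mode (this is a toy sketch, not the actual llama-stack API): each execution of the script inserts the same chunks into the persisted index, so after a few runs a top-k query can return k copies of one chunk instead of k distinct ones. One plausible guard, shown here, is deduplicating on a stable chunk id before inserting; the names `ToyIndex`, `insert`, and `doc1-chunk0` are all made up for the example.

```python
# Toy stand-in for a persisted vector index that grows on every script run.
class ToyIndex:
    def __init__(self):
        self.chunks = []       # what accumulates in faiss_store.db across runs
        self.seen_ids = set()  # ids already inserted, used by the dedup guard

    def insert(self, chunk_id, text, dedup=True):
        """Append a chunk; with dedup=True, skip ids we've already stored."""
        if dedup and chunk_id in self.seen_ids:
            return False       # same document chunk as a previous run: skip it
        self.seen_ids.add(chunk_id)
        self.chunks.append((chunk_id, text))
        return True

index = ToyIndex()
for run in range(3):           # simulate three executions of the agent script
    index.insert("doc1-chunk0", "torchtune supports LoRA", dedup=False)

# Without the guard, the index holds three copies of the same chunk,
# so top-3 retrieval returns the identical text three times.
print(len(index.chunks))  # → 3
```

With `dedup=True` the second and third runs are no-ops for already-seen chunks, so retrieval keeps returning distinct context.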
-
Thanks for the fix! Confirmed it resolves the issue.