chunk_id and document_id not accessible #50

Open
undo76 opened this issue Nov 23, 2024 · 1 comment

Comments

@undo76
Contributor

undo76 commented Nov 23, 2024

The problem

The current implementations of rag and async_rag return neither the chunk_id nor the document_id of the retrieved context. This makes it impossible to build proper source citations for the response.

Solution

_contexts and retrieve_segments should return the (original) chunk_ids used to compose each segment, together with the document_ids, instead of a plain list of strings.

A possible solution would be to return tuples of (document_id, segment_str):

    # Convert the segments into tuples of (document_id, segment_text)
    segments_with_ids = [
        (
            segment[0].document_id,  # Get document_id from first chunk in segment
            segment[0].headings.strip() + "\n\n" + "".join(chunk.body for chunk in segment).strip()
        )
        for segment in segments
    ]

Maybe a better solution would be to create a proper type for Segment similar to Chunk.
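
A minimal sketch of what such a type could look like. This is not RAGLite's actual API: the Segment class, its from_chunks constructor, and the Chunk stand-in below are hypothetical, and the stand-in only assumes the id, document_id, headings, and body fields used in the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for RAGLite's Chunk, with only the fields used here."""

    id: str
    document_id: str
    headings: str
    body: str


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment type that remembers which chunks it was built from."""

    document_id: str
    chunk_ids: tuple[str, ...]
    text: str

    @classmethod
    def from_chunks(cls, chunks: list[Chunk]) -> "Segment":
        if not chunks:
            raise ValueError("A segment needs at least one chunk.")
        if len({chunk.document_id for chunk in chunks}) != 1:
            raise ValueError("All chunks in a segment must belong to one document.")
        # Same text construction as the tuple-based snippet above.
        text = chunks[0].headings.strip() + "\n\n" + "".join(c.body for c in chunks).strip()
        return cls(
            document_id=chunks[0].document_id,
            chunk_ids=tuple(c.id for c in chunks),
            text=text,
        )


segment = Segment.from_chunks(
    [
        Chunk(id="c1", document_id="d1", headings="# Title", body="First. "),
        Chunk(id="c2", document_id="d1", headings="# Title", body="Second."),
    ]
)
print(segment.document_id, segment.chunk_ids)
```

Compared to a bare tuple, a dedicated type keeps the chunk_ids alongside the text, so the chunk-to-segment mapping survives the merge.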

Some considerations

We don't want to list every available segment as a source, only the ones the model actually decided to use. Also, we can't simply zip the original chunk_ids and document_ids with the segments, because the retrieve_segments method merges contiguous chunks, resulting in a many-to-one mapping between chunks and segments that we cannot reverse. In addition, giving the model the document_id/chunk_id directly could simplify the formatting of sources and enable other use cases (e.g., function calling with these ids).
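
To illustrate the many-to-one mapping, here is a small sketch that groups retrieved chunks into contiguous runs while recording which chunk_ids ended up in each run. It does not reproduce retrieve_segments: the group_contiguous helper is hypothetical, chunks are represented as plain (chunk_id, document_id, index) tuples, and it assumes each chunk carries its position within its document.

```python
def group_contiguous(
    chunks: list[tuple[str, str, int]],
) -> list[list[tuple[str, str, int]]]:
    """Group (chunk_id, document_id, index) triples into runs of neighbouring chunks."""
    # Sort by document, then by position, so that neighbours end up adjacent.
    ordered = sorted(chunks, key=lambda c: (c[1], c[2]))
    groups: list[list[tuple[str, str, int]]] = []
    for chunk in ordered:
        last = groups[-1][-1] if groups else None
        if last is not None and last[1] == chunk[1] and last[2] + 1 == chunk[2]:
            groups[-1].append(chunk)  # Same document and directly follows the previous chunk.
        else:
            groups.append([chunk])  # Start a new segment.
    return groups


retrieved = [("c7", "docA", 7), ("c3", "docB", 3), ("c8", "docA", 8), ("c2", "docA", 2)]
groups = group_contiguous(retrieved)

# Keep the mapping at merge time: one (document_id, chunk_ids) entry per segment.
segment_sources = [(group[0][1], [chunk_id for chunk_id, _, _ in group]) for group in groups]
print(segment_sources)
```

Four chunks become three segments here, which is why the ids have to be recorded during the merge rather than zipped on afterwards. Note that this sketch orders segments by document position rather than by relevance, which is a simplification.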

@lsorber
Member

lsorber commented Nov 25, 2024

Hi @undo76, thanks for submitting this issue!

It's actually already possible to get access to the RAG sources as follows:

    from raglite import RAGLiteConfig, hybrid_search, rerank_chunks, rag

    # Configure RAGLite (here, we use the default config):
    my_config = RAGLiteConfig()

    # Search for chunks:
    prompt = "What does it mean for two events to be simultaneous?"
    chunk_ids_hybrid, _ = hybrid_search(prompt, num_results=20, config=my_config)

    # Rerank chunks:
    chunks_reranked = rerank_chunks(prompt, chunk_ids_hybrid, config=my_config)

    # Pass the retrieved chunks as context for RAG:
    stream = rag(prompt, search=chunks_reranked, config=my_config)

The chunks_reranked list contains a lot of information on the sources, but if you need more information about the underlying document you could do this:

    # Access the RAG sources:
    from raglite._database import create_database_engine
    from sqlmodel import Session

    with Session(create_database_engine()) as session:
        # Reattach the chunks to a Session so that their relationships can be loaded.
        chunks_reranked = [session.merge(chunk) for chunk in chunks_reranked]
        documents = [chunk.document for chunk in chunks_reranked]

That said, this API certainly isn't perfect yet. What do you think about the following improvements?

  1. We expose the _max_contexts method to compute the maximum number of Chunks that will fit in the LLM context, given the user prompt, system prompt, and message history.
  2. The developer retrieves and reranks Chunks according to the example above.
  3. The developer transforms the Chunks to segments with retrieve_segments (which expands Chunks with their neighbours and concatenates them into contiguous segments).
  4. We modify rag and async_rag to accept segments.
  5. We modify the rag and async_rag prompt to be able to reference segments by number (e.g., "According to [3], ...").
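
As a rough sketch of how step 5 could work from the developer's side, the snippet below numbers the segments for the prompt and then extracts which numbers the model actually cited. Both helpers (format_numbered_context and cited_segment_numbers) are hypothetical and not part of RAGLite, and the bracketed "[3]" citation format is taken from the example in step 5.

```python
import re


def format_numbered_context(segments: list[str]) -> str:
    """Prefix each segment with a 1-based number so the model can cite it as [n]."""
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(segments, start=1))


def cited_segment_numbers(response: str, num_segments: int) -> list[int]:
    """Return the distinct, valid segment numbers cited in the response, in order of first use."""
    cited: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", response):
        number = int(match.group(1))
        # Ignore numbers that don't correspond to a provided segment, and repeats.
        if 1 <= number <= num_segments and number not in cited:
            cited.append(number)
    return cited


segments = [
    "Simultaneity is relative.",
    "Clocks can be synchronised with light signals.",
    "An unrelated passage.",
]
context = format_numbered_context(segments)
answer = "According to [2], clocks are synchronised with light; see also [1] and [2]. [9] is not a source."
cited = cited_segment_numbers(answer, len(segments))
print(cited)
```

Only the cited numbers would then be mapped back to their document_ids and chunk_ids, so the sources list contains just the segments the model chose to use.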
