The problem
The current implementations of rag and async_rag return neither the chunk_id nor the document_id. This prevents creating proper citation sources in the response.
Solution
_contexts and retrieve_segments should return the (original) chunk_ids used for composing the segments and the document_ids instead of a list of strings.
A possible solution would be to return tuples of (document_id, segment_str):
# Convert the segments into tuples of (document_id, segment_text)
segments_with_ids = [
    (
        segment[0].document_id,  # Get document_id from first chunk in segment
        segment[0].headings.strip() + "\n\n" + "".join(chunk.body for chunk in segment).strip(),
    )
    for segment in segments
]
Maybe a better solution would be to create a proper type for Segment similar to Chunk.
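To make the idea of a dedicated type concrete, here is a minimal sketch. Both `ChunkStub` and `Segment` are hypothetical names for illustration only: `ChunkStub` stands in for RAGLite's Chunk with just the fields used in the snippet above, and `Segment` is one possible shape for the proposed type, not an existing RAGLite class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkStub:
    """Hypothetical stand-in for a Chunk, limited to the fields used here."""

    id: str
    document_id: str
    headings: str
    body: str


@dataclass(frozen=True)
class Segment:
    """Hypothetical type: a run of contiguous chunks from one document."""

    document_id: str
    chunk_ids: tuple[str, ...]
    text: str

    @classmethod
    def from_chunks(cls, chunks: list[ChunkStub]) -> "Segment":
        if not chunks:
            raise ValueError("A segment needs at least one chunk.")
        first = chunks[0]
        # Same text layout as the tuple-based proposal: headings, blank line, bodies.
        text = first.headings.strip() + "\n\n" + "".join(c.body for c in chunks).strip()
        return cls(
            document_id=first.document_id,
            chunk_ids=tuple(c.id for c in chunks),
            text=text,
        )


chunks = [
    ChunkStub(id="c1", document_id="doc-a", headings="# Intro", body="First part. "),
    ChunkStub(id="c2", document_id="doc-a", headings="# Intro", body="Second part."),
]
segment = Segment.from_chunks(chunks)
```

Compared with a bare tuple, a type like this keeps the original chunk ids next to the merged text, so a citation can point back to either the document or the individual chunks.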
Some considerations
We don't want to list every available segment as a source, only the ones the model decided to use. Also, we can't just take the list of original chunk_ids and document_ids and zip them with the segments, because the retrieve_segments method merges contiguous chunks, resulting in a many-to-one mapping between chunks and segments that we cannot reverse. In addition, providing the model with the document_id/chunk_id directly would potentially simplify the formatting of sources and allow other use cases (such as function calling using these ids).
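The many-to-one mapping can be illustrated with a simplified sketch. This is not RAGLite's retrieve_segments; `ChunkRef` and `group_contiguous` are hypothetical names, and the sketch assumes each chunk carries an `index` giving its position within its document.

```python
from typing import NamedTuple, Optional


class ChunkRef(NamedTuple):
    """Hypothetical minimal chunk record; `index` is the position in the document."""

    id: str
    document_id: str
    index: int


def group_contiguous(chunks: list[ChunkRef]) -> list[list[ChunkRef]]:
    """Group chunks into runs that share a document and have consecutive indices."""
    segments: list[list[ChunkRef]] = []
    for chunk in sorted(chunks, key=lambda c: (c.document_id, c.index)):
        last: Optional[ChunkRef] = segments[-1][-1] if segments else None
        if (
            last is not None
            and last.document_id == chunk.document_id
            and chunk.index == last.index + 1
        ):
            segments[-1].append(chunk)  # Extends the current run.
        else:
            segments.append([chunk])  # Starts a new run.
    return segments


retrieved = [
    ChunkRef("a-3", "doc-a", 3),
    ChunkRef("b-0", "doc-b", 0),
    ChunkRef("a-4", "doc-a", 4),
]
segments = group_contiguous(retrieved)
chunk_ids_per_segment = [[c.id for c in segment] for segment in segments]
```

Three retrieved chunks collapse into two segments here, so zipping the original chunk list with the segment list would misalign the ids. Keeping the ids attached to each segment while merging avoids that.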
It's actually already possible to get access to the RAG sources as follows:
from raglite import RAGLiteConfig, hybrid_search, rerank_chunks, rag

# Configure RAGLite (here, we use the default config):
my_config = RAGLiteConfig()

# Search for chunks:
prompt = "What does it mean for two events to be simultaneous?"
chunk_ids_hybrid, _ = hybrid_search(prompt, num_results=20, config=my_config)

# Rerank chunks:
chunks_reranked = rerank_chunks(prompt, chunk_ids_hybrid, config=my_config)

# Pass the retrieved chunks as context for RAG:
stream = rag(prompt, search=chunks_reranked, config=my_config)
The chunks_reranked list contains a lot of information on the sources, but if you need more information about the underlying document you could do this:
# Access the RAG sources:
from raglite._database import create_database_engine
from sqlmodel import Session

with Session(create_database_engine()) as session:
    # Reattach the chunks to a Session.
    chunks_reranked = [session.merge(chunk) for chunk in chunks_reranked]
    documents = [chunk.document for chunk in chunks_reranked]
That said, this API certainly isn't perfect yet. What do you think about the following improvements?
We expose the _max_contexts method to compute the maximum number of Chunks that will fit in the LLM context, given the user prompt, system prompt, and message history.
The developer retrieves and reranks Chunks according to the example above.
The developer transforms the Chunks to segments with retrieve_segments (which expands Chunks with their neighbours and concatenates them into contiguous segments).
We modify rag and async_rag to accept segments.
We modify the rag and async_rag prompt to be able to reference segments by number (e.g., "According to [3], ...").
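The last step, referencing segments by number, could be sketched roughly as follows. The helper names `build_context` and `cited_segment_numbers` are hypothetical and not part of RAGLite; the sketch only assumes that the model cites sources in the "[3]" style mentioned above.

```python
import re


def build_context(segments: list[str]) -> str:
    """Number each segment so the model can cite it as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(segments, start=1))


def cited_segment_numbers(response: str, num_segments: int) -> list[int]:
    """Return the distinct, valid segment numbers cited, in order of first use."""
    cited: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", response):
        number = int(match.group(1))
        # Ignore out-of-range numbers and repeated citations.
        if 1 <= number <= num_segments and number not in cited:
            cited.append(number)
    return cited


segments = ["Alpha text.", "Beta text.", "Gamma text."]
context = build_context(segments)
response = "According to [3], gamma holds, which [1] supports. See also [3] and [7]."
cited = cited_segment_numbers(response, num_segments=len(segments))
```

Parsing the cited numbers out of the response is one way to report only the segments the model actually used, rather than everything that was placed in the context.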