---
Docling is also available via LlamaIndex, so another avenue is to adopt LlamaIndex. What is not clear to me is whether doing so would require using it for vector store interactions as well, or what the associated tradeoffs would be.
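For reference, the LlamaIndex integration is a thin reader layer, something like the sketch below, assuming the `llama-index-readers-docling` package (names per its published examples; worth verifying against the current release):

```python
# Sketch: Docling via LlamaIndex's reader interface. Whether adopting this
# also pulls in LlamaIndex's vector store machinery is the open question above.
from llama_index.readers.docling import DoclingReader

reader = DoclingReader()
documents = reader.load_data("paper.pdf")  # LlamaIndex Document objects
```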
---
I agree with @jwm4 that providing a controllable choice of conversion tool / chunking method is a viable option. I have just created an issue on that (#1061) and am currently working on a PR; unfortunately, I only discovered this thread after opening the issue. One important question here is whether the choice should be made on the server side (via the .yaml configuration) or on the client side, i.e., by letting the user specify their preferences in the insert() method. Both have their pros and cons; a sketch of the client-side variant follows.
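For illustration, the client-side variant might look something like this. The `converter` and `chunking_strategy` parameters are hypothetical, made up here to show the shape of the API; only `documents`, `vector_db_id`, and `chunk_size_in_tokens` exist in today's `rag_tool.insert`:

```python
# Hypothetical sketch: per-call control over conversion and chunking.
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import Document

client = LlamaStackClient(base_url="http://localhost:8321")
documents = [
    Document(
        document_id="doc-1",
        content="https://example.com/paper.pdf",
        mime_type="application/pdf",
        metadata={},
    )
]

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id="my-vector-db",
    chunk_size_in_tokens=512,
    converter="docling",               # hypothetical knob
    chunking_strategy="by_structure",  # hypothetical knob
)
```

The server-side alternative would put the same choice in the provider's .yaml configuration instead, so all clients of a given deployment share one behavior.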
---
I think it's worth considering generics when designing a mechanism for us to handle transforming arbitrary input data into context-ready documents. Docling is a great tool for that, and we should probably propose a solution that empowers users to define their data and transform it according to their requirements. Feast (@feast-dev) does this through an arbitrary UDF that can be executed at different times (e.g., during data ingestion). It would probably be useful to make thresholding, chunking, tokenization, and transformation all configurable components of the RAG experience, keeping the existing behavior as the default.
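As a sketch of the generics/UDF idea, mirroring the Feast pattern rather than any existing Llama Stack API (every name below is hypothetical):

```python
# Hypothetical sketch: the user registers an arbitrary function that turns
# raw input into context-ready chunks, and the stack runs it at ingestion time.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

# A transform is an arbitrary UDF from raw bytes + mime type to chunks.
TransformFn = Callable[[bytes, str], list[Chunk]]

_TRANSFORMS: dict[str, TransformFn] = {}

def register_ingestion_transform(name: str, fn: TransformFn) -> None:
    """Hypothetical registry hook; invoked by the stack during ingestion."""
    _TRANSFORMS[name] = fn

# Example: a user-defined transform with its own chunking rule.
def paragraph_transform(raw: bytes, mime_type: str) -> list[Chunk]:
    text = raw.decode("utf-8", errors="ignore")  # stand-in for real conversion
    return [Chunk(text=p) for p in text.split("\n\n") if p.strip()]

register_ingestion_transform("paragraphs", paragraph_transform)
```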
---
I am looking at the code in vector_store.py, and it looks like it is using pypdf to convert documents to text and then splitting that text into overlapping chunks of fixed length. Splitting text into fixed-length chunks tends to produce incoherent chunks: for example, you might get a chunk that starts in the middle of the last sentence of a section and then continues into the start of the following section. The overlap partially offsets this limitation: in the previous example, the sentence that is split at the start of the chunk is probably also available as a complete sentence in a different chunk, so it is at least possible for RAG to retrieve the entire sentence when generating an answer. However, overlapping chunks bring problems of their own. For example, the redundancy among overlapping chunks can produce a lot of duplicated content in the top search results, crowding out lower-ranked, distinct results that might actually contain the answers.
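To make the failure mode concrete, here is a simplified version of that pattern (an illustration, not a verbatim copy of vector_store.py):

```python
# Simplified illustration: pypdf extracts flat text, then a sliding window
# cuts it into fixed-length, overlapping chunks with no regard for sentence
# or section boundaries.
from pypdf import PdfReader

def fixed_length_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]

text = "".join(page.extract_text() or "" for page in PdfReader("paper.pdf").pages)
chunks = fixed_length_chunks(text)
# A boundary can land mid-sentence; the overlap means the split sentence
# usually also appears whole in a neighboring chunk, at the cost of
# duplicated content among the top search results.
```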
One way to deal with that would be to replace the pypdf-plus-fixed-length pipeline with Docling: convert each document into Docling's structured representation and chunk along the document structure (sections, paragraphs, tables) rather than at fixed character offsets.
This would have a variety of other benefits too, e.g., access to Docling's relatively sophisticated handling of tables, which can be very important for RAG applications where a lot of critical information is in tables.
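A rough sketch of what that pipeline could look like, using Docling's `DocumentConverter` and `HybridChunker` (names per Docling's published docs; exact APIs may shift between releases):

```python
# Sketch only: convert with Docling, then chunk along document structure.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("paper.pdf").document
chunker = HybridChunker()
for chunk in chunker.chunk(dl_doc=doc):
    # Chunk boundaries follow sections and tables, so a table can stay
    # together in one chunk instead of being cut at an arbitrary offset.
    print(chunk.text)
```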
Alternatively, rather than replacing the existing logic, we could add Docling-based processing as an alternative option. If we make it an option, it could be specified in the `llama_stack_client.types.Document` constructor or in the call to `client.tool_runtime.rag_tool.insert`. I would only want it to be an option if we can provide relatively clear guidance to users about the pros and cons of each option and how to choose between them.

I would note that DocQA in llama-stack-apps is already using Docling outside of Llama Stack: they convert each document with Docling and then split the converted text into fixed-size chunks.
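A minimal sketch of that pattern, assuming Docling's `export_to_markdown()` (an illustration, not DocQA's actual code):

```python
# Docling handles conversion (including tables), but chunking stays naive:
# the markdown is split at fixed offsets, ignoring the recovered structure.
from docling.document_converter import DocumentConverter

markdown = DocumentConverter().convert("paper.pdf").document.export_to_markdown()
chunks = [markdown[i : i + 512] for i in range(0, len(markdown), 448)]  # 64-char overlap
```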
This seems like a decent way to get some of the benefits of Docling, e.g., the table processing. However, it misses out on the opportunity to use the document structure to do the chunking. That seems particularly important for complex structures like tables, where it is especially valuable to keep the whole table together in one chunk whenever possible.
Another open question: if we do add Docling support, should it be inline, remote (so you can scale Docling out independently), or configurable for either? If it is configurable, it seems like it would make sense to have a provider type for this purpose (document processing), which would be a much bigger change than an in-place rewrite of vector_store.py to use Docling instead of pypdf. However, it would also be more impactful and could open the door to a principled way of supporting a broader assortment of document processing technologies; a possible contract is sketched below.
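If we went the provider route, the contract might look something like this (purely hypothetical; nothing like it exists in the codebase today):

```python
# Hypothetical provider contract for a "document processing" API, with an
# inline and a remote implementation, in the spirit of how other Llama Stack
# provider types split along the same lines.
from typing import Protocol

class DocumentProcessor(Protocol):
    async def process(self, content: bytes, mime_type: str) -> list[str]:
        """Turn raw document bytes into context-ready text chunks."""
        ...

class InlineDoclingProcessor:
    """Runs Docling in-process; simple, but conversion competes for the
    server's CPU."""
    async def process(self, content: bytes, mime_type: str) -> list[str]:
        raise NotImplementedError  # would call Docling directly here

class RemoteDoclingProcessor:
    """Calls a separately deployed Docling service, so document processing
    can be scaled out independently of the stack."""
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    async def process(self, content: bytes, mime_type: str) -> list[str]:
        raise NotImplementedError  # would POST content to self.base_url
```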