feat: persist PDF bookmark outline as document metadata #13287
Open
yuch85 wants to merge 1 commit into infiniflow:main from
PDF files often contain a bookmark/outline tree (a table of contents built into the file structure by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples, but they are used ephemerally during chunking and then discarded.

This commit persists the outline as `doc.meta_fields["outline"]` so downstream features can use it:
- TOC-enhanced retrieval (complementary to the LLM-based `toc_extraction`; this is free and works for all parser types)
- Document navigation in the frontend
- Entity extraction (structural map of the document)

Changes:
- `rag/app/naive.py`: Attach `pdf_parser.outlines` as `__outline__` on the first chunk dict (transient key, same pattern as `metadata_obj`)
- `rag/app/manual.py`: Same attachment for the manual parser
- `rag/svr/task_executor.py`: Extract `__outline__` from the first chunk after `build_chunks()`, persist via `DocMetadataService.update_document_metadata()`

No schema changes: uses the existing `meta_fields` JSON column.
Zero cost: outlines are already extracted by pypdf, just not saved.
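The transient-key hand-off described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `attach_outline` and `pop_outline` are hypothetical helper names, and chunks are assumed to be plain dicts, as the description suggests.

```python
from typing import Any

OUTLINE_KEY = "__outline__"  # transient key named in the commit message


def attach_outline(chunks: list[dict[str, Any]], outlines: list[tuple[str, int]]) -> None:
    """Parser side (hypothetical helper): stash the (title, depth) tuples on the first chunk."""
    if chunks and outlines:
        chunks[0][OUTLINE_KEY] = list(outlines)


def pop_outline(chunks: list[dict[str, Any]]) -> list[tuple[str, int]]:
    """Executor side (hypothetical helper): remove the key so it never reaches the index."""
    if not chunks:
        return []
    return chunks[0].pop(OUTLINE_KEY, [])


# Illustrative data only.
chunks = [{"content": "Intro text"}, {"content": "More text"}]
attach_outline(chunks, [("Chapter 1", 0), ("Section 1.1", 1)])
outline = pop_outline(chunks)
```

Popping the key, rather than only reading it, is what keeps the outline out of the indexed chunk; documents without bookmarks simply yield an empty list.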
Summary
PDF files often contain a bookmark/outline tree (a table of contents built into the file by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples via pypdf, but they are used ephemerally during chunking (the `manual` parser uses them for hierarchy detection) and then discarded.

This PR persists the outline as `doc.meta_fields["outline"]`, a JSON array of `{"title": str, "depth": int}` objects, so downstream features can use the structural information.

Why this matters
- Complementary to `toc_extraction`: the existing `toc_extraction` feature uses LLM calls to generate a TOC and only works for the `naive` parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure.

Changes
- `rag/app/naive.py`: attach `pdf_parser.outlines` as `__outline__` on the first chunk dict
- `rag/app/manual.py`: same attachment for the manual parser
- `rag/svr/task_executor.py`: extract `__outline__`, persist via `DocMetadataService.update_document_metadata()`

Design decisions
- The outline travels as a transient `__outline__` key on the first chunk dict and is removed before indexing. This follows the same pattern as `metadata_obj` for LLM-generated metadata.
- No schema changes: uses the existing `meta_fields` JSON column on the document table.

Backward compatibility
- Existing `meta_fields` data is preserved (merged, not overwritten)

Test plan
- `meta_fields["outline"]` is populated
- Existing `meta_fields` data is preserved (not overwritten) when the outline is added
- `manual` parser also persists outlines
- Outline has the shape `[{"title": "Chapter 1", "depth": 0}, ...]`

Related: #9921 (Deterministic Document Access Layer)
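The merge-not-overwrite behaviour and the outline shape checked in the test plan can be illustrated with a small sketch. It assumes `meta_fields` behaves like a plain dict. `outline_to_meta` and `merge_outline` are hypothetical names, not functions from RAGFlow, and the real persistence goes through `DocMetadataService.update_document_metadata()`, whose signature is not shown here.

```python
import json
from typing import Any, Optional


def outline_to_meta(outlines: list[tuple[str, int]]) -> list[dict[str, Any]]:
    """Convert (title, depth) tuples into the JSON-friendly shape from the PR description."""
    return [{"title": title, "depth": depth} for title, depth in outlines]


def merge_outline(
    meta_fields: Optional[dict[str, Any]],
    outlines: list[tuple[str, int]],
) -> dict[str, Any]:
    """Return a copy of meta_fields with "outline" added; other keys are left untouched."""
    merged = dict(meta_fields or {})
    if outlines:
        merged["outline"] = outline_to_meta(outlines)
    return merged


# Illustrative data only: "author" stands in for any pre-existing metadata.
existing = {"author": "Jane Doe"}
updated = merge_outline(existing, [("Chapter 1", 0), ("Section 1.1", 1)])
serialized = json.dumps(updated)  # what would land in the JSON column
```

Building a new dict instead of mutating the input keeps the sketch side-effect free; whether the actual service merges in place or copies is not stated in the PR description.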
🤖 Generated with Claude Code