feat: persist PDF bookmark outline as document metadata #13287

Open
yuch85 wants to merge 1 commit into infiniflow:main from
yuch85:feat/pdf-outline-persistence

@yuch85 yuch85 commented Mar 1, 2026

Summary

PDF files often contain a bookmark/outline tree (a table of contents built into the file by the authoring tool). RAGFlow's pdf_parser.outlines already extracts these as (title, depth) tuples via pypdf, but they are only used transiently during chunking (the manual parser uses them for hierarchy detection) and are then discarded.

This PR persists the outline as doc.meta_fields["outline"] — a JSON array of {"title": str, "depth": int} objects — so downstream features can use the structural information.
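
As a rough illustration of the target format, the sketch below converts `(title, depth)` tuples into the JSON array described above. The helper name `outline_to_meta` and the sample titles are hypothetical and not taken from the PR diff.

```python
import json


def outline_to_meta(outlines):
    """Convert (title, depth) tuples into a JSON-serializable list of dicts.

    Hypothetical helper for illustration; the PR's actual code may differ.
    """
    return [{"title": str(title), "depth": int(depth)} for title, depth in outlines]


# Sample data standing in for pdf_parser.outlines (made up for this sketch).
outlines = [("Chapter 1", 0), ("Section 1.1", 1), ("Chapter 2", 0)]
meta_outline = outline_to_meta(outlines)

# The result survives a JSON round trip unchanged, so it can be stored in a JSON column.
encoded = json.dumps(meta_outline)
```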

Why this matters

  • Complementary to toc_extraction — the existing toc_extraction feature uses LLM calls to generate a TOC and only works for the naive parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure.
  • Document navigation — frontends can render a clickable TOC from the outline
  • Entity extraction — the outline provides a structural map for identifying document sections and key topics
  • Search result context — knowing which section a chunk belongs to helps users evaluate relevance
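
To make the navigation use case concrete, here is a minimal sketch of how a consumer might turn the flat outline into an indented TOC. The function `render_toc` is hypothetical; it only assumes the `{"title", "depth"}` structure described in this PR.

```python
def render_toc(outline, indent="  "):
    """Render a flat outline as an indented bullet list, one entry per line.

    Assumes each entry is a dict with "title" (str) and "depth" (int >= 0).
    """
    return "\n".join(
        f"{indent * entry['depth']}- {entry['title']}" for entry in outline
    )


# Sample outline (made up for this sketch).
sample = [
    {"title": "Chapter 1", "depth": 0},
    {"title": "Section 1.1", "depth": 1},
    {"title": "Chapter 2", "depth": 0},
]
toc_text = render_toc(sample)
```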

Changes

| File | Change | LOC |
| --- | --- | --- |
| `rag/app/naive.py` | Attach `pdf_parser.outlines` as `__outline__` on first chunk dict | ~7 |
| `rag/app/manual.py` | Same for the manual parser | ~5 |
| `rag/svr/task_executor.py` | Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` | ~12 |

Design decisions

  • Transient key pattern: The outline is passed from parser → task_executor via __outline__ on the first chunk dict, then removed before indexing. This follows the same pattern as metadata_obj for LLM-generated metadata.
  • No schema changes: Uses the existing meta_fields JSON column on the document table.
  • Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. If persistence fails, it logs a warning and continues — parsing is not interrupted.
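
The transient key pattern and the graceful degradation described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: `attach_outline`, `pop_outline`, and `persist_outline` are hypothetical names, and `save` is a stand-in callback for `DocMetadataService.update_document_metadata()`, whose real signature is not shown here.

```python
import logging

OUTLINE_KEY = "__outline__"  # transient key named in this PR


def attach_outline(chunks, outlines):
    """Parser side: stash the outline on the first chunk dict, if there is one."""
    if chunks and outlines:
        chunks[0][OUTLINE_KEY] = [{"title": t, "depth": d} for t, d in outlines]
    return chunks


def pop_outline(chunks):
    """Executor side: remove the transient key so it never reaches the index."""
    if not chunks:
        return None
    return chunks[0].pop(OUTLINE_KEY, None)


def persist_outline(chunks, save):
    """Persist the outline via `save` (hypothetical callback); never raise."""
    outline = pop_outline(chunks)
    if not outline:
        return False  # no outline (e.g. scanned PDF): nothing is stored
    try:
        save(outline)
    except Exception:
        # Persistence failure is logged and parsing continues.
        logging.warning("Failed to persist PDF outline", exc_info=True)
        return False
    return True
```

Popping the key before calling `save` means the chunk is clean for indexing even when persistence fails.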

Backward compatibility

  • Fully backward compatible — no existing fields, behavior, or schemas changed
  • PDFs without outlines are unaffected
  • Existing meta_fields data is preserved (merged, not overwritten)
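
The "merged, not overwritten" behavior can be pictured with a plain dict merge. The helper `merge_outline` below is hypothetical and only illustrates the intended semantics; the actual merge is performed by the existing metadata service.

```python
def merge_outline(meta_fields, outline):
    """Return a copy of meta_fields with "outline" added; other keys untouched.

    If the outline is empty or None, no "outline" key is added.
    """
    merged = dict(meta_fields or {})
    if outline:
        merged["outline"] = outline
    return merged


# Sample metadata (made up for this sketch).
existing = {"author": "Jane Doe", "year": 2024}
merged = merge_outline(existing, [{"title": "Chapter 1", "depth": 0}])
```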

Test plan

  • Parse a PDF with bookmarks (e.g. any multi-chapter document), verify meta_fields["outline"] is populated
  • Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields
  • Verify existing meta_fields data is preserved (not overwritten) when outline is added
  • Verify manual parser also persists outlines
  • Verify outline JSON structure: [{"title": "Chapter 1", "depth": 0}, ...]
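
The structure check in the last item could be automated along these lines. `is_valid_outline` is a hypothetical helper, and the requirement that depth be non-negative is an assumption of this sketch rather than something stated in the PR.

```python
def is_valid_outline(outline):
    """Check that outline is a list of {"title": str, "depth": int} dicts."""
    if not isinstance(outline, list):
        return False
    for entry in outline:
        if not isinstance(entry, dict) or set(entry) != {"title", "depth"}:
            return False
        if not isinstance(entry["title"], str):
            return False
        depth = entry["depth"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(depth, bool) or not isinstance(depth, int) or depth < 0:
            return False
    return True
```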

Related: #9921 (Deterministic Document Access Layer)

🤖 Generated with Claude Code

PDF files often contain a bookmark/outline tree (table of contents
built into the file structure by the authoring tool). RAGFlow's
`pdf_parser.outlines` already extracts these `(title, depth)` tuples,
but they are used ephemerally during chunking and then discarded.

This commit persists the outline as `doc.meta_fields["outline"]` so
downstream features can use it:
- TOC-enhanced retrieval (complementary to the LLM-based
  `toc_extraction` — this is free and works for all parser types)
- Document navigation in the frontend
- Entity extraction (structural map of the document)

Changes:
- `rag/app/naive.py`: Attach `pdf_parser.outlines` as `__outline__`
  on the first chunk dict (transient key, same pattern as
  `metadata_obj`)
- `rag/app/manual.py`: Same attachment for the manual parser
- `rag/svr/task_executor.py`: Extract `__outline__` from the first
  chunk after `build_chunks()`, persist via
  `DocMetadataService.update_document_metadata()`

No schema changes — uses existing `meta_fields` JSON column.
Zero cost — outlines are already extracted by pypdf, just not saved.

@dosubot (bot) added labels on Mar 1, 2026: size:S (This PR changes 10-29 lines, ignoring generated files.) and 💞 feature (Feature request, pull request that fulfills a new feature.)