feat: persist PDF bookmark outline as document metadata by yuch85 · Pull Request #13287 · infiniflow/ragflow

yuch85 · 2026-03-01T06:54:00Z

Summary

PDF files often contain a bookmark/outline tree (table of contents built into the file by the authoring tool). RAGFlow's pdf_parser.outlines already extracts these (title, depth) tuples via pypdf, but they are used ephemerally during chunking (manual parser uses them for hierarchy detection) and then discarded.

This PR persists the outline as doc.meta_fields["outline"] — a JSON array of {"title": str, "depth": int} objects — so downstream features can use the structural information.
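As a rough illustration of the stored shape, the sketch below converts `(title, depth)` tuples into the JSON array described above. The helper name `outline_to_meta` and the sample data are hypothetical and not taken from the PR diff.

```python
import json


def outline_to_meta(outlines):
    """Convert (title, depth) tuples into a list of {"title", "depth"} dicts.

    Hypothetical helper; the PR may build this structure differently.
    """
    return [{"title": str(title), "depth": int(depth)} for title, depth in outlines]


# Sample data standing in for pdf_parser.outlines (assumed shape: (title, depth)).
sample = [("Chapter 1", 0), ("1.1 Background", 1), ("Chapter 2", 0)]

# ensure_ascii=False keeps non-ASCII bookmark titles readable in the stored JSON.
outline_json = json.dumps(outline_to_meta(sample), ensure_ascii=False)
```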

Why this matters

- Complementary to toc_extraction: the existing toc_extraction feature uses LLM calls to generate a TOC and only works for the naive parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure.
- Document navigation: frontends can render a clickable TOC from the outline
- Entity extraction: the outline provides a structural map for identifying document sections and key topics
- Search result context: knowing which section a chunk belongs to helps users evaluate relevance
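To make the navigation use case concrete, here is a minimal sketch of how a consumer might turn the stored outline into an indented list. `render_toc` is a hypothetical function; it assumes `depth` is a non-negative integer with 0 as the top level, as in the example in the test plan.

```python
def render_toc(outline, indent="  "):
    """Render outline entries as an indented bullet list, one line per entry."""
    lines = []
    for entry in outline:
        # Deeper entries get one extra indent unit per depth level.
        lines.append(f"{indent * entry['depth']}- {entry['title']}")
    return "\n".join(lines)


toc = render_toc(
    [
        {"title": "Chapter 1", "depth": 0},
        {"title": "1.1 Background", "depth": 1},
        {"title": "Chapter 2", "depth": 0},
    ]
)
```

A real frontend would presumably attach page targets or anchors as well; the outline described in this PR carries only titles and depths.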

Changes

| File | Change | LOC |
| --- | --- | --- |
| `rag/app/naive.py` | Attach `pdf_parser.outlines` as `__outline__` on first chunk dict | ~7 |
| `rag/app/manual.py` | Same for the manual parser | ~5 |
| `rag/svr/task_executor.py` | Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` | ~12 |

Design decisions

- Transient key pattern: The outline is passed from parser → task_executor via __outline__ on the first chunk dict, then removed before indexing. This follows the same pattern as metadata_obj for LLM-generated metadata.
- No schema changes: Uses the existing meta_fields JSON column on the document table.
- Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. If persistence fails, it logs a warning and continues; parsing is not interrupted.
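The transient key hand-off and the graceful degradation described above can be sketched as follows. This is not the PR's code: `attach_outline`, `extract_and_persist`, and the `persist` callable are hypothetical stand-ins. In particular, `persist` takes the place of `DocMetadataService.update_document_metadata()`, whose real signature is not shown here.

```python
import logging

OUTLINE_KEY = "__outline__"  # transient key named in the PR description


def attach_outline(chunks, outlines):
    """Parser side: stash the outline on the first chunk dict, if both exist."""
    if chunks and outlines:
        chunks[0][OUTLINE_KEY] = [{"title": t, "depth": d} for t, d in outlines]
    return chunks


def extract_and_persist(chunks, persist):
    """Executor side: pop the transient key before indexing, then persist.

    `persist` is a hypothetical callable that receives {"outline": [...]}.
    """
    outline = chunks[0].pop(OUTLINE_KEY, None) if chunks else None
    if not outline:
        return None  # no bookmarks (e.g. scanned PDF): store nothing
    try:
        persist({"outline": outline})
    except Exception:
        # Persistence failure must not interrupt parsing.
        logging.warning("Failed to persist PDF outline; continuing", exc_info=True)
    return outline
```

Popping the key rather than reading it ensures that the chunk dicts reaching the index never carry `__outline__`.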

Backward compatibility

- Fully backward compatible: no existing fields, behavior, or schemas changed
- PDFs without outlines are unaffected
- Existing meta_fields data is preserved (merged, not overwritten)

Test plan

- Parse a PDF with bookmarks (e.g. any multi-chapter document), verify meta_fields["outline"] is populated
- Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields
- Verify existing meta_fields data is preserved (not overwritten) when outline is added
- Verify manual parser also persists outlines
- Verify outline JSON structure: [{"title": "Chapter 1", "depth": 0}, ...]
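The structure and merge checks in this plan could be automated along these lines. Both helpers are hypothetical. The strict rules (exactly two keys, non-negative integer depth) are assumptions about what a valid outline should look like, not constraints stated in the PR.

```python
def is_valid_outline(outline):
    """Return True if outline is a list of {"title": str, "depth": int} dicts."""
    if not isinstance(outline, list):
        return False
    for entry in outline:
        if not isinstance(entry, dict) or set(entry) != {"title", "depth"}:
            return False
        if not isinstance(entry["title"], str):
            return False
        depth = entry["depth"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(depth, int) or isinstance(depth, bool) or depth < 0:
            return False
    return True


def merge_outline(meta_fields, outline):
    """Return a copy of meta_fields with "outline" added; other keys are kept."""
    return {**meta_fields, "outline": outline}
```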

Related: #9921 (Deterministic Document Access Layer)

🤖 Generated with Claude Code

PDF files often contain a bookmark/outline tree (table of contents built into the file structure by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples, but they are used ephemerally during chunking and then discarded.

This commit persists the outline as `doc.meta_fields["outline"]` so downstream features can use it:

- TOC-enhanced retrieval (complementary to the LLM-based `toc_extraction`; this is free and works for all parser types)
- Document navigation in the frontend
- Entity extraction (structural map of the document)

Changes:

- `rag/app/naive.py`: Attach `pdf_parser.outlines` as `__outline__` on the first chunk dict (transient key, same pattern as `metadata_obj`)
- `rag/app/manual.py`: Same attachment for the manual parser
- `rag/svr/task_executor.py`: Extract `__outline__` from the first chunk after `build_chunks()`, persist via `DocMetadataService.update_document_metadata()`

No schema changes: uses existing `meta_fields` JSON column. Zero cost: outlines are already extracted by pypdf, just not saved.

dosubot bot added the size:S (This PR changes 10-29 lines, ignoring generated files.) and 💞 feature (Feature request, pull request that fulfills a new feature.) labels on Mar 1, 2026

yuch85 wants to merge 1 commit into infiniflow:main from yuch85:feat/pdf-outline-persistence
