[Question]: when parsing a PDF, is it possible to include page number, into individual node's meta data ? #17546

jzhao62 · 2025-01-18T07:06:59Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

As the title says, it would be great if each node's metadata could contain the page number pointing to where its text originally occurs in the PDF.

dosubot · 2025-01-18T07:07:58Z

Yes. When you parse a PDF with the PyMuPDFReader class, the page number can be included in the metadata of each document created from a page. This is done by adding the page number to the extra_info dictionary under the key "source". Here's a code snippet that demonstrates how to do this:

return [
    Document(
        text=page.get_text().encode("utf-8"),
        extra_info=dict(
            extra_info,
            **{
                "source": f"{page.number+1}",
            },
        ),
    )
    for page in doc
]

In this snippet, page.number is PyMuPDF's zero-based page index, so page.number+1 gives the page number starting from 1. It is stored as a string under the "source" key in the metadata of each document node [1].
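To make the idea concrete, here is a minimal, library-free sketch of the same pattern: one document per page with the 1-based page number under "source", followed by a naive split into smaller nodes that each copy the page metadata. The names Doc, pages_to_docs, and split_into_nodes are hypothetical stand-ins for illustration, not LlamaIndex or PyMuPDF APIs, and the fixed-size splitting is only an assumption about how chunking might be done.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document or node (not the LlamaIndex class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_to_docs(page_texts, extra_info=None):
    """Build one Doc per page, storing the 1-based page number under "source"."""
    base = dict(extra_info or {})
    return [
        Doc(text=text, metadata={**base, "source": str(index + 1)})
        for index, text in enumerate(page_texts)
    ]


def split_into_nodes(doc, chunk_size):
    """Split a Doc into fixed-size chunks; each chunk copies the page metadata."""
    return [
        Doc(text=doc.text[start:start + chunk_size], metadata=dict(doc.metadata))
        for start in range(0, len(doc.text), chunk_size)
    ]


# Two fake "pages" standing in for text extracted from a PDF.
pages = ["first page text", "second page"]
docs = pages_to_docs(pages, extra_info={"file_path": "example.pdf"})
nodes = [node for doc in docs for node in split_into_nodes(doc, chunk_size=8)]

for node in nodes:
    # Every chunk still knows which page it came from.
    print(node.metadata["source"], repr(node.text))
```

Because the page number is attached before splitting, any chunking step that copies the parent metadata keeps the page reference on every resulting node.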

jzhao62 added the "question" label (Further information is requested) on Jan 18, 2025
