Description
When a PDF has an object stream with thousands of elements in it, reader._get_object_from_stream performs poorly, because that function, and the for loop inside it, can be invoked many thousands of times. For one particular PDF I have, this results in 99% of the program's time being spent in that one function, which adds up to dozens of seconds over the course of processing the PDF.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.8.0-87-generic-x86_64-with-glibc2.39
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.3.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=12.0.0
Code + PDF
Unfortunately the PDF is proprietary, so I cannot share it, but the part that I think matters is:
1 0 obj
<</Length 323433 /Filter/FlateDecode/Type /ObjStm /N 16718 /First 247616 >>
stream
So far I have been unsuccessful in creating an artificial PDF with the same issue purely from code.
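For anyone trying to reproduce this, one way to check whether a PDF keeps most of its objects inside one large object stream is to inspect the reader's xref_objStm map. That is a private attribute, so treat the following as a diagnostic sketch only; "test.pdf" stands in for the affected file:
from collections import Counter

import pypdf

reader = pypdf.PdfReader("test.pdf")
# xref_objStm maps an object number to (object stream number, index within that stream)
per_stream = Counter(stm for stm, _idx in reader.xref_objStm.values())
for stm, count in per_stream.most_common(5):
    print(f"object stream {stm} holds {count} compressed objects")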
A simple loop that copies pages from a reader to a writer can take many seconds to complete:
import time

import pypdf

reader = pypdf.PdfReader("test.pdf")
writer = pypdf.PdfWriter()
total_start = time.time()
for page in range(0, 1):
    print("Adding page:", page)
    start = time.time()
    writer.add_page(reader.pages[page])
    end = time.time()
    print(" took", end - start, "seconds")
I put in a hack to fix this by caching the offsets within the stream so that they can be computed a single time and re-used later:
def _get_object_from_stream(
    self, indirect_reference: IndirectObject
) -> Union[int, PdfObject, str]:
    # indirect reference to object in object stream
    # read the entire object stream into memory
    stmnum, idx = self.xref_objStm[indirect_reference.idnum]
    obj_stm: EncodedStreamObject = IndirectObject(stmnum, 0, self).get_object()  # type: ignore
    # This is an xref to a stream, so its type better be a stream
    assert cast(str, obj_stm["/Type"]) == "/ObjStm"
    stream_data = BytesIO(obj_stm.get_data())
    # One offset cache per object stream, keyed by id() of the decoded stream object
    if id(obj_stm) not in self.cache:
        self.cache[id(obj_stm)] = {}
    cache1 = self.cache[id(obj_stm)]
    if indirect_reference.idnum in cache1:
        # Seek directly to the cached header position of the requested object,
        # so the scan below finds it on its first iteration instead of
        # re-reading all of the preceding /N entries
        value = cache1[indirect_reference.idnum]
        stream_data.seek(value["pos"], 0)
    for i in range(obj_stm["/N"]):  # type: ignore
        start_pos = stream_data.tell()
        read_non_whitespace(stream_data)
        stream_data.seek(-1, 1)
        objnum = NumberObject.read_from_stream(stream_data)
        read_non_whitespace(stream_data)
        stream_data.seek(-1, 1)
        offset = NumberObject.read_from_stream(stream_data)
        read_non_whitespace(stream_data)
        stream_data.seek(-1, 1)
        # cache the offset and header position of every object we pass over
        cache1[objnum] = {
            "offset": offset,
            "pos": start_pos,
        }
        # ... (excerpt ends here; the remainder of the function is not shown)
Where the cache is created in the constructor of PdfReader:
def __init__(self, ...):
    self.cache = {}
Using this hacky cache reduces the time to parse the PDF from multiple seconds to under one second. If this solution (using a cache) seems reasonable, I can clean up the code and submit a PR. It would probably also make sense to make the cache a weakref dictionary, as sketched below, so that memory does not balloon unnecessarily.
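A minimal sketch of that weakref idea, keying the cache by the decoded stream object itself rather than by id(). It assumes the stream object accepts weak references (DictionaryObject is a plain dict subclass, so it should); ObjStmOffsetCache and for_stream are hypothetical names used only for illustration:
import weakref
from typing import Dict

class ObjStmOffsetCache:
    """Per-stream offset cache whose entries vanish when the stream object is garbage-collected."""

    def __init__(self) -> None:
        # Maps a decoded /ObjStm object to {objnum: {"offset": ..., "pos": ...}}
        self._by_stream: weakref.WeakKeyDictionary = weakref.WeakKeyDictionary()

    def for_stream(self, obj_stm) -> Dict[int, Dict[str, int]]:
        # Create an empty per-stream dict on first use
        return self._by_stream.setdefault(obj_stm, {})
Inside _get_object_from_stream, cache1 = self.cache.for_stream(obj_stm) would then replace the id()-keyed lookup, and a stream's cached offsets would be released automatically once the reader drops its reference to that stream object.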