Description
When a PDF has an object stream with thousands of elements in it, reader._get_object_from_stream performs poorly, because that function, and the for loop inside it, can be invoked many thousands of times. For one particular PDF I have, this results in 99% of the program's time being spent in that one function, which adds up to dozens of seconds over the course of processing the PDF.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.8.0-87-generic-x86_64-with-glibc2.39
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.3.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=12.0.0
Code + PDF
Unfortunately the PDF is proprietary, so I cannot share it, but the part that I think matters is:
1 0 obj
<</Length 323433 /Filter/FlateDecode/Type /ObjStm /N 16718 /First 247616 >>
stream
So far I have been unsuccessful in creating an artificial PDF with the same issue purely from code.
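For anyone trying to reproduce this, one way to check whether a PDF keeps most of its objects inside one large object stream is to inspect the reader's xref_objStm map. That is a private attribute, so treat the following as a diagnostic sketch only; "test.pdf" stands in for the affected file:
from collections import Counter

import pypdf

reader = pypdf.PdfReader("test.pdf")
# xref_objStm maps an object number to (object stream number, index within that stream)
per_stream = Counter(stm for stm, _idx in reader.xref_objStm.values())
for stm, count in per_stream.most_common(5):
    print(f"object stream {stm} holds {count} compressed objects")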
A simple loop that copies pages from a reader to a writer can take many seconds to complete:
import time

import pypdf

reader = pypdf.PdfReader("test.pdf")
writer = pypdf.PdfWriter()
total_start = time.time()
for page in range(0, 1):
    print("Adding page:", page)
    start = time.time()
    writer.add_page(reader.pages[page])
    end = time.time()
    print(" took", end - start, "seconds")
I put in a hack to fix this by caching the offsets within the stream so that they can be computed a single time and re-used later:
def _get_object_from_stream(
    self, indirect_reference: IndirectObject
) -> Union[int, PdfObject, str]:
    # indirect reference to object in object stream
    # read the entire object stream into memory
    stmnum, idx = self.xref_objStm[indirect_reference.idnum]
    obj_stm: EncodedStreamObject = IndirectObject(stmnum, 0, self).get_object()  # type: ignore
    # This is an xref to a stream, so its type better be a stream
    assert cast(str, obj_stm["/Type"]) == "/ObjStm"
    stream_data = BytesIO(obj_stm.get_data())
    # One offset cache per object stream, keyed by id() of the decoded stream object
    if id(obj_stm) not in self.cache:
        self.cache[id(obj_stm)] = {}
    cache1 = self.cache[id(obj_stm)]
    if indirect_reference.idnum in cache1:
        # Seek directly to the cached header position of the requested object,
        # so the scan below finds it on its first iteration instead of
        # re-reading all of the preceding /N entries
        value = cache1[indirect_reference.idnum]
        stream_data.seek(value["pos"], 0)
    for i in range(obj_stm["/N"]):  # type: ignore
        start_pos = stream_data.tell()
        read_non_whitespace(stream_data)
        stream_data.seek(-1, 1)
        objnum = NumberObject.read_from_stream(stream_data)
        read_non_whitespace(stream_data)
        stream_data.seek(-1, 1)
        offset = NumberObject.read_from_stream(stream_data)
        read_non_whitespace(stream_data)
        stream_data.seek(-1, 1)
        # cache the offset and header position of every object we pass over
        cache1[objnum] = {
            "offset": offset,
            "pos": start_pos,
        }
        # ... (excerpt ends here; the remainder of the function is not shown)
Where the cache is created in the constructor of PdfReader:
def __init__(self, ...):
    self.cache = {}
Using this hacky cache reduces the time to parse the PDF from multiple seconds to under one second. If this solution (using a cache) seems reasonable, I can clean up the code and submit a PR. It would probably also make sense to make the cache a weakref dictionary, as sketched below, so that memory does not balloon unnecessarily.
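A minimal sketch of that weakref idea, keying the cache by the decoded stream object itself rather than by id(). It assumes the stream object accepts weak references (DictionaryObject is a plain dict subclass, so it should); ObjStmOffsetCache and for_stream are hypothetical names used only for illustration:
import weakref
from typing import Dict

class ObjStmOffsetCache:
    """Per-stream offset cache whose entries vanish when the stream object is garbage-collected."""

    def __init__(self) -> None:
        # Maps a decoded /ObjStm object to {objnum: {"offset": ..., "pos": ...}}
        self._by_stream: weakref.WeakKeyDictionary = weakref.WeakKeyDictionary()

    def for_stream(self, obj_stm) -> Dict[int, Dict[str, int]]:
        # Create an empty per-stream dict on first use
        return self._by_stream.setdefault(obj_stm, {})
Inside _get_object_from_stream, cache1 = self.cache.for_stream(obj_stm) would then replace the id()-keyed lookup, and a stream's cached offsets would be released automatically once the reader drops its reference to that stream object.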