Intermittent `IndexError` when accessing `PdfReader.pages` with `ThreadPoolExecutor` #3024

blairfrandeen · 2024-12-28T21:17:00Z

I have code that creates some pages of a PDF from SVG files, copies other pages from the original source, and stitches together a final PDF of marked-up (SVG) and non-marked-up pages. The SVG-to-PDF conversion is done with a call to subprocess.run, so I run it with a ThreadPoolExecutor, which speeds things up significantly.

I'm getting intermittent failures with IndexError: Sequence index out of range, so I investigated.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.8.0-49-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '42.0.8'), PIL=10.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from concurrent.futures import ThreadPoolExecutor
from pypdf import PdfReader


while True:  # Keep going until we crash; this is an intermittent failure mode
    fp = "bigfile.pdf"  # 12 pages, ~50 MB
    num_pages = 12  # known beforehand
    res = dict()
    reader = PdfReader(fp)
    with ThreadPoolExecutor() as executor:
        for page_num in range(num_pages):
            res[page_num] = executor.submit(lambda: reader.pages[page_num - 1])
            # Note: Remove the next two lines and it will fail a lot quicker
            # However these lines are interesting to watch scroll by: Sometimes the
            # number of pages is 0, sometimes it doubles, and often we get
            # "Overwriting cache" errors
            if len(reader.pages) != num_pages:
                print(f"{len(reader.pages)=}")

        for page_num in range(num_pages):
            _ = res[page_num].result()
    print("Succeeded")  # Note if we didn't crash. Without the page number check,
    # the most this will succeed is about 3 or 4 times

Here's a link to the PDF that I used: https://www.dropbox.com/scl/fi/xf2z8fuw2knoxqo81f6nf/archive-dwgs-1952.pdf?rlkey=vy6flraqwod8ogfjcadbyqt6z&st=daqyd3j0&dl=0

Traceback

This is the complete traceback I see:

❯ python3 pypdferr.py
Succeeded
Traceback (most recent call last):
  File "/home/blair/workspace/gfac/pypdferr.py", line 21, in <module>
    _ = res[page_num].result()
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blair/workspace/gfac/pypdferr.py", line 12, in <lambda>
    res[page_num] = executor.submit(lambda: reader.pages[page_num - 1])
                                            ~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/blair/workspace/gfac/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2544, in __getitem__
    raise IndexError("Sequence index out of range")
IndexError: Sequence index out of range

Note that when I run this code with a large-ish file with many pages like this one I get some interesting output:

len(reader.pages)=1
len(reader.pages)=33
len(reader.pages)=43
Succeeded
len(reader.pages)=7
len(reader.pages)=35
len(reader.pages)=44
Succeeded
len(reader.pages)=1
len(reader.pages)=14
len(reader.pages)=14
Succeeded
len(reader.pages)=4
len(reader.pages)=41
len(reader.pages)=50

What this tells me is that instantiating PdfReader() returns a PdfReader object that is not yet fully initialized in the way we would expect.


stefan6419846 · 2024-12-29T06:23:53Z

Although not completely identical, this sounds similar to #2686. The issue is that pypdf does not make any guarantees about being thread-safe when using the same container across multiple threads. In your case, all threads seek on the input stream simultaneously (we do some sort of lazy loading here instead of reading everything on initialization), which can lead to all sorts of strange side-effects - some of which you already saw.
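To make the seek problem concrete, here is a deliberately simplified illustration. It is not pypdf's actual code: it only assumes that two tasks share one file-like object and that each task does a seek followed by a read. The interleaving is written out by hand so the result is deterministic; with real threads the same ordering can happen at random.

```python
import io

# Stand-in for the shared PDF file object; the content is made up for the demo.
stream = io.BytesIO(b"AAAAABBBBB")


def seek_to(offset):
    # Moves the single read position that every user of the stream shares.
    stream.seek(offset)


def read_chunk(size=5):
    # Reads from wherever the shared position currently is.
    return stream.read(size)


# Undisturbed: task A seeks to its region and reads it.
seek_to(0)
task_a_alone = read_chunk()

# Interleaved: task B seeks between task A's seek and task A's read.
seek_to(0)                   # task A positions the stream
seek_to(5)                   # task B moves the shared position
task_a_raced = read_chunk()  # task A now receives task B's bytes
```

In the undisturbed case task A gets its own bytes, while in the interleaved case it silently gets the bytes task B was about to read. A parser that trusts such a read can end up with a wrong object or a wrong page count.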

The only reliable solution here is to use one container per thread, i.e. do not share PdfReader instances between multiple threads.
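One possible way to get one reader per thread is a small thread-local cache, sketched below. The helper names (`reader_for_this_thread`, `page_count`) and the optional `open_reader` parameter are assumptions made for this illustration and are not part of pypdf; by default the sketch opens the file with `pypdf.PdfReader`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_thread_state = threading.local()  # each thread sees its own attributes


def reader_for_this_thread(path, open_reader=None):
    """Return a reader owned by the calling thread, creating it on first use."""
    if open_reader is None:
        # Imported lazily so the caching pattern itself has no hard dependency.
        from pypdf import PdfReader
        open_reader = PdfReader
    cache = getattr(_thread_state, "readers", None)
    if cache is None:
        cache = _thread_state.readers = {}
    if path not in cache:
        cache[path] = open_reader(path)
    return cache[path]


def page_count(path, open_reader=None):
    # Runs inside a worker thread and only touches that thread's own reader.
    return len(reader_for_this_thread(path, open_reader).pages)
```

Submitting `page_count` to a `ThreadPoolExecutor` then opens at most one reader per worker thread and file, at the cost of parsing the file once per thread. The readers stay cached for as long as their worker thread lives.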

blairfrandeen · 2024-12-29T15:22:43Z

Makes sense. It may be nice to have a non-lazy version of PdfReader for applications like this, although admittedly I don't appreciate the scope of what such a feature might entail.

I solved my own problem by simply enforcing that every page gets loaded before I start sharing the PdfReader between threads:

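from pypdf import PdfReader


class EagerReader(PdfReader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Touch every page once so the page list is built before threads share the reader
        _ = list(self.pages)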

I was already subclassing PdfReader in my own application, so it ended up being a one-line fix.

stefan6419846 · 2024-12-30T06:24:18Z

I suspect that it is rather unlikely that we are going to have a non-lazy version of pypdf with proper testing and all the side-effects accounted for. The safest way still is to use one reader per thread.
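For a pipeline like the one described at the top of this issue, another option is to keep pypdf out of the worker threads entirely and parallelize only the external conversion calls. The sketch below assumes the slow step is a command run through `subprocess.run`; the function names and the placeholder commands are made up for illustration, since the actual converter is not shown in this thread.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_converter(command):
    """Run one external conversion; no pypdf objects are touched here."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


def convert_all(commands, max_workers=4):
    # Threads only wait on child processes, so no reader is shared between them.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_converter, commands))


# Placeholder commands: a real pipeline would invoke its SVG-to-PDF tool here.
demo_commands = [[sys.executable, "-c", f"print({n})"] for n in range(3)]
```

Once `convert_all` has returned, the main thread can open the source and converted files and stitch the pages together, so every PdfReader is only ever used from a single thread.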

As this is a rather specific request with no simple solution, I am going to close this for now. If there really is further interest and someone finds a way to properly tackle this, I am open to re-opening this.

blairfrandeen changed the title ~~Intermittent IndexError when accessing PdfReader.pages with ThreadPoolExecutor`~~ Intermittent IndexError when accessing PdfReader.pages with ThreadPoolExecutor Dec 28, 2024

stefan6419846 closed this as not planned Dec 30, 2024
