Intermittent IndexError when accessing PdfReader.pages with ThreadPoolExecutor #3024

Closed
blairfrandeen opened this issue Dec 28, 2024 · 3 comments

Comments

@blairfrandeen

I have code that builds some pages of a PDF from SVG files, copies the other pages from the original source, and stitches together a final PDF of marked-up (SVG) and unmarked pages. The SVG-to-PDF conversion is a call to subprocess.run, so I run it in a ThreadPoolExecutor, which speeds things up significantly.

I'm getting intermittent failures with IndexError: Sequence index out of range, so I investigated.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.8.0-49-generic-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '42.0.8'), PIL=10.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from concurrent.futures import ThreadPoolExecutor
from pypdf import PdfReader


while True:  # Keep going until we crash; this is an intermittent failure mode
    fp = "bigfile.pdf"  # 12 pages, ~50 MB
    num_pages = 12  # known beforehand
    res = dict()
    reader = PdfReader(fp)
    with ThreadPoolExecutor() as executor:
        for page_num in range(num_pages):
            res[page_num] = executor.submit(lambda: reader.pages[page_num - 1])
            # Note: Remove the next two lines and it will fail a lot quicker
            # However these lines are interesting to watch scroll by: Sometimes the
            # number of pages is 0, sometimes it doubles, and often we get
            # "Overwriting cache" errors
            if len(reader.pages) != num_pages:
                print(f"{len(reader.pages)=}")

        for page_num in range(num_pages):
            _ = res[page_num].result()
    print("Succeeded")  # Note if we didn't crash. Without the page number check,
    # the most this will succeed is about 3 or 4 times

Here's a link to the PDF that I used: https://www.dropbox.com/scl/fi/xf2z8fuw2knoxqo81f6nf/archive-dwgs-1952.pdf?rlkey=vy6flraqwod8ogfjcadbyqt6z&st=daqyd3j0&dl=0
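
A side note on the reproduction script, separate from the thread-safety question: a lambda created inside a loop looks up page_num when it runs, not when it is submitted, so a task may see a later value than intended. Also, page_num - 1 is -1 on the first pass, which ordinary Python sequences treat as "last element". The sketch below uses a hypothetical page_index helper with no pypdf involved, and shows the difference between a late-binding lambda and passing the value as an argument to submit:

```python
from concurrent.futures import ThreadPoolExecutor


def page_index(n):
    # Hypothetical stand-in for the index expression used in the script.
    return n - 1


# Late binding: every lambda reads page_num when it is called,
# which here is after the loop has finished (page_num == 2).
late = []
for page_num in range(3):
    late.append(lambda: page_index(page_num))
late_results = [f() for f in late]
print(late_results)  # [1, 1, 1]

# Passing the value as an argument fixes it at submit time.
with ThreadPoolExecutor() as executor:
    futures = [executor.submit(page_index, n) for n in range(3)]
    bound_results = [f.result() for f in futures]
print(bound_results)  # [-1, 0, 1]
```

With a thread pool the lambdas may run while the loop is still advancing, so the values they see depend on scheduling, not just on the final loop value.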

Traceback

This is the complete traceback I see:

❯ python3 pypdferr.py
Succeeded
Traceback (most recent call last):
  File "/home/blair/workspace/gfac/pypdferr.py", line 21, in <module>
    _ = res[page_num].result()
        ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blair/workspace/gfac/pypdferr.py", line 12, in <lambda>
    res[page_num] = executor.submit(lambda: reader.pages[page_num - 1])
                                            ~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/blair/workspace/gfac/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2544, in __getitem__
    raise IndexError("Sequence index out of range")
IndexError: Sequence index out of range

Note that when I run this code with a large-ish file with many pages like this one I get some interesting output:

len(reader.pages)=1
len(reader.pages)=33
len(reader.pages)=43
Succeeded
len(reader.pages)=7
len(reader.pages)=35
len(reader.pages)=44
Succeeded
len(reader.pages)=1
len(reader.pages)=14
len(reader.pages)=14
Succeeded
len(reader.pages)=4
len(reader.pages)=41
len(reader.pages)=50

This tells me that PdfReader() returns a PdfReader object that is not yet fully initialized in the way we would expect.

@blairfrandeen changed the title from "Intermittent IndexError when accessing PdfReader.pages with ThreadPoolExecutor`" to "Intermittent IndexError when accessing PdfReader.pages with ThreadPoolExecutor" on Dec 28, 2024
@stefan6419846
Collaborator

Although not completely identical, this sounds similar to #2686. The issue is that pypdf does not make any guarantees about being thread-safe when using the same container across multiple threads. In your case, all threads seek on the input stream simultaneously (we do some sort of lazy loading here instead of reading everything on initialization), which can lead to all sorts of strange side-effects - some of which you already saw.
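
To illustrate why simultaneous seeking on one stream goes wrong, here is a simplified sketch using a plain io.BytesIO. It does not use pypdf internals. The interleaving is written out sequentially so the result is deterministic; with real threads the same ordering can happen by chance.

```python
import io

# One shared stream with a single read position.
stream = io.BytesIO(b"AAAAABBBBB")

# Task A wants bytes 0..4, task B wants bytes 5..9.
# Each task does seek() followed by read(), but the steps interleave.
stream.seek(0)               # A positions the stream at offset 0
stream.seek(5)               # B runs in between and moves the shared position
data_for_a = stream.read(5)  # A reads from where B left the stream
data_for_b = stream.read(5)  # B reads from the end of the buffer

print(data_for_a)  # b'BBBBB'
print(data_for_b)  # b''
```

Task A receives B's bytes and B receives nothing, which is the kind of inconsistent state that could plausibly surface as wrong page counts or index errors in a lazily loading parser.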

The only reliable solution here is to use one container per thread, that is, do not share PdfReader instances between multiple threads.
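
A minimal sketch of the one-reader-per-thread approach, assuming threading.local is used to cache a reader per worker thread. The helper names (reader_for_this_thread, load_page, load_pages) and the open_reader parameter are hypothetical and exist only for this example; open_reader defaults to pypdf's PdfReader.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage: each worker thread sees its own attributes here.
_thread_state = threading.local()


def reader_for_this_thread(path, open_reader=None):
    """Return a reader owned by the calling thread, creating it on first use."""
    if open_reader is None:
        from pypdf import PdfReader  # default reader; imported only when needed
        open_reader = PdfReader
    readers = getattr(_thread_state, "readers", None)
    if readers is None:
        readers = _thread_state.readers = {}
    if path not in readers:
        readers[path] = open_reader(path)
    return readers[path]


def load_page(path, index, open_reader=None):
    # The index is passed as an argument, so each task keeps its own value.
    return reader_for_this_thread(path, open_reader).pages[index]


def load_pages(path, num_pages, open_reader=None, max_workers=4):
    """Load pages 0..num_pages-1 in a thread pool without sharing a reader."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(load_page, path, i, open_reader)
            for i in range(num_pages)
        ]
        return [f.result() for f in futures]
```

Calling load_pages("bigfile.pdf", 12) would open at most one reader per worker thread. The trade-off is that each thread parses the file on its own, which costs extra time and memory for large files.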

@blairfrandeen
Author

Makes sense. It may be nice to have a non-lazy version of PdfReader for applications like this, although admittedly I don't appreciate the scope of what such a feature might entail.

I solved my own problem by simply enforcing that every page gets loaded before I start sharing the PdfReader between threads:

class EagerReader(PdfReader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _ = [p for p in self.pages]

I was already subclassing PdfReader in my own application, so it ended up being a one-line fix.

@stefan6419846
Collaborator

I suspect it is rather unlikely that we are going to have a non-lazy version of pypdf with proper testing and all the side-effects accounted for. The safest way is still to use one reader per thread.

As this is a rather specific request with no simple solution, I am going to close this for now. If there really is further interest and someone finds a way to properly tackle this, I am open to re-opening this.

@stefan6419846 closed this as not planned on Dec 30, 2024