Intermittent `IndexError` when accessing `PdfReader.pages` with `ThreadPoolExecutor` #3024
Comments
Although not completely identical, this sounds similar to #2686. The issue is that pypdf does not make any guarantees about being thread-safe when the same container is used across multiple threads. In your case, all threads seek on the input stream simultaneously (pypdf loads lazily here instead of reading everything on initialization), which can lead to all sorts of strange side effects, some of which you already saw. The only reliable solution here is to use one container per thread, that is, do not share
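A minimal sketch of the "one container per thread" advice, under stated assumptions: the helper names `get_reader` and `process_pages` are hypothetical and not part of pypdf, and the reader class is passed in as a `factory` argument so that each worker thread builds and keeps its own instance in a `threading.local` cache.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage: every worker thread sees its own, separate attributes.
_local = threading.local()


def get_reader(path, factory):
    """Return the calling thread's own reader for `path` (hypothetical helper)."""
    cache = getattr(_local, "readers", None)
    if cache is None:
        cache = _local.readers = {}
    if path not in cache:
        # First use in this thread: create a reader that no other thread touches.
        cache[path] = factory(path)
    return cache[path]


def process_pages(path, page_numbers, factory, work, max_workers=4):
    """Apply `work` to each requested page, using one reader per worker thread."""

    def task(index):
        reader = get_reader(path, factory)
        return work(reader.pages[index])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the order of `page_numbers`.
        return list(pool.map(task, page_numbers))
```

With pypdf installed, a call might look like `process_pages("input.pdf", range(3), PdfReader, lambda page: page.extract_text())`, where `input.pdf` is a placeholder path. The trade-off is that the file is opened and parsed once per worker thread rather than once overall.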
Makes sense. It may be nice to have a non-lazy version. I solved my own problem by simply enforcing that every page gets loaded before I start sharing the reader:
```python
from pypdf import PdfReader


class EagerReader(PdfReader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Iterate over every page once, up front, before the reader is shared between threads.
        _ = [p for p in self.pages]
```
I was already subclassing
I suspect that it is rather unlikely that we are going to have a non-lazy version of pypdf with proper testing and all the side effects accounted for. The safest way is still to use one reader per thread. As this is a rather specific request with no simple solution, I am going to close this for now. If there really is further interest and someone finds a way to properly tackle this, I am open to re-opening this.
I have code that creates some pages of a PDF from SVG files, copies other pages from the original source, and stitches together a final PDF of marked-up (SVG) and non-marked-up pages. The conversion from SVG to PDF is done with a call to `subprocess.run`, so I run this with a `ThreadPoolExecutor`, since doing so speeds things up significantly.
I'm getting intermittent failures with `IndexError: Sequence index out of range`, so I investigated.
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
Here's a link to the PDF that I used: https://www.dropbox.com/scl/fi/xf2z8fuw2knoxqo81f6nf/archive-dwgs-1952.pdf?rlkey=vy6flraqwod8ogfjcadbyqt6z&st=daqyd3j0&dl=0
Traceback
This is the complete traceback I see:
Note that when I run this code with a large-ish file with many pages like this one I get some interesting output:
What this is telling me is that instantiating a `PdfReader()` returns a `PdfReader` object, but that object is not fully initialized in the way we would expect.
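One way the workflow described in this issue might avoid sharing a reader at all is to confine the thread pool to the external conversion calls and do all reading and stitching afterwards in the main thread. The sketch below shows only the threaded part; `run_conversions` is a hypothetical helper, and the commands passed to it stand in for whatever SVG-to-PDF converter is actually used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_conversions(commands, max_workers=4):
    """Run each external command in a worker thread and return the exit codes in order."""

    def run(command):
        # The worker threads only wait on child processes; no PDF reader is touched here.
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

Once every conversion has finished, the main thread can open the source and converted files and assemble the final PDF sequentially, so no reader object is ever accessed from more than one thread.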