'extract_text' text matrix seems to be sometimes broken with v5.1.0 #2932
Labels
is-regression
Regression introduced as a side-effect of another change
workflow-text-extraction
From a user's perspective, text extraction is the affected feature/workflow
Text extraction used to return all words; now at least one word inside the bounding box is missing.
Environment
Both Linux and Windows.
v5.0.1 has been tested and is fine.
Code + PDF
With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf
Running this snippet:
With pypdf v5.1.0, the output is:
With pypdf v5.0.1, the output is:
The word "Road" is missing. After some checks, I see that in the new version the x, y coordinates for "Road" are set to 0, 0, which is unexpected.
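For anyone trying to reproduce this, here is a minimal sketch of bounding-box filtering with a `visitor_text` callback. It is not the original snippet from this report. The helper names `make_bbox_visitor` and `extract_in_bbox` and the box layout `(x_min, y_min, x_max, y_max)` are assumptions made for illustration, and the sketch reads the position only from the text matrix (`tm[4]`, `tm[5]`), ignoring `cm`. A fragment reported at 0, 0 falls outside any box that does not contain the origin, which would match the symptom described above.

```python
from typing import Callable, List, Tuple

# Hypothetical box layout for this sketch: (x_min, y_min, x_max, y_max)
BBox = Tuple[float, float, float, float]
Hit = Tuple[str, float, float]


def make_bbox_visitor(bbox: BBox, collected: List[Hit]) -> Callable:
    """Build a visitor_text callback that keeps text positioned inside bbox."""
    x_min, y_min, x_max, y_max = bbox

    def visitor(text, cm, tm, font_dict, font_size):
        # The translation part of the text matrix holds the x, y position.
        x, y = tm[4], tm[5]
        if text.strip() and x_min <= x <= x_max and y_min <= y <= y_max:
            collected.append((text, x, y))

    return visitor


def extract_in_bbox(pdf_path: str, bbox: BBox, page_index: int = 0) -> List[Hit]:
    """Run pypdf text extraction on one page and return the fragments in bbox."""
    from pypdf import PdfReader  # imported lazily so the visitor can be tested alone

    collected: List[Hit] = []
    page = PdfReader(pdf_path).pages[page_index]
    page.extract_text(visitor_text=make_bbox_visitor(bbox, collected))
    return collected
```

Printing `tm` for every fragment (with no box filter) under both versions should show whether "Road" is the only fragment whose text matrix changed.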