'extract_text' text matrix seems to be sometimes broken with v5.1.0 #2932

remi-braun · 2024-11-04T15:56:33Z

Text extraction with a visitor used to return every word inside a given bounding box. With v5.1.0, at least one word is missing from that box.

Environment

Both Linux and Windows.
v5.0.1 has been tested and is fine.

Code + PDF

With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf

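from pypdf import PageObject


def extract_map_text(
    page: PageObject,
    x_min: float = 0,
    x_max: float = 1,
    y_min: float = 0,
    y_max: float = 1,
    sep=";",
):
    """
    Extract the text located inside a window of the given PDF page.

    Args:
        page (PageObject): PDF page
        x_min (float): Lower x bound of the window, as a fraction of page.cropbox.right
        x_max (float): Upper x bound of the window, as a fraction of page.cropbox.right
        y_min (float): Lower y bound of the window, as a fraction of page.cropbox.top
        y_max (float): Upper y bound of the window, as a fraction of page.cropbox.top
        sep (str): Separator used to join the extracted text parts

    Returns:
        str: Extracted text

    """
    parts = []

    def visitor_right(text, cm, tm, font_dict, font_size):
        # Position of the text chunk, read from the text matrix
        x = tm[4]
        y = tm[5]
        # Keep only the chunks strictly inside the requested window
        in_window = (
            float(x_max * float(page.cropbox.right))
            > x
            > float(x_min * float(page.cropbox.right))
        ) and (
            float(y_max * float(page.cropbox.top))
            > y
            > float(y_min * float(page.cropbox.top))
        )
        if in_window and text not in ["!", "", " "]:
            parts.append(text)

    page.extract_text(orientations=0, visitor_text=visitor_right)
    # Join the parts, dropping bare newlines and cleaning special characters
    page_txt = (
        sep.join([p for p in parts if p not in ["\n"]])
        .replace("\n", " ")
        .replace("\x00", "")
        .replace("\xa0", " ")
    )
    return page_txt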

Running this snippet:

extract_map_text(
    page, x_min=0.8, y_min=0.6, y_max=0.8, sep=" "
).replace("  ", " ")

With pypdf v5.1.0, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'

With pypdf v5.0.1, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'
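One way to pinpoint what differs between the two outputs is a word-level multiset difference. This is only a sketch: `missing_words` is a hypothetical helper, not part of pypdf, and the two strings are the outputs quoted above.

```python
from collections import Counter


def missing_words(old_output: str, new_output: str) -> list:
    """Return the words that occur more often in old_output than in new_output."""
    diff = Counter(old_output.split()) - Counter(new_output.split())
    return sorted(diff.elements())


out_v501 = (
    "3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. "
    "Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200"
)
out_v510 = (
    "3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. "
    "0.9 km Flooded area 33.1 ha Potentially affected population ~ 200"
)

print(missing_words(out_v501, out_v510))  # ['Road']
```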

The word "Road" is missing. After some checks, I see that in the new version the position x, y (tm[4], tm[5]) reported for "Road" is 0, 0, so it falls outside the requested window and gets filtered out. This is really weird.
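To check which chunks are reported at the origin, a recording visitor along these lines might help. It is a minimal sketch: `make_recording_visitor` and `at_origin` are hypothetical helper names, and the only pypdf behaviour assumed is the `visitor_text` callback signature already used in the snippet above.

```python
from typing import Any, Callable, List, Tuple

Record = Tuple[str, float, float]


def make_recording_visitor() -> Tuple[Callable[..., None], List[Record]]:
    """Return a visitor and the list it fills with (text, x, y) tuples."""
    records: List[Record] = []

    def visitor(text: str, cm: Any, tm: Any, font_dict: Any, font_size: Any) -> None:
        # tm[4] and tm[5] are the translation components of the text matrix
        records.append((text, float(tm[4]), float(tm[5])))

    return visitor, records


def at_origin(records: List[Record], tol: float = 1e-6) -> List[Record]:
    """Keep the non-blank chunks whose reported position is (0, 0)."""
    return [
        r for r in records
        if r[0].strip() and abs(r[1]) <= tol and abs(r[2]) <= tol
    ]


# Intended use with a pypdf page (not executed here):
# visitor, records = make_recording_visitor()
# page.extract_text(orientations=0, visitor_text=visitor)
# print(at_origin(records))
```

Running this under both versions and comparing the `at_origin` results would show whether "Road" is the only chunk affected.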


stefan6419846 · 2024-11-04T16:18:05Z

Thanks for the report. We had some changes to the text extraction code for version 5.1.0, although I am not really sure why it would affect these positions.

remi-braun changed the title ~~extract_text~~ 'extract_text' text matrix seems to be sometimes broken with v5.1.0 Nov 4, 2024

stefan6419846 added the workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow) and is-regression (regression introduced as a side effect of another change) labels Nov 4, 2024
