'extract_text' text matrix seems to be sometimes broken with v5.1.0 #2932

Open
remi-braun opened this issue Nov 4, 2024 · 1 comment
Labels
is-regression Regression introduced as a side-effect of another change workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

remi-braun commented Nov 4, 2024

With v5.0.1, extracting text returned every word inside the bounding box. With v5.1.0, at least one word is missing.

Environment

Reproduced on both Linux and Windows with pypdf v5.1.0.
pypdf v5.0.1 has been tested and works as expected.

Code + PDF

With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf

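from pypdf import PageObject


def extract_map_text(
    page: PageObject,
    x_min: float = 0,
    x_max: float = 1,
    y_min: float = 0,
    y_max: float = 1,
    sep=";",
):
    """
    Extract the text of the given PDF page that lies inside a window.

    The window bounds are fractions (0 to 1) of the crop box's right and top coordinates.

    Args:
        page (PageObject): PDF page
        x_min (float): Lower bound on the x-axis
        x_max (float): Upper bound on the x-axis
        y_min (float): Lower bound on the y-axis
        y_max (float): Upper bound on the y-axis
        sep (str): Separator used to join the extracted text chunks

    Returns:
        str: Extracted text

    """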
    parts = []

    def visitor_right(text, cm, tm, font_dict, font_size):
        x = tm[4]
        y = tm[5]
        in_window = (
            float(x_max * float(page.cropbox.right))
            > x
            > float(x_min * float(page.cropbox.right))
        ) and (
            float(y_max * float(page.cropbox.top))
            > y
            > float(y_min * float(page.cropbox.top))
        )
        if in_window and text not in ["!", "", " "]:
            parts.append(text)

    page.extract_text(orientations=0, visitor_text=visitor_right)
    page_txt = (
        sep.join([p for p in parts if p not in ["\n"]])
        .replace("\n", " ")
        .replace("\x00", "")
        .replace("\xa0", " ")
    )
    return page_txt

Running this snippet:

extract_map_text(
    page, x_min=0.8, y_min=0.6, y_max=0.8, sep=" "
).replace("  ", " ")

With pypdf v5.1.0, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'

With pypdf v5.0.1, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'
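
A word-level diff makes the difference between the two outputs easy to spot. The sketch below uses only the standard library `difflib`; `missing_words`, `out_v501`, and `out_v510` are illustrative names, not part of pypdf.

```python
from difflib import SequenceMatcher

# Outputs copied from the two runs above
out_v510 = (
    "3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. "
    "0.9 km Flooded area 33.1 ha Potentially affected population ~ 200"
)
out_v501 = (
    "3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. "
    "Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200"
)


def missing_words(old: str, new: str) -> list[str]:
    """Return the words present in `old` but dropped from `new`, in order."""
    old_words, new_words = old.split(), new.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean old_words[i1:i2] did not survive
        if tag in ("delete", "replace"):
            missing.extend(old_words[i1:i2])
    return missing


print(missing_words(out_v501, out_v510))  # ['Road']
```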

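The word "Road" is missing. After some checks, I see that in the new version the x, y position of "Road" (tm[4], tm[5] in the visitor) is set to 0, 0, which is unexpected. With x_min=0.8, a chunk at x = 0 falls outside the window, so the visitor drops it.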
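
To narrow this down, a diagnostic visitor can record the position of every text chunk and flag those reported at the origin. This is only a minimal sketch: `make_position_logger` and `at_origin` are hypothetical helper names, and it assumes nothing beyond the visitor signature used in the snippet above (`text, cm, tm, font_dict, font_size`, with the position in `tm[4]` and `tm[5]`). The matrices in the synthetic calls are made-up placeholders, not values read from the PDF.

```python
def make_position_logger():
    """Build a visitor for page.extract_text(visitor_text=...) and the list it fills."""
    records = []  # one (text, x, y) tuple per non-blank chunk

    def visitor(text, cm, tm, font_dict, font_size):
        # Skip empty or whitespace-only chunks, keep everything else
        if text.strip():
            records.append((text, float(tm[4]), float(tm[5])))

    return visitor, records


def at_origin(records, tol=1e-6):
    """Return the chunks whose reported position is (0, 0) within `tol`."""
    return [r for r in records if abs(r[1]) <= tol and abs(r[2]) <= tol]


# Synthetic calls standing in for what pypdf would pass to the visitor
identity = [1, 0, 0, 1, 0, 0]
visitor, records = make_position_logger()
visitor("Road", identity, [1, 0, 0, 1, 0.0, 0.0], None, 8.0)
visitor("0.9 km", identity, [1, 0, 0, 1, 950.0, 420.0], None, 8.0)
print(at_origin(records))  # [('Road', 0.0, 0.0)]
```

On a real page, the same visitor would be passed as `page.extract_text(orientations=0, visitor_text=visitor)`, and `at_origin(records)` compared between v5.0.1 and v5.1.0.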

stefan6419846 (Collaborator) commented

Thanks for the report. We had some changes to the text extraction code for version 5.1.0, although I am not really sure why it would affect these positions.
