bug/Bold characters get repeated while extracting #3864

Open
gauri-nagavkar opened this issue Jan 14, 2025 · 0 comments
Labels
bug Something isn't working


Describe the bug
I'm trying to read a PDF file that contains bold and normal text. The normal text is read correctly, but every character of the bold text is repeated.

For example, BOLD TEXT is read as BBOOLLDD TTEEXXTT.
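To help triage, here is a minimal sketch of a check for this per-character doubling. The helper names `looks_char_doubled` and `collapse_char_doubling` are hypothetical and are not part of the unstructured library. The sketch only post-processes strings; it does not fix the extraction itself.

```python
def looks_char_doubled(text: str) -> bool:
    """Return True if every whitespace-separated word consists of identical adjacent pairs."""
    words = text.split()
    if not words:
        return False
    for word in words:
        # A doubled word must have even length, e.g. "BBOOLLDD".
        if len(word) % 2 != 0:
            return False
        # Compare each character at an even index with its right neighbour.
        if any(word[i] != word[i + 1] for i in range(0, len(word), 2)):
            return False
    return True


def collapse_char_doubling(text: str) -> str:
    """Keep one character of each pair if the text looks doubled; otherwise return it unchanged."""
    if not looks_char_doubled(text):
        return text
    # Note: runs of whitespace are normalized to single spaces.
    return " ".join(word[::2] for word in text.split())


print(collapse_char_doubling("BBOOLLDD TTEEXXTT"))
```

This heuristic can produce false positives on short strings that legitimately consist of repeated pairs (for example "oo"), so it is better suited to flagging suspicious elements than to rewriting text blindly.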

To Reproduce

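import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Client setup was omitted from the report; the API key is a placeholder.
s = UnstructuredClient(api_key_auth="YOUR_API_KEY")

filename = "example_files/creatinine.pdf"  # cannot share this file because it contains confidential information
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
)
try:
    resp = s.general.partition(req)
    # Element 19 is the element whose output is shown below.
    print(json.dumps(resp.elements[19], indent=2))
except SDKError as e:
    print(e)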

Expected behavior
The output of the above code should be as follows:

{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}

But because the text >60 is bold in the PDF, the output looks like this:

{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60>60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}
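Note that in this output the whole string is repeated (>60>60) rather than each character individually. A rough sketch for scanning a response for such elements follows. It assumes the elements are plain dicts with a "text" key, as the JSON above suggests; `looks_repeated_whole` and `flag_repeated_elements` are hypothetical helper names, not library functions.

```python
def looks_repeated_whole(text: str) -> bool:
    """Return True if the text is exactly two back-to-back copies of the same string."""
    half, remainder = divmod(len(text), 2)
    return remainder == 0 and half > 0 and text[:half] == text[half:]


def flag_repeated_elements(elements):
    """Return (index, text) pairs for elements whose text looks like a doubled string."""
    return [
        (index, element["text"])
        for index, element in enumerate(elements)
        if looks_repeated_whole(element.get("text", ""))
    ]


# Hypothetical sample data shaped like the elements shown above.
sample_elements = [
    {"type": "NarrativeText", "text": ">60"},
    {"type": "NarrativeText", "text": ">60>60"},
]
print(flag_repeated_elements(sample_elements))
```

With the reproduction script, this could be called as `flag_repeated_elements(resp.elements)` to list every affected element instead of checking index 19 by hand. Short strings such as "aa" would also be flagged, so treat the result as a list of candidates to inspect.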

Screenshots
Here's a screenshot from the PDF showing >60 in bold:
image

Here's a screenshot of the code and the output:
image

gauri-nagavkar added the bug label on Jan 14, 2025
gauri-nagavkar changed the title from "bug/<short-name> Bold characters get repeated while extracting" to "bug/Bold characters get repeated while extracting" on Jan 14, 2025