Describe the bug
I'm trying to read a PDF file that contains both bold and normal text. The normal text is extracted correctly, but every character of the bold text is duplicated.
For example, BOLD TEXT is read as BBOOLLDD TTEEXXTT.
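To make the symptom easier to spot in extracted text, here is a minimal diagnostic sketch. It is not part of the library: `collapse_doubled_token` and `collapse_doubled_text` are hypothetical helper names, and the sketch assumes the doubling happens per character within each space-separated token, as in the example above.

```python
def collapse_doubled_token(token: str) -> str:
    """Collapse a token whose characters each appear twice in a row.

    Tokens that do not match the pattern exactly are returned unchanged.
    """
    if len(token) < 2 or len(token) % 2 != 0:
        return token
    # Check every adjacent pair: (0, 1), (2, 3), ...
    if all(token[i] == token[i + 1] for i in range(0, len(token), 2)):
        return token[::2]
    return token


def collapse_doubled_text(text: str) -> str:
    """Apply collapse_doubled_token to each space-separated token."""
    return " ".join(collapse_doubled_token(tok) for tok in text.split(" "))


print(collapse_doubled_text("BBOOLLDD TTEEXXTT"))  # BOLD TEXT
print(collapse_doubled_text("normal text"))        # normal text
```

This is only a heuristic: legitimate tokens such as "aa" or "1100" also match the pattern and would be collapsed, so it is better suited to flagging suspicious elements than to silently rewriting them.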
To Reproduce
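import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Imports and client setup are assumed; they were not shown in the original report.
s = UnstructuredClient(api_key_auth="YOUR_API_KEY")  # placeholder key

filename = "example_files/creatinine.pdf"  # cannot share this file because it contains confidential information

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
)

try:
    resp = s.general.partition(req)
    # Print one of the elements affected by the duplication.
    print(json.dumps(resp.elements[19], indent=2))
except SDKError as e:
    print(e)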
gauri-nagavkar changed the title from "bug/<short-name> Bold characters get repeated while extracting" to "bug/Bold characters get repeated while extracting" on Jan 14, 2025
Expected behavior
The output of the above code should be as follows:
{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}
But because the text >60 is bold in the PDF, the actual output looks like this:
{
  "type": "NarrativeText",
  "element_id": "681ea37fceaad7479d246b8ccc52ec2d",
  "text": ">60>60",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 2,
    "parent_id": "e72be637f803a9bf4509b64448ff1133",
    "filename": "creatinine.pdf"
  }
}
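Note that in this element the whole string is repeated (">60>60") rather than each character being doubled (which would give ">>6600"). A rough sketch of a check for that whole-string pattern is below. `repeated_half` is a hypothetical helper, not a library function, and the sample dict is trimmed from the output above.

```python
from typing import Optional


def repeated_half(text: str) -> Optional[str]:
    """Return the first half of text if text is that half written twice.

    Returns None when text is empty, has odd length, or the halves differ.
    """
    half, remainder = divmod(len(text), 2)
    if remainder == 0 and half > 0 and text[:half] == text[half:]:
        return text[:half]
    return None


# Sample element trimmed from the actual output shown above.
element = {"type": "NarrativeText", "text": ">60>60"}

candidate = repeated_half(element["text"])
if candidate is not None:
    print(f"possible duplicated text: {element['text']!r} -> {candidate!r}")
```

As with any such heuristic, genuinely repeated text (for example "byebye") would also be flagged, so this is meant for finding affected elements, not for correcting them automatically.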
Screenshots
Here's a screenshot from the PDF showing >60 in bold
Here's a screenshot of the code and the output: