binascii.Error: Non-hexadecimal digit found extracting CMap #2997
Comments
The corresponding font asks to map character codes to
Reference: Section 9.10.3 of ISO 32000-2:2020.
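For context, the message in the issue title is what the standard library's `binascii.unhexlify` raises when a hex string contains a character outside `0-9a-fA-F`. A minimal reproduction follows; the sample strings are made up for illustration and are not taken from the attached PDF, and `try_unhexlify` is a hypothetical helper, not part of pypdf.

```python
import binascii


def try_unhexlify(hex_string: str):
    """Return (decoded bytes, None) on success or (None, error message) on failure."""
    try:
        return binascii.unhexlify(hex_string), None
    except binascii.Error as exc:
        return None, str(exc)


# A well-formed hex string decodes normally.
valid = try_unhexlify("0041")
# "G" is not a hexadecimal digit, so decoding fails.
invalid = try_unhexlify("00G1")

print(valid)
print(invalid)
```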
Thanks for the section reference. I have yet to go through the details, but what do you think is the best way forward to extract as much text as possible from documents that violate the specification in this way?
The same as in #2996 (comment). With a broken character map, your text extraction might look wrong otherwise due to wrong/missing character replacement.
A quick read of section 9.10 in PDF 32000-1:2008 (spec) / ISO 32000-2: "... If these methods fail to produce a Unicode value, there is no way to determine what the character code ..." It would be good to understand the fallback scheme so that some text recovery is possible, rather than tracking down the original authors of documents (in the public domain), which may not always be possible.
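One possible shape for such a fallback, as a rough sketch only: decode each mapping entry defensively and skip the ones that fail, so the remaining entries can still be used. This is not how pypdf implements CMap parsing; `decode_hex_token`, `parse_bfchar_lines`, and the sample entries are hypothetical, and the input is assumed to be simple two-token `bfchar`-style lines.

```python
import binascii
from typing import Dict, Iterable, Optional


def decode_hex_token(token: str) -> Optional[bytes]:
    """Decode a <...> hex token; return None instead of raising on bad input."""
    cleaned = token.strip().strip("<>").replace(" ", "")
    try:
        return binascii.unhexlify(cleaned)
    except binascii.Error:
        return None


def parse_bfchar_lines(lines: Iterable[str]) -> Dict[int, str]:
    """Build a character code -> text map, skipping entries that cannot be decoded."""
    mapping: Dict[int, str] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "<src> <dst>" pair
        src = decode_hex_token(parts[0])
        dst = decode_hex_token(parts[1])
        if src is None or dst is None:
            continue  # broken entry: leave this code unmapped
        try:
            mapping[int.from_bytes(src, "big")] = dst.decode("utf-16-be")
        except UnicodeDecodeError:
            continue  # destination is not valid UTF-16BE
    return mapping


# Hypothetical entries: the second one contains non-hexadecimal digits.
sample = ["<0041> <0041>", "<0042> <00ZZ>", "<0043> <0043>"]
recovered = parse_bfchar_lines(sample)
print(recovered)
```

Codes whose entries are skipped stay unmapped, so the extracted text can still contain wrong or missing characters, as noted in the comment above.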
You are of course always invited to propose a corresponding PR.
I just had another look at this: pypdf is not able to extract the correct text after fixing (apart from the footer "Aloaha PDF Suite Freeware Edition: http://www.aloaha.com"), while
Error extracting text from document
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
bcbba769-386e-41fd-b858-b4ae60d691fe.pdf
Traceback
This is the complete traceback I see:
Additional debugging.