binascii.Error: Non-hexadecimal digit found extracting CMap #2997

Open
neeraj9 opened this issue Dec 8, 2024 · 6 comments
Labels
is-uncaught-exception Use this label only for issues caused by broken PDF documents that cannot be recovered.

Comments

@neeraj9

neeraj9 commented Dec 8, 2024

Error extracting text from document

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

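from pypdf import PdfReader

# Setup assumed for illustration; the original report showed only the loop.
pdf_reader = PdfReader("bcbba769-386e-41fd-b858-b4ae60d691fe.pdf")
range_of_pages = range(len(pdf_reader.pages))
page_texts, page_num_without_text = [], []

for page_num in range_of_pages:
    page = pdf_reader.pages[page_num]
    page_text = page.extract_text().strip()  # raises binascii.Error for this PDF
    if not page_text:
        page_num_without_text.append(page_num + 1)  # record 1-based page number
    page_texts.append(page_text)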

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

bcbba769-386e-41fd-b858-b4ae60d691fe.pdf

Traceback

This is the complete traceback I see:

File "common\fast_pdf_util.py", line 138, in get_pdf_info
    page_text = page.extract_text()
                ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 2398, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git\pi-embedding\venv\Lib\site-packages\pypdf\_cmap.py", line 56, in build_char_map_from_dict
    encoding, map_dict = get_encoding(ft)
                         ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 129, in get_encoding
    map_dict, int_entry = _parse_to_unicode(ft)
                          ^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 222, in _parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
                                             ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 309, in process_cm_line
    parse_bfchar(line, map_dict, int_entry)
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 380, in parse_bfchar
    map_to = unhexlify(lst[1]).decode(
             ^^^^^^^^^^^^^^^^^
binascii.Error: Non-hexadecimal digit found

Additional debugging

I added a print statement in parse_bfchar, just before the failing unhexlify call:

    if lst[1] != b".":
        print(f"lst = {lst}")
        map_to = unhexlify(lst[1]).decode(
            "charmap" if len(lst[1]) < 4 else "utf-16-be", "surrogatepass"
        )  # join is here as some cases where the code was split

Output:

lst = [b'20', b'kPDF']

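The failure can be reproduced without a PDF. The sketch below copies the decoding expression from the snippet above into a hypothetical helper, `decode_bfchar_target` (the name is an assumption, not a pypdf function), and feeds it one valid target and the `kPDF` value seen in the debug output.

```python
import binascii


def decode_bfchar_target(raw: bytes) -> str:
    """Decode a bfchar target using the same expression as the snippet above."""
    return binascii.unhexlify(raw).decode(
        "charmap" if len(raw) < 4 else "utf-16-be", "surrogatepass"
    )


# A well-formed target: four hex digits, decoded as UTF-16-BE.
valid = decode_bfchar_target(b"0041")
print(repr(valid))

# The target found in this document is not hexadecimal at all.
try:
    decode_bfchar_target(b"kPDF")
except binascii.Error as exc:
    error_message = str(exc)
    print(f"binascii.Error: {error_message}")
```

So the exception comes from `binascii.unhexlify` rejecting the target bytes, before any decoding to text is attempted.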
@stefan6419846
Collaborator

The character map of the corresponding font asks to map character codes to kPDF, which is not a valid hexadecimal value and therefore just wrong:

%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin

begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering () def
/Supplement 0 def
end def

/CMapName /Adobe--000 def
/CMapType 2 def

WMode 0 def

1 begincodespacerange
 00   FF 
endcodespacerange
24 
beginbfchar

 20   kPDF 
 43   kPDF 
 57   kPDF 
 61   kPDF 
 62   kPDF 
 65   kPDF 
 68   kPDF 
 69   kPDF 
 6D   kPDF 
 6E   kPDF 
 6F   kPDF 
 70   kPDF 
 73   kPDF 
 74   kPDF 
 79   kPDF 
 F7   kPDF 
 F8   kPDF 
 F9   kPDF 
 FA   kPDF 
 FB   kPDF 
 FC   kPDF 
 FD   kPDF 
 FE   kPDF 
 FF   kPDF 

endbfchar


endcmap
CMapName currentdict /CMap
defineresource pop
end
end

Reference: Section 9.10.3 of ISO 32000-2:2020.
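To check other documents for the same defect, a minimal sketch is shown here. It assumes the tokenised form of the dump above (source code and target separated by whitespace, without angle brackets), and `find_invalid_bfchar_targets` is a hypothetical helper, not part of pypdf.

```python
import re

# One or more pairs of hex digits, nothing else.
HEX_PAIRS = re.compile(rb"(?:[0-9A-Fa-f]{2})+")


def find_invalid_bfchar_targets(lines):
    """Return (code, target) pairs whose target is not a valid hex string."""
    invalid = []
    for line in lines:
        tokens = line.split()
        if len(tokens) != 2:
            continue  # not a simple "code target" entry; ignored in this sketch
        code, target = tokens
        if HEX_PAIRS.fullmatch(target) is None:
            invalid.append((code, target))
    return invalid


sample = [b" 20   kPDF ", b" 43   kPDF ", b" 41   0041 "]
invalid_entries = find_invalid_bfchar_targets(sample)
print(invalid_entries)
```

For the CMap in this issue, every one of the 24 entries would be reported, since all of them use `kPDF` as the target.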

stefan6419846 added the is-uncaught-exception label Dec 8, 2024
@neeraj9
Author

neeraj9 commented Dec 8, 2024

Thanks for the section reference. I have not yet gone through the details, but what do you think is the best way to extract as much text as possible from documents that violate the spec like this?

@stefan6419846
Collaborator

The same as in #2996 (comment). With a broken character map, your text extraction might look wrong otherwise due to wrong/missing character replacement.
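Independent of the linked comment, one way to keep processing the remaining pages is to catch the exception per page. This is a sketch under the assumption that losing the text of an affected page is acceptable. `extract_texts_tolerantly` and `StubPage` are hypothetical names introduced here, and the function relies only on each page object having an `extract_text()` method.

```python
import binascii


def extract_texts_tolerantly(pages):
    """Return (texts, failed); failed holds (1-based page number, message)."""
    texts, failed = [], []
    for page_num, page in enumerate(pages, start=1):
        try:
            text = page.extract_text().strip()
        except binascii.Error as exc:
            # Broken character map: record the page and continue with the rest.
            failed.append((page_num, str(exc)))
            text = ""
        texts.append(text)
    return texts, failed


class StubPage:
    """Hypothetical stand-in for a page object, for demonstration only."""

    def __init__(self, text=None, error=None):
        self._text, self._error = text, error

    def extract_text(self):
        if self._error is not None:
            raise self._error
        return self._text


pages = [
    StubPage(text=" hello "),
    StubPage(error=binascii.Error("Non-hexadecimal digit found")),
]
texts, failed = extract_texts_tolerantly(pages)
print(texts, failed)
```

With pypdf, the same function could be called as `extract_texts_tolerantly(pdf_reader.pages)`. Pages listed in `failed` yield no text at all, so this avoids the crash but does not recover anything from the broken font.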

@neeraj9
Author

neeraj9 commented Dec 8, 2024

A quick read of section 9.10 in the PDF 32000-1:2008 spec / ISO 32000-2 turns up this:

"...If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a PDF processor may choose a character code of their choosing. ..."

It would be good to understand the fallback scheme, so that some text can still be recovered. Tracking down the original authors of documents (in the public domain) may not always be possible.
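As an illustration of what such a fallback could look like, here is a rough sketch. It is not pypdf's behaviour, and the choice of fallback is purely an assumption: when the target is not valid hex, the source code byte is interpreted as Latin-1, and U+FFFD is used if that also fails. `decode_target_with_fallback` is a hypothetical name.

```python
import binascii

REPLACEMENT = "\ufffd"  # Unicode replacement character


def decode_target_with_fallback(code: bytes, target: bytes) -> str:
    """Decode a bfchar target; fall back to the source code on broken input."""
    try:
        raw = binascii.unhexlify(target)
        return raw.decode(
            "charmap" if len(target) < 4 else "utf-16-be", "surrogatepass"
        )
    except (binascii.Error, UnicodeDecodeError):
        pass  # target is unusable; try the assumed fallback below
    try:
        # Assumption: treat the character code itself as a Latin-1 character.
        return binascii.unhexlify(code).decode("latin-1")
    except binascii.Error:
        return REPLACEMENT


print(repr(decode_target_with_fallback(b"43", b"0041")))  # valid target wins
print(repr(decode_target_with_fallback(b"20", b"kPDF")))  # falls back to code
print(repr(decode_target_with_fallback(b"zz", b"kPDF")))  # nothing usable
```

Whether the character code actually corresponds to the intended glyph depends on the font, so a fallback like this may still produce wrong text for a given document.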

stefan6419846 changed the title from "binascii.Error: Non-hexadecimal digit found" to "binascii.Error: Non-hexadecimal digit found extracting CMap" Dec 8, 2024
@stefan6419846
Collaborator

You are of course always invited to propose a corresponding PR.

@stefan6419846
Collaborator

I just had another look at this: even after fixing the exception, pypdf is not able to extract the correct text (apart from the footer "Aloaha PDF Suite Freeware Edition: http://www.aloaha.com"), while pdftotext can. For this reason, I am not submitting a PR for now, as handling this appropriately requires further analysis.
