binascii.Error: Non-hexadecimal digit found extracting CMap #2997

neeraj9 · 2024-12-08T17:06:20Z

Error extracting text from document

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-11-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

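from pypdf import PdfReader

pdf_reader = PdfReader("bcbba769-386e-41fd-b858-b4ae60d691fe.pdf")
range_of_pages = range(len(pdf_reader.pages))  # assumed here: all pages
page_texts = []
page_num_without_text = []  # 1-based numbers of pages with no text

for page_num in range_of_pages:
    page = pdf_reader.pages[page_num]
    page_text = page.extract_text()  # raises binascii.Error for this file
    page_text = page_text.strip()
    if not page_text:
        page_num_without_text.append(page_num + 1)
    page_texts.append(page_text)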

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

bcbba769-386e-41fd-b858-b4ae60d691fe.pdf

Traceback

This is the complete traceback I see:

File "common\fast_pdf_util.py", line 138, in get_pdf_info
    page_text = page.extract_text()
                ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 2398, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_page.py", line 1868, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\git\pi-embedding\venv\Lib\site-packages\pypdf\_cmap.py", line 56, in build_char_map_from_dict
    encoding, map_dict = get_encoding(ft)
                         ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 129, in get_encoding
    map_dict, int_entry = _parse_to_unicode(ft)
                          ^^^^^^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 222, in _parse_to_unicode
    process_rg, process_char, multiline_rg = process_cm_line(
                                             ^^^^^^^^^^^^^^^^
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 309, in process_cm_line
    parse_bfchar(line, map_dict, int_entry)
  File "venv\Lib\site-packages\pypdf\_cmap.py", line 380, in parse_bfchar
    map_to = unhexlify(lst[1]).decode(
             ^^^^^^^^^^^^^^^^^
binascii.Error: Non-hexadecimal digit found

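Additional debugging

Printing lst just before the failing unhexlify call in parse_bfchar gives:

lst = [b'20', b'kPDF']

The print was added to the existing code as follows:

    if lst[1] != b".":
        print(f"lst = {lst}")  # added for debugging
        map_to = unhexlify(lst[1]).decode(
            "charmap" if len(lst[1]) < 4 else "utf-16-be", "surrogatepass"
        )  # join is here as some cases where the code was split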
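
The failure can be reproduced without the PDF, because the destination value kPDF contains characters that are not hexadecimal digits. A minimal sketch using only the standard library (the variable names are illustrative, not taken from pypdf):

```python
from binascii import Error as BinasciiError, unhexlify

# A well-formed bfchar destination is a hex string, e.g. 0041 for "A".
good = unhexlify(b"0041").decode("utf-16-be")

# The destination found in this document is not hexadecimal at all.
try:
    unhexlify(b"kPDF")
    error_message = None
except BinasciiError as exc:
    error_message = str(exc)

print(good, error_message)
```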


stefan6419846 · 2024-12-08T17:30:15Z

The corresponding font asks to map character codes to kPDF, which is just wrong:

%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin

begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering () def
/Supplement 0 def
end def

/CMapName /Adobe--000 def
/CMapType 2 def

WMode 0 def

1 begincodespacerange
 00   FF 
endcodespacerange
24 
beginbfchar

 20   kPDF 
 43   kPDF 
 57   kPDF 
 61   kPDF 
 62   kPDF 
 65   kPDF 
 68   kPDF 
 69   kPDF 
 6D   kPDF 
 6E   kPDF 
 6F   kPDF 
 70   kPDF 
 73   kPDF 
 74   kPDF 
 79   kPDF 
 F7   kPDF 
 F8   kPDF 
 F9   kPDF 
 FA   kPDF 
 FB   kPDF 
 FC   kPDF 
 FD   kPDF 
 FE   kPDF 
 FF   kPDF 

endbfchar


endcmap
CMapName currentdict /CMap
defineresource pop
end
end

Reference: Section 9.10.3 of ISO 32000-2:2020.
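
For illustration only, a tolerant reader of such bfchar entries could validate each destination and set aside the ones that are not hexadecimal instead of raising. The sketch below is not pypdf's implementation: parse_bfchar_entries is a hypothetical helper, it works on the whitespace-separated form shown in the dump above, and the choice of latin-1 for short destinations is an assumption made for the example.

```python
from binascii import unhexlify


def parse_bfchar_entries(lines):
    """Hypothetical helper: return (mapping, skipped) for bfchar lines.

    mapping maps integer character codes to decoded text; skipped lists
    the (source, destination) pairs that could not be decoded.
    """
    mapping = {}
    skipped = []
    for line in lines:
        tokens = line.split()
        if len(tokens) != 2:
            continue  # not a "source destination" pair
        src, dst = tokens
        try:
            code = int(src, 16)
            raw = unhexlify(dst)
            # Assumption for this sketch: 4+ hex digits are UTF-16BE,
            # shorter destinations are treated as single bytes.
            text = raw.decode("utf-16-be" if len(dst) >= 4 else "latin-1")
        except ValueError:
            # binascii.Error and UnicodeDecodeError both subclass ValueError.
            skipped.append((src, dst))
            continue
        mapping[code] = text
    return mapping, skipped


mapping, skipped = parse_bfchar_entries([" 20   kPDF ", " 43   0043 ", " 61   61 "])
print(mapping, skipped)
```

Skipping an entry only avoids the exception; the affected character codes still have no Unicode mapping, so the extracted text for them would be missing or wrong.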

neeraj9 · 2024-12-08T17:39:24Z

Thanks for the section reference. I have yet to go through the details, but what do you think is the best way forward to extract whatever text is possible from documents that violate the spec like this?

stefan6419846 · 2024-12-08T17:41:52Z

The same as in #2996 (comment). With a broken character map, your text extraction might look wrong otherwise due to wrong/missing character replacement.

neeraj9 · 2024-12-08T17:49:45Z

From a quick read of section 9.10 in the PDF spec (PDF 32000-1:2008 / ISO 32000-2):

"...If these methods fail to produce a Unicode value, there is no way to determine what the character code
represents in which case a PDF processor may choose a character code of their choosing. ..."

It would be good to understand the fallback scheme, so that some text recovery is possible rather than having to track down the original authors of (public-domain) documents, which may not always be possible.
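
Until a fallback exists in the library, one caller-side option is to catch the error per page so that a single broken font does not abort the whole document. This is a sketch under assumptions, not a pypdf feature: extract_text_per_page is a hypothetical helper that accepts any objects exposing an extract_text() method, such as pypdf page objects.

```python
import binascii


def extract_text_per_page(pages):
    """Hypothetical helper: return (texts, failed_page_numbers).

    Pages whose extraction raises binascii.Error contribute an empty
    string and have their 1-based page number recorded in the second list.
    """
    texts = []
    failed = []
    for number, page in enumerate(pages, start=1):
        try:
            text = page.extract_text().strip()
        except binascii.Error:
            text = ""
            failed.append(number)
        texts.append(text)
    return texts, failed
```

With pypdf this could be called as extract_text_per_page(PdfReader("file.pdf").pages), where "file.pdf" is a placeholder path. Text on the failing pages is lost entirely, so this only limits the damage; it does not recover anything from the broken character map.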

stefan6419846 · 2024-12-08T18:51:35Z

You are of course always invited to propose a corresponding PR.

stefan6419846 · 2024-12-19T14:43:04Z

I just had another look at this: pypdf is not able to extract the correct text after fixing (apart from the footer "Aloaha PDF Suite Freeware Edition: http://www.aloaha.com"), while pdftotext can. For this reason, I am not submitting a PR for now as this requires further analysis to appropriately handle this.

stefan6419846 added the is-uncaught-exception label (use this label only for issues caused by broken PDF documents that cannot be recovered) on Dec 8, 2024

stefan6419846 changed the title from "binascii.Error: Non-hexadecimal digit found" to "binascii.Error: Non-hexadecimal digit found extracting CMap" on Dec 8, 2024
