PDF parsing error #1282

Skyer19 · 2024-12-12T15:49:21Z

Hello,

When I use MinerU to parse certain PDF files, I get the following output:

INFO:datasets:PyTorch version 2.5.1 available.
import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
2024-12-12 16:35:22.202 | INFO     | magic_pdf.libs.pdf_check:detect_invalid_chars_by_pymupdf:84 - uffd_count: 0, text_len: 7, uffd_chars_radio: 0.0
2024-12-12 16:35:22.202 | WARNING  | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: True, by_img_narrow_strips: True, by_invalid_chars: True
2024-12-12 16:35:22.203 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:78 - DocAnalysis init, this may take some times, layout_model: doclayout_yolo, apply_formula: False, apply_ocr: True, apply_table: False, table_model: rapid_table, lang: None
2024-12-12 16:35:22.203 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:91 - using device: cuda
2024-12-12 16:35:22.203 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:95 - using models_dir: /xxxx/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/38e484355b9acf5654030286bf72490e27842a3c/models
2024-12-12 16:35:22.521 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:170 - DocAnalysis init done!
2024-12-12 16:35:22.521 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:131 - model init cost: 0.3190760612487793
2024-12-12 16:35:23.025 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:184 - layout detection time: 0.41
2024-12-12 16:35:25.467 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:230 - ocr time: 2.44
2024-12-12 16:35:25.467 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 0, page total time: 2.85-----
2024-12-12 16:35:25.495 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:184 - layout detection time: 0.03
2024-12-12 16:35:27.997 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:230 - ocr time: 2.5
2024-12-12 16:35:27.997 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 1, page total time: 2.53-----
2024-12-12 16:35:28.117 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:178 - gc time: 0.12
2024-12-12 16:35:28.117 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:182 - doc analyze time: 5.5, speed: 0.36 pages/second

2024-12-12 16:35:28.118 | ERROR    | __main__:pdf_parse_main:137 - 'NoneType' object is not subscriptable
Traceback (most recent call last):

  File "/xxxxx/MinerU/magic_pdf_parse_main.py", line 147, in <module>
    pdf_parse_main(file_path)
    │              └ '/xxxx/8.pdf'
    └ <function pdf_parse_main at 0x7c7c76755800>

> File "/xxxx/MinerU/magic_pdf_parse_main.py", line 127, in pdf_parse_main
    content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode='none')
                   │    │                  └ 'images'
                   │    └ <function UNIPipe.pipe_mk_uni_format at 0x7c7b5b528180>
                   └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7c7b5c89ffd0>

  File "/xxxxx/MinerU/magic_pdf/pipe/UNIPipe.py", line 56, in pipe_mk_uni_format
    result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
                                        │                └ 'none'
                                        └ 'images'

  File "/xxxx/MinerU/magic_pdf/pipe/AbsPipe.py", line 50, in pipe_mk_uni_format
    content_list = AbsPipe.mk_uni_format(self.get_compress_pdf_mid_data(), img_parent_path, drop_mode)
                   │       │             │    │                            │                └ 'none'
                   │       │             │    │                            └ 'images'
                   │       │             │    └ <function AbsPipe.get_compress_pdf_mid_data at 0x7c7beffa8540>
                   │       │             └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7c7b5c89ffd0>
                   │       └ <staticmethod(<function AbsPipe.mk_uni_format at 0x7c7beffa8b80>)>
                   └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>

  File "/xxxxx/MinerU/magic_pdf/pipe/AbsPipe.py", line 88, in mk_uni_format
    pdf_info_list = pdf_mid_data['pdf_info']
                    └ None

TypeError: 'NoneType' object is not subscriptable

I am running the program locally through the API; I did not install the package with pip.
This error only occurs when parsing some PDFs.
Operating system: Linux
Python version: 3.11

Is this a bug in the program itself or a problem with the PDF file, and is there a workaround? Thanks.

myhloli · 2024-12-12T15:58:12Z

Could you upload one of the PDF files that triggers the problem?
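
Note on the traceback: the failure is at `pdf_mid_data['pdf_info']` in `AbsPipe.mk_uni_format`, where `pdf_mid_data` is `None`. This means the parse step produced no intermediate data for this file before the output was formatted. The sketch below is a caller-side guard only and does not address the root cause. It assumes a hypothetical callable `parse_one(path)` that wraps the pipeline and returns the intermediate dict or `None`. The names `require_pdf_info`, `parse_many`, and `MidDataMissingError` are illustrative and not part of MinerU.

```python
from typing import Callable, Iterable, Optional


class MidDataMissingError(RuntimeError):
    """Hypothetical error: the parse step returned no intermediate data."""


def require_pdf_info(pdf_mid_data: Optional[dict], source: str) -> list:
    """Return pdf_mid_data['pdf_info'], or raise a descriptive error if it is None."""
    if pdf_mid_data is None:
        raise MidDataMissingError(
            f"no intermediate data for {source!r}; the parse step returned None"
        )
    return pdf_mid_data["pdf_info"]


def parse_many(paths: Iterable[str], parse_one: Callable[[str], Optional[dict]]):
    """Run parse_one on each path, recording failures instead of aborting the batch."""
    results, failed = {}, []
    for path in paths:
        try:
            results[path] = require_pdf_info(parse_one(path), path)
        except MidDataMissingError as exc:
            # Keep the path so the problematic file can be inspected or shared later.
            failed.append((path, str(exc)))
    return results, failed
```

With a guard like this, a batch run can continue past the files that fail, and the `failed` list identifies which PDFs to attach to the issue.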
