PDF parsing error #1282

Open
Skyer19 opened this issue Dec 12, 2024 · 1 comment

Skyer19 commented Dec 12, 2024

Hello,

When I use MinerU to parse certain PDF files, I get the following output:

INFO:datasets:PyTorch version 2.5.1 available.
import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
2024-12-12 16:35:22.202 | INFO     | magic_pdf.libs.pdf_check:detect_invalid_chars_by_pymupdf:84 - uffd_count: 0, text_len: 7, uffd_chars_radio: 0.0
2024-12-12 16:35:22.202 | WARNING  | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: True, by_img_narrow_strips: True, by_invalid_chars: True
2024-12-12 16:35:22.203 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:78 - DocAnalysis init, this may take some times, layout_model: doclayout_yolo, apply_formula: False, apply_ocr: True, apply_table: False, table_model: rapid_table, lang: None
2024-12-12 16:35:22.203 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:91 - using device: cuda
2024-12-12 16:35:22.203 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:95 - using models_dir: /xxxx/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/38e484355b9acf5654030286bf72490e27842a3c/models
2024-12-12 16:35:22.521 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:170 - DocAnalysis init done!
2024-12-12 16:35:22.521 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:131 - model init cost: 0.3190760612487793
2024-12-12 16:35:23.025 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:184 - layout detection time: 0.41
2024-12-12 16:35:25.467 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:230 - ocr time: 2.44
2024-12-12 16:35:25.467 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 0, page total time: 2.85-----
2024-12-12 16:35:25.495 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:184 - layout detection time: 0.03
2024-12-12 16:35:27.997 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:230 - ocr time: 2.5
2024-12-12 16:35:27.997 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 1, page total time: 2.53-----
2024-12-12 16:35:28.117 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:178 - gc time: 0.12
2024-12-12 16:35:28.117 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:182 - doc analyze time: 5.5, speed: 0.36 pages/second
2024-12-12 16:35:28.118 | ERROR    | __main__:pdf_parse_main:137 - 'NoneType' object is not subscriptable
Traceback (most recent call last):

  File "/xxxxx/MinerU/magic_pdf_parse_main.py", line 147, in <module>
    pdf_parse_main(file_path)
    │              └ '/xxxx/8.pdf'
    └ <function pdf_parse_main at 0x7c7c76755800>

> File "/xxxx/MinerU/magic_pdf_parse_main.py", line 127, in pdf_parse_main
    content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode='none')
                   │    │                  └ 'images'
                   │    └ <function UNIPipe.pipe_mk_uni_format at 0x7c7b5b528180>
                   └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7c7b5c89ffd0>

  File "/xxxxx/MinerU/magic_pdf/pipe/UNIPipe.py", line 56, in pipe_mk_uni_format
    result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
                                        │                └ 'none'
                                        └ 'images'

  File "/xxxx/MinerU/magic_pdf/pipe/AbsPipe.py", line 50, in pipe_mk_uni_format
    content_list = AbsPipe.mk_uni_format(self.get_compress_pdf_mid_data(), img_parent_path, drop_mode)
                   │       │             │    │                            │                └ 'none'
                   │       │             │    │                            └ 'images'
                   │       │             │    └ <function AbsPipe.get_compress_pdf_mid_data at 0x7c7beffa8540>
                   │       │             └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7c7b5c89ffd0>
                   │       └ <staticmethod(<function AbsPipe.mk_uni_format at 0x7c7beffa8b80>)>
                   └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>

  File "/xxxxx/MinerU/magic_pdf/pipe/AbsPipe.py", line 88, in mk_uni_format
    pdf_info_list = pdf_mid_data['pdf_info']
                    └ None

TypeError: 'NoneType' object is not subscriptable
  • I run the program locally through the API; I did not install the package with pip.
  • The error only occurs when parsing some PDFs.
  • Operating system: Linux
  • Python version: 3.11

Is this a bug in the program itself or a problem with the PDF file, and is there a workaround? Thank you.
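
As a stopgap on the caller's side, the `None` value shown in the traceback can be checked before it is indexed. This is only a minimal sketch under assumptions: `extract_pdf_info` is a hypothetical helper, not part of MinerU, and it does not explain why `pdf_mid_data` ends up as `None` for these PDFs.

```python
from typing import Any, Dict, List, Optional


def extract_pdf_info(pdf_mid_data: Optional[Dict[str, Any]]) -> List[Any]:
    """Hypothetical helper (not a MinerU function).

    Returns the 'pdf_info' list, or an empty list when the intermediate
    data is missing, instead of raising TypeError.
    """
    if pdf_mid_data is None:
        # This is the condition that leads to the TypeError in the traceback.
        return []
    return pdf_mid_data.get("pdf_info") or []


# Reproduce the reported failure on a stand-in value.
mid_data = None
try:
    mid_data["pdf_info"]
except TypeError as exc:
    error_message = str(exc)

# The guarded version returns an empty list for the same input.
safe_result = extract_pdf_info(mid_data)
```

A guard like this only lets a batch run skip the affected files; whether the empty result comes from the PDF or from the pipeline still has to be determined separately.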

Collaborator

myhloli commented Dec 12, 2024

Could you upload one of the PDF files that triggers the problem?
