Hello,
When I use MinerU to parse some PDF files, I get the following output:
INFO:datasets:PyTorch version 2.5.1 available.
import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
2024-12-12 16:35:22.202 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars_by_pymupdf:84 - uffd_count: 0, text_len: 7, uffd_chars_radio: 0.0
2024-12-12 16:35:22.202 | WARNING | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: True, by_img_narrow_strips: True, by_invalid_chars: True
2024-12-12 16:35:22.203 | INFO | magic_pdf.model.pdf_extract_kit:__init__:78 - DocAnalysis init, this may take some times, layout_model: doclayout_yolo, apply_formula: False, apply_ocr: True, apply_table: False, table_model: rapid_table, lang: None
2024-12-12 16:35:22.203 | INFO | magic_pdf.model.pdf_extract_kit:__init__:91 - using device: cuda
2024-12-12 16:35:22.203 | INFO | magic_pdf.model.pdf_extract_kit:__init__:95 - using models_dir: /xxxx/.cache/huggingface/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/38e484355b9acf5654030286bf72490e27842a3c/models
2024-12-12 16:35:22.521 | INFO | magic_pdf.model.pdf_extract_kit:__init__:170 - DocAnalysis init done!
2024-12-12 16:35:22.521 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:131 - model init cost: 0.3190760612487793
2024-12-12 16:35:23.025 | INFO | magic_pdf.model.pdf_extract_kit:__call__:184 - layout detection time: 0.41
2024-12-12 16:35:25.467 | INFO | magic_pdf.model.pdf_extract_kit:__call__:230 - ocr time: 2.44
2024-12-12 16:35:25.467 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 0, page total time: 2.85-----
2024-12-12 16:35:25.495 | INFO | magic_pdf.model.pdf_extract_kit:__call__:184 - layout detection time: 0.03
2024-12-12 16:35:27.997 | INFO | magic_pdf.model.pdf_extract_kit:__call__:230 - ocr time: 2.5
2024-12-12 16:35:27.997 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:168 - -----page_id : 1, page total time: 2.53-----
2024-12-12 16:35:28.117 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:178 - gc time: 0.12
2024-12-12 16:35:28.117 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:182 - doc analyze time: 5.5, speed: 0.36 pages/second
2024-12-12 16:35:28.118 | ERROR | __main__:pdf_parse_main:137 - 'NoneType' object is not subscriptable
Traceback (most recent call last):
  File "/xxxxx/MinerU/magic_pdf_parse_main.py", line 147, in <module>
    pdf_parse_main(file_path)
    │              └ '/xxxx/8.pdf'
    └ <function pdf_parse_main at 0x7c7c76755800>
> File "/xxxx/MinerU/magic_pdf_parse_main.py", line 127, in pdf_parse_main
    content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode='none')
                   │    │                  └ 'images'
                   │    └ <function UNIPipe.pipe_mk_uni_format at 0x7c7b5b528180>
                   └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7c7b5c89ffd0>
  File "/xxxxx/MinerU/magic_pdf/pipe/UNIPipe.py", line 56, in pipe_mk_uni_format
    result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
                                        │                └ 'none'
                                        └ 'images'
  File "/xxxx/MinerU/magic_pdf/pipe/AbsPipe.py", line 50, in pipe_mk_uni_format
    content_list = AbsPipe.mk_uni_format(self.get_compress_pdf_mid_data(), img_parent_path, drop_mode)
                   │       │             │    │                            │                └ 'none'
                   │       │             │    │                            └ 'images'
                   │       │             │    └ <function AbsPipe.get_compress_pdf_mid_data at 0x7c7beffa8540>
                   │       │             └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7c7b5c89ffd0>
                   │       └ <staticmethod(<function AbsPipe.mk_uni_format at 0x7c7beffa8b80>)>
                   └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>
  File "/xxxxx/MinerU/magic_pdf/pipe/AbsPipe.py", line 88, in mk_uni_format
    pdf_info_list = pdf_mid_data['pdf_info']
                    └ None
TypeError: 'NoneType' object is not subscriptable
Is this a bug in the program itself or a problem with the PDF file, and is there a workaround? Thanks.
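For reference, the failing statement in the traceback is the lookup `pdf_mid_data['pdf_info']`, reached while `pdf_mid_data` is `None`. Below is a minimal sketch of how that lookup could be guarded. It assumes a hypothetical helper named `extract_pdf_info`, which is not part of MinerU, and the dictionaries passed to it are stand-ins rather than real parser output. A guard like this would only avoid the crash; it would not explain why no intermediate data was produced for these files.

```python
from typing import Any, Optional


def extract_pdf_info(pdf_mid_data: Optional[dict[str, Any]]) -> list[Any]:
    """Return the 'pdf_info' list, or an empty list if there is no data.

    Hypothetical helper for illustration; MinerU does not define it.
    """
    if pdf_mid_data is None:
        # This is the condition that raises TypeError in the traceback.
        return []
    # Fall back to an empty list if the key is absent as well.
    return pdf_mid_data.get("pdf_info", [])


# Stand-in inputs, not real MinerU intermediate data.
print(extract_pdf_info(None))
print(extract_pdf_info({"pdf_info": [{"page": 0}]}))
```

With a check like this, the caller could skip or log the affected file and continue with the rest of a batch.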
Could you upload one of the PDF files that triggers the problem?