format pdf reader #336

Ceceliachenen · 2025-01-10T03:06:08Z

No description provided.

moria97 · 2025-01-10T08:51:21Z

src/pai_rag/integrations/readers/pai_pdf_reader.py

                image_writer = FileBasedDataWriter(temp_file_path)
+                reader1 = FileBasedDataReader("")


nit: why is this named reader1? I think file_reader would be clearer. Also, what does the "" argument mean? A comment explaining it would help.

moria97 · 2025-01-10T08:52:11Z

src/pai_rag/integrations/readers/pai_pdf_reader.py

-            max_tokens=200,
-            n=1,
+            output_content.extend(page_markdown)
+        markdwon_content = "\n\n".join(output_content)


nit: typo, markdwon_content should be markdown_content.

moria97 · 2025-01-10T08:52:30Z

src/pai_rag/integrations/readers/pai_pdf_reader.py

-                markdown_content[:new_start]
-                + ocr_content
-                + markdown_content[new_start:]
+    def create_markdwon(


nit: typo, create_markdwon should be create_markdown.

moria97 · 2025-01-10T08:52:43Z

src/pai_rag/integrations/readers/pai_pdf_reader.py


-    def process_table(self, markdown_content, json_data):
-        ocr_count = 0
+    def create_page_markdwon(


nit: typo, create_page_markdwon should be create_page_markdown.

moria97 · 2025-01-10T09:05:51Z

src/pai_rag/integrations/readers/pai_pdf_reader.py

@@ -261,13 +232,15 @@ def post_process_multi_level_headings(self, json_data, md_content):
            new_title = title_level + title_text
            md_content = re.sub(re.escape(old_title), new_title, md_content)


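I think we can skip the string replacement here. Keep a list of the elements (title, text, table, and so on), and have title_list also store each title's index in that list. When processing title_list, look up the index and assign the new title directly, then concat everything at the end.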
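
A minimal sketch of the index-based approach suggested above, not the code in this PR. The names assign_heading_levels, elements, and title_levels are hypothetical, and the element shape (a dict with "type" and "text" keys) is an assumption. Since re.sub replaces every occurrence of the old title string, assigning by index would also avoid touching body text that happens to repeat a title.

```python
from typing import Dict, List


def assign_heading_levels(
    elements: List[Dict[str, str]],
    title_levels: Dict[int, int],
) -> str:
    """Rewrite title elements by index, then join all elements into markdown.

    elements: ordered page elements, each {"type": ..., "text": ...} (assumed shape).
    title_levels: maps an element index to its resolved heading level (1 = "#").
    """
    for index, level in title_levels.items():
        element = elements[index]
        if element["type"] != "title":
            raise ValueError(f"element {index} is not a title")
        # Drop any existing heading markers, then apply the resolved level.
        bare_text = element["text"].lstrip("# ").strip()
        element["text"] = "#" * level + " " + bare_text
    # Concatenate once at the end instead of patching a markdown string.
    return "\n\n".join(element["text"] for element in elements)


elements = [
    {"type": "title", "text": "# Intro"},
    {"type": "text", "text": "Body"},
    {"type": "title", "text": "# Details"},
]
markdown = assign_heading_levels(elements, {0: 1, 2: 2})
```

Only the element at a stored index changes, so two titles with identical text can still receive different levels.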

moria97 · 2025-01-14T12:28:22Z

src/pai_rag/integrations/readers/pai_pdf_reader.py

            new_title = title_level + title_text
-            md_content = re.sub(re.escape(old_title), new_title, md_content)
+            new_title_list.append(new_title)


Maybe add a log here showing which old titles were converted into which new titles. That would also make debugging easier later.
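
One way the suggested logging could look, as a sketch using only the standard logging module. The helper name log_title_changes and its old_titles/new_titles parameters are hypothetical; the real reader may collect titles differently.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def log_title_changes(old_titles: List[str], new_titles: List[str]) -> int:
    """Log each heading that was rewritten and return how many changed."""
    changed = 0
    for old_title, new_title in zip(old_titles, new_titles):
        if old_title != new_title:
            # %r keeps leading "#" markers and whitespace visible in the log.
            logger.debug("Rewrote heading %r -> %r", old_title, new_title)
            changed += 1
    logger.info("Adjusted %d of %d headings", changed, len(old_titles))
    return changed
```

Per-title messages go to DEBUG so normal runs stay quiet, while the summary count is logged at INFO.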

github-actions · 2025-01-15T02:35:59Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
8603	4292	50%	40%	🟢

New Files

No new covered files...

Modified Files

File	Coverage	Status
src/pai_rag/integrations/nodeparsers/pai/pai_markdown_parser.py	83%	🟢
src/pai_rag/integrations/readers/pai_pdf_reader.py	83%	🟢
TOTAL	83%	🟢

updated for commit: 3ee631d by action🐍

format pdf reader

d093d59

moria97 reviewed Jan 10, 2025

Ceceliachenen added 2 commits January 14, 2025 17:31

reformat pdf reader

8fe3c94

update format

a7f33a0

moria97 approved these changes Jan 14, 2025

Ceceliachenen added 4 commits January 14, 2025 20:54

update format

c6f1d56

fix test

d6ebf47

Merge branch 'feature' into personal/ranxia/reformat_pdf_reader

9573305

remove empty node

cdb6813

Ceceliachenen added 3 commits January 15, 2025 11:20

update test

33be251

fix md parser

47a0814

fix test

3ee631d

moria97 merged commit 4d75d9c into feature Jan 15, 2025
2 checks passed

moria97 deleted the personal/ranxia/reformat_pdf_reader branch January 15, 2025 09:08
