format pdf reader #336

Merged
merged 10 commits into from
Jan 15, 2025

Conversation

Ceceliachenen
Collaborator

No description provided.

image_writer = FileBasedDataWriter(temp_file_path)
reader1 = FileBasedDataReader("")
Collaborator

nit: why is this named reader1? file_reader might be clearer. Also, what does the "" argument mean? A comment here would help.

max_tokens=200,
n=1,
output_content.extend(page_markdown)
markdwon_content = "\n\n".join(output_content)
Collaborator

nit: typo, markdown

markdown_content[:new_start]
+ ocr_content
+ markdown_content[new_start:]
def create_markdwon(
Collaborator

nit: typo, markdown


def process_table(self, markdown_content, json_data):
ocr_count = 0
def create_page_markdwon(
Collaborator

nit, typo, markdown

@@ -261,13 +232,15 @@ def post_process_multi_level_headings(self, json_data, md_content):
new_title = title_level + title_text
md_content = re.sub(re.escape(old_title), new_title, md_content)
Collaborator

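I think we could avoid the substitution here: keep a list of the elements (title, text, table, etc.), have title_list also store each title's index in that list, assign directly by index when processing title_list, and concat at the end.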
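
A minimal sketch of the index-based approach suggested above, in case it helps the discussion. Everything here is hypothetical and not taken from the PR: the `Element` class, the `apply_title_levels` function, and the assumption that the page content is available as a list of typed elements before it is joined into markdown.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical container for one parsed block of the document."""
    kind: str  # e.g. "title", "text", "table"
    text: str


def apply_title_levels(elements, title_levels):
    """Rewrite title elements by index, then join everything once.

    title_levels maps an element index to a heading level (1 means '#').
    """
    for index, level in title_levels.items():
        element = elements[index]
        if element.kind != "title":
            raise ValueError(f"element {index} is not a title")
        # Drop any existing '#' prefix before applying the new level.
        bare = element.text.lstrip("#").strip()
        element.text = "#" * level + " " + bare
    return "\n\n".join(e.text for e in elements)


# Example input: the second element repeats the first title's text,
# but only the indexed title elements are rewritten.
example = [
    Element("title", "# Intro"),
    Element("text", "See # Intro for background."),
    Element("title", "# Details"),
]
example_markdown = apply_title_levels(example, {0: 1, 2: 2})
```

One reason to prefer this over `re.sub` on the joined string: a substitution rewrites every occurrence of the old title text, including matches inside body text, and backslashes in the replacement string are interpreted as escapes. Assigning by index touches only the intended element.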

new_title = title_level + title_text
md_content = re.sub(re.escape(old_title), new_title, md_content)
new_title_list.append(new_title)
Collaborator

I think we could add a log here recording which old_titles were converted into which new titles; it would also make future debugging easier.
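
A rough sketch of what such a log could look like, using only the standard `logging` module. The helper name `log_title_changes` and its two parallel-list parameters are assumptions for illustration; the PR's actual variables may be shaped differently.

```python
import logging

logger = logging.getLogger(__name__)


def log_title_changes(old_titles, new_titles):
    """Log each heading that changed and return how many changed.

    Assumes old_titles and new_titles are parallel lists of strings.
    """
    changed = 0
    for old, new in zip(old_titles, new_titles):
        if old != new:
            # Per-title detail at DEBUG so normal runs stay quiet.
            logger.debug("Rewrote heading %r -> %r", old, new)
            changed += 1
    # One summary line at INFO for a quick sanity check.
    logger.info("Rewrote %d of %d headings", changed, len(old_titles))
    return changed
```

Keeping the per-title lines at DEBUG and a single summary at INFO is one possible split; the right levels depend on how noisy the project's logs already are.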

github-actions bot commented Jan 15, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| --- | --- | --- | --- | --- |
| 8603 | 4292 | 50% | 40% | 🟢 |

New Files

No new covered files...

Modified Files

| File | Coverage | Status |
| --- | --- | --- |
| src/pai_rag/integrations/nodeparsers/pai/pai_markdown_parser.py | 83% | 🟢 |
| src/pai_rag/integrations/readers/pai_pdf_reader.py | 83% | 🟢 |
| TOTAL | 83% | 🟢 |

updated for commit: 3ee631d by action🐍

@moria97 moria97 merged commit 4d75d9c into feature Jan 15, 2025
2 checks passed
@moria97 moria97 deleted the personal/ranxia/reformat_pdf_reader branch January 15, 2025 09:08