format pdf reader #336
image_writer = FileBasedDataWriter(temp_file_path)
reader1 = FileBasedDataReader("")
nit: why is this named reader1? I think it could be called file_reader. Also, what does the "" argument mean? A comment explaining it would help.
max_tokens=200,
n=1,
output_content.extend(page_markdown)
markdwon_content = "\n\n".join(output_content)
nit: typo, markdwon_content should be markdown_content.
markdown_content[:new_start]
+ ocr_content
+ markdown_content[new_start:]
def create_markdwon(
nit: typo, create_markdwon should be create_markdown.
def process_table(self, markdown_content, json_data):
ocr_count = 0
def create_page_markdwon(
nit: typo, create_page_markdwon should be create_page_markdown.
@@ -261,13 +232,15 @@ def post_process_multi_level_headings(self, json_data, md_content):
new_title = title_level + title_text
md_content = re.sub(re.escape(old_title), new_title, md_content)
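I think we could skip the replacement here: keep the list of elements (title, text, table, etc.), have title_list also store each title's index in that list, assign directly by index when processing title_list, and concat at the end.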
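A rough sketch of what this index-based approach might look like, in case it helps the discussion. The function name, the `(kind, text)` tuple shape, and the `title_levels` argument are all hypothetical and not taken from the PR; the point is only that assigning by index touches exactly one element, whereas `re.sub` on the whole `md_content` replaces every occurrence of the title text.

```python
# Hypothetical sketch: names and data shapes are assumptions, not the PR's code.

def render_with_title_levels(elements, title_levels):
    """Rebuild markdown from an element list, assigning headings by index.

    elements: list of (kind, text) tuples, kind in {"title", "text", "table"}.
    title_levels: heading prefixes (e.g. "# ", "## "), one per title,
        in document order.
    """
    # Remember where each title sits in the element list.
    title_list = [
        (index, text)
        for index, (kind, text) in enumerate(elements)
        if kind == "title"
    ]
    if len(title_list) != len(title_levels):
        raise ValueError("expected exactly one level per title")

    rendered = [text for _, text in elements]
    for (index, title_text), title_level in zip(title_list, title_levels):
        # Direct assignment by index: no search over the whole document,
        # so identical text elsewhere is left untouched.
        rendered[index] = title_level + title_text

    # Concatenate once at the end.
    return "\n\n".join(rendered)


elements = [
    ("title", "Overview"),
    ("text", "Overview of the method."),
    ("title", "Overview"),
    ("table", "| a | b |"),
]
md_content = render_with_title_levels(elements, ["# ", "## "])
```

In this example the two titles share the same text but receive different levels, and the body sentence that starts with "Overview" is not modified.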
new_title = title_level + title_text
md_content = re.sub(re.escape(old_title), new_title, md_content)
new_title_list.append(new_title)
I think we could add a log here showing which old_titles were converted into which new titles; it would also make later debugging easier.
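Something along these lines could work, using the standard `logging` module. The helper name `log_title_changes` and the choice of log levels are assumptions for illustration; the PR may already have its own logger setup that should be used instead.

```python
import logging

logger = logging.getLogger(__name__)


def log_title_changes(old_titles, new_titles):
    """Log each old -> new title rewrite and return the pairs that changed.

    Hypothetical helper: pairs titles positionally, so both lists are
    assumed to be in the same order.
    """
    changes = [
        (old, new) for old, new in zip(old_titles, new_titles) if old != new
    ]
    for old, new in changes:
        # %r keeps leading '#' characters and whitespace visible in the log.
        logger.debug("title rewritten: %r -> %r", old, new)
    logger.info("rewrote %d of %d titles", len(changes), len(old_titles))
    return changes


changed = log_title_changes(["Intro", "## Method"], ["# Intro", "## Method"])
```

Logging the individual pairs at DEBUG and only a summary at INFO keeps normal runs quiet while still leaving a trail when heading levels come out wrong.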