Fix markdown reader image path #334

Open
wants to merge 4 commits into base: feature

Conversation

Ceceliachenen
Collaborator

No description provided.

@Ceceliachenen requested a review from moria97 on January 7, 2025 08:40

github-actions bot commented Jan 7, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| --- | --- | --- | --- | --- |
| 8633 | 4256 | 49% | 40% | 🟢 |

New Files

No new covered files...

Modified Files

| File | Coverage | Status |
| --- | --- | --- |
| src/pai_rag/integrations/readers/pai_html_reader.py | 89% | 🟢 |
| src/pai_rag/integrations/readers/pai_markdown_reader.py | 28% | 🟢 |
| TOTAL | 59% | 🟢 |

updated for commit: fd90c57 by action🐍

@moria97 changed the title from "fix_markdown_reader_image_path" to "Fix markdown reader image path" on Jan 7, 2025
@@ -55,9 +58,12 @@ def replace_image_paths(self, markdown_name: str, content: str):
for match in html_image_matches:
Collaborator

It just occurred to me that HTML files may need the same handling.

@@ -55,9 +58,12 @@ def replace_image_paths(self, markdown_name: str, content: str):
for match in html_image_matches:
    full_match = match.group(0)  # the entire match
    local_url = match.group(1)  # the captured URL
    image_name = os.path.basename(local_url)
Collaborator

The image may have parent directories, for example "figures/docs/1.jpg".
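
To illustrate the concern, here is a minimal sketch of how `os.path.basename` discards the parent directories of an image reference. The paths are hypothetical examples, not taken from the repository.

```python
import os

# Hypothetical image references as they might appear in a markdown file.
local_urls = ["1.jpg", "figures/docs/1.jpg", "figures/other/1.jpg"]

# os.path.basename keeps only the final path component, so references
# that live in different directories collapse to the same name.
image_names = [os.path.basename(url) for url in local_urls]
print(image_names)
```

All three references end up with the same `image_name`, so a lookup keyed on the basename alone cannot tell them apart.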

Collaborator

I suggest writing it like this:

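import os
from urllib.parse import urlparse

def is_url(url):
    """Return True if `url` is a URL (has both a scheme and a network location)."""
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

# Directory that contains the markdown file.
base_dir = os.path.dirname(markdown_path)
if not is_url(image_path):
    image_path = os.path.join(base_dir, image_path)  # an absolute image_path is returned unchanged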
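
A short usage sketch of the suggestion above, wrapped in a hypothetical `resolve_image_path` helper so it can be run on its own. The helper name, the sample markdown path, and the sample image references are assumptions for illustration. The sketch uses `os.path.dirname` to obtain the markdown file's directory.

```python
import os
from urllib.parse import urlparse


def is_url(path: str) -> bool:
    """Return True if path has both a scheme and a network location."""
    try:
        result = urlparse(path)
    except ValueError:
        return False
    return bool(result.scheme and result.netloc)


def resolve_image_path(markdown_path: str, image_path: str) -> str:
    """Hypothetical helper: resolve image_path relative to the markdown file."""
    if is_url(image_path):
        return image_path  # remote images are left untouched
    base_dir = os.path.dirname(markdown_path)
    # os.path.join returns image_path unchanged when it is already absolute.
    return os.path.join(base_dir, image_path)


markdown_path = "/data/docs/guide.md"  # assumed location, for illustration only
nested = resolve_image_path(markdown_path, "figures/docs/1.jpg")
remote = resolve_image_path(markdown_path, "https://example.com/1.jpg")
absolute = resolve_image_path(markdown_path, "/tmp/1.jpg")
print(nested, remote, absolute, sep="\n")
```

The nested reference keeps its `figures/docs/` prefix, while the URL and the absolute path pass through unchanged. This only helps if the directory structure exists on disk next to the markdown file.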

Collaborator Author

When an image is uploaded, its parent directories are not preserved.

else:
content = content.replace(f"![{alt_text}]({image_url})", "")
content = content.replace(full_match, "")
for match in html_image_matches:
Collaborator

This block looks identical to lines 165-178 above. Could the two be merged?
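
One way the two loops might be merged is sketched below. This is not the PR's code: the regular expressions, the `replace_image_refs` helper, and the `uploaded_urls` mapping (image name to uploaded URL) are all assumptions made for illustration, since the reader's actual patterns and upload logic are not shown in this conversation.

```python
import os
import re

# Assumed patterns; the PR's actual regular expressions are not shown.
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")
HTML_IMAGE = re.compile(r"<img[^>]*\bsrc=[\"']([^\"']+)[\"'][^>]*>")


def replace_image_refs(content, patterns, uploaded_urls):
    """Rewrite or drop image references for every pattern in one loop."""
    for pattern in patterns:
        # Materialize the matches first, because content changes below.
        for match in list(pattern.finditer(content)):
            full_match = match.group(0)  # the entire image reference
            local_url = match.group(1)  # the captured URL
            image_name = os.path.basename(local_url)
            if image_name in uploaded_urls:
                # Swap only the URL so the surrounding syntax is kept.
                new_text = full_match.replace(local_url, uploaded_urls[image_name])
            else:
                new_text = ""  # drop references to images that were not uploaded
            content = content.replace(full_match, new_text)
    return content


content = 'Intro ![a](figures/1.jpg) and <img src="figures/2.png"> and ![b](missing.gif)'
uploaded_urls = {
    "1.jpg": "https://example.com/1.jpg",
    "2.png": "https://example.com/2.png",
}
result = replace_image_refs(content, [MARKDOWN_IMAGE, HTML_IMAGE], uploaded_urls)
print(result)
```

Because both syntaxes expose the URL as capture group 1, the same loop body serves markdown and HTML image references, and the same helper could be reused by the HTML reader.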

2 participants