Fix error when exporting table annotations if a cell spans multiple rows and columns #119

Merged
merged 5 commits into from
Nov 26, 2024

BotAndyGao
Contributor

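Fixes the error raised on export when a cell spans multiple rows and columns.

Software version where the issue reproduces

  • PPOCRLabel v2.1.11

Resources for reproducing the issue

test

Bug: when a cell spans multiple rows and columns, exporting table annotations fails while the raw HTML is being converted into the label format.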

bug_info

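  • Cause
    When a cell spans multiple rows and columns, the row's list entry for that cell is ['td', ' colspan=2 rowspan=2', None, 'td'].
    The code fails when it processes the ' colspan=2 rowspan=2' element.
  • Failing code
    The value sliced out of the string is wrong, so the int conversion raises an error.
 _, n = col.split("colspan=")  # wrong slice: n becomes "2 rowspan=2"
 token_list.append(' colspan="{}"'.format(int(n)))  # int() raises ValueError here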
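The failure can be reproduced outside PPOCRLabel. The snippet below is a minimal sketch that uses only the `col` value quoted in the report; the variable `error` is introduced here purely for illustration.

```python
# Attribute string for a cell that spans two rows and two columns,
# as quoted in the bug report.
col = " colspan=2 rowspan=2"

# Splitting on "colspan=" leaves the rowspan attribute attached to the value.
_, n = col.split("colspan=")
print(repr(n))

# int() cannot parse the leftover text, so the export aborts at this point.
error = None
try:
    int(n)
except ValueError as exc:
    error = exc
    print("ValueError:", exc)
```

Because the split only strips the text before the first attribute value, any cell that carries both attributes will hit this path.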

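Bug fix

  • Solution
    • colspan and rowspan must be extracted from col individually instead of splitting the string with split. A regular expression, or more careful string handling, ensures each attribute's value is extracted correctly.
    • Before splitting the string, normalize extra whitespace and attribute order so that each attribute value stands on its own.
  • Key code
import re

# Use regex to match the "colspan" and "rowspan" attributes and their values
colspan_match = re.search(r"colspan=(\d+)", col)
rowspan_match = re.search(r"rowspan=(\d+)", col)
# Append each attribute only if it is present in col
if colspan_match:
    token_list.append(f' colspan="{colspan_match.group(1)}"')
if rowspan_match:
    token_list.append(f' rowspan="{rowspan_match.group(1)}"')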
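For experimenting with the regex approach in isolation, the same logic can be wrapped in a small helper. This is a sketch under assumptions: the function name `span_tokens` is hypothetical and is not part of the PPOCRLabel code base; only the regex patterns and the output token format are taken from the key code in this PR.

```python
import re


def span_tokens(col):
    """Return structure tokens for the colspan/rowspan attributes found in col.

    Hypothetical helper for illustration only; it mirrors the regex
    approach from the fix but is not the project's actual function.
    """
    tokens = []
    # Look up each attribute independently so order and spacing do not matter.
    for name in ("colspan", "rowspan"):
        match = re.search(rf"{name}=(\d+)", col)
        if match:
            tokens.append(f' {name}="{match.group(1)}"')
    return tokens


print(span_tokens(" colspan=2 rowspan=2"))
print(span_tokens(" rowspan=2 colspan=3"))
print(span_tokens(""))
```

Note that, like the key code, this sketch always emits colspan before rowspan regardless of their order in the source string, and it returns an empty list when neither attribute is present.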

Bug verification

Re-exporting with the problematic image now succeeds.

{"filename": "test.png", "html": {"structure": {"tokens": ["<tbody>", "<tr>", "<td>", "</td>", "<td", " colspan=\"2\"", " rowspan=\"2\"", ">", "</td>", "<td>", "</td>", "</tr>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "<tr>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</tbody>"]}, "cells": [{"tokens": ["1", "2", "3"], "bbox": [[11, 7], [57, 7], [57, 30], [11, 30]]}, {"tokens": ["测", "试"], "bbox": [[183, 5], [237, 5], [237, 30], [183, 30]]}, {"tokens": ["5", "5", "4"], "bbox": [[530, 7], [576, 7], [576, 30], [530, 30]]}, {"tokens": ["5", "8", "5"], "bbox": [[11, 35], [57, 35], [57, 58], [11, 58]]}, {"tokens": ["5", "5", "4", "5"], "bbox": [[531, 35], [584, 35], [584, 57], [531, 57]]}, {"tokens": ["1", "2", "3"], "bbox": [[11, 61], [56, 61], [56, 85], [11, 85]]}, {"tokens": ["4", "5", "5"], "bbox": [[184, 61], [229, 61], [229, 84], [184, 84]]}, {"tokens": ["7", "7", "8"], "bbox": [[357, 61], [402, 61], [402, 84], [357, 84]]}, {"tokens": ["5", "6", "6"], "bbox": [[531, 61], [576, 61], [576, 85], [531, 85]]}]}, "gt": "<html><body><table><tbody><tr><td>123</td><td colspan=2 rowspan=2>测试</td><td>554</td></tr><tr><td>585</td><td>5545</td></tr><tr><td>123</td><td>455</td><td>778</td><td>566</td></tr></tbody></table></body></html>"}
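One way to sanity-check an exported record like the one above is to confirm that the number of closed cells in the structure tokens matches the number of entries in "cells". The sketch below is an illustration only, not part of PPOCRLabel: `count_cells` is a hypothetical helper, the token list is copied from the record, and `cell_texts` is obtained by joining each cell's tokens.

```python
# Structure tokens copied from the exported record.
structure_tokens = [
    "<tbody>", "<tr>", "<td>", "</td>",
    "<td", ' colspan="2"', ' rowspan="2"', ">", "</td>",
    "<td>", "</td>", "</tr>",
    "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
    "<tr>", "<td>", "</td>", "<td>", "</td>",
    "<td>", "</td>", "<td>", "</td>", "</tr>",
    "</tbody>",
]

# Text of each entry in "cells", with the per-character tokens joined.
cell_texts = ["123", "测试", "554", "585", "5545", "123", "455", "778", "566"]


def count_cells(tokens):
    """Count table cells by counting closing </td> tokens (hypothetical helper)."""
    return sum(1 for token in tokens if token == "</td>")


structure_html = "".join(structure_tokens)
print(count_cells(structure_tokens), len(cell_texts))
print(structure_html)
```

The spanning cell appears as the four tokens `<td`, ` colspan="2"`, ` rowspan="2"`, `>`, which is the shape the fixed export produces.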

Collaborator

@GreatV GreatV left a comment

LGTM

@GreatV GreatV merged commit 72bf34f into PFCCLab:main Nov 26, 2024
1 check passed
Collaborator

GreatV commented Dec 4, 2024

Hi @BotAndyGao,

Thank you so much for your amazing open-source contribution! Your work on PFCCLab/PPOCRLabel has been incredibly helpful, and we really appreciate your efforts in advancing the community.

I’d like to invite you to join our PaddleOCR Open Source Community on WeChat. It’s a group of developers and researchers collaborating to improve PaddleOCR, and your insights would be highly valuable.

If you're interested, please add me on WeChat at wx22wx and kindly mention your GitHub ID when adding me. I’ll send you an invite to the group!

Thanks again, and looking forward to connecting!

Best,
Wang Xin (GreatV)


