Fix error when exporting table annotations for cells that span multiple rows and columns #119

BotAndyGao · 2024-11-26T08:34:57Z

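Fix the export error that occurs when a cell spans multiple rows and columns

Software version where the issue reproduces

PPOCRLabel v2.1.11

Resources for reproducing the issue

Bug: when a cell spans multiple rows and columns, exporting table annotations fails while converting the original HTML into the label format.

Cause
When a cell spans multiple rows and columns, the row's list entry for that cell is ['td', ' colspan=2 rowspan=2', None, 'td'].
The code raises an error when it processes the ' colspan=2 rowspan=2' element.
Failing code
The split extracts the wrong value, so the conversion to int raises an error.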

 _, n = col.split("colspan=")  # wrong split: n becomes "2 rowspan=2" instead of "2"
 token_list.append(' colspan="{}"'.format(int(n)))  # int() conversion fails here
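The failure can be reproduced outside PPOCRLabel with the attribute string quoted above. This is a minimal sketch: the helper name `split_colspan` is hypothetical and only mimics the split shown in the failing code.

```python
def split_colspan(col: str) -> str:
    """Mimic the original extraction: return everything after 'colspan='."""
    _, n = col.split("colspan=")
    return n


# Attribute string for a cell that spans two rows and two columns.
col = " colspan=2 rowspan=2"
n = split_colspan(col)
print(repr(n))  # the remainder still carries the rowspan attribute

try:
    int(n)
except ValueError as exc:
    # int() cannot parse a string that contains a second attribute.
    print("conversion failed:", exc)
```

Because the remainder after `colspan=` includes the trailing ` rowspan=2`, `int()` rejects it with a `ValueError`.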

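Bug fix

Solution
- colspan and rowspan must be extracted from col correctly instead of splitting it directly with split. A regular expression, or more careful string handling, ensures that each attribute's value is extracted correctly.
- When splitting the string, first clean up extra whitespace and attribute ordering so that each attribute value stands on its own.
Key code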

import re  # required at the top of the module

# Use regex to match the "colspan" and "rowspan" attributes and their values
colspan_match = re.search(r"colspan=(\d+)", col)
rowspan_match = re.search(r"rowspan=(\d+)", col)
if colspan_match:
    token_list.append(f' colspan="{colspan_match.group(1)}"')
if rowspan_match:
    token_list.append(f' rowspan="{rowspan_match.group(1)}"')
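For experimenting with the regex approach in isolation, the same logic can be wrapped in a small function. This is a sketch rather than the code merged in the PR: the function name `span_tokens` is hypothetical, and it assumes unquoted numeric attribute values as in the example string above.

```python
import re
from typing import List


def span_tokens(col: str) -> List[str]:
    """Return label tokens for any colspan/rowspan attributes found in col."""
    tokens = []
    for name in ("colspan", "rowspan"):
        # Each attribute is searched independently, so order and spacing do not matter.
        match = re.search(rf"{name}=(\d+)", col)
        if match:
            tokens.append(f' {name}="{match.group(1)}"')
    return tokens


print(span_tokens(" colspan=2 rowspan=2"))
print(span_tokens(" rowspan=3"))
```

Searching for each attribute separately means a string holding only one of the two attributes, or both in either order, is handled without a split.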

Bug verification

Re-exporting with the problem image now works correctly.

{"filename": "test.png", "html": {"structure": {"tokens": ["<tbody>", "<tr>", "<td>", "</td>", "<td", " colspan=\"2\"", " rowspan=\"2\"", ">", "</td>", "<td>", "</td>", "</tr>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "<tr>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</tbody>"]}, "cells": [{"tokens": ["1", "2", "3"], "bbox": [[11, 7], [57, 7], [57, 30], [11, 30]]}, {"tokens": ["测", "试"], "bbox": [[183, 5], [237, 5], [237, 30], [183, 30]]}, {"tokens": ["5", "5", "4"], "bbox": [[530, 7], [576, 7], [576, 30], [530, 30]]}, {"tokens": ["5", "8", "5"], "bbox": [[11, 35], [57, 35], [57, 58], [11, 58]]}, {"tokens": ["5", "5", "4", "5"], "bbox": [[531, 35], [584, 35], [584, 57], [531, 57]]}, {"tokens": ["1", "2", "3"], "bbox": [[11, 61], [56, 61], [56, 85], [11, 85]]}, {"tokens": ["4", "5", "5"], "bbox": [[184, 61], [229, 61], [229, 84], [184, 84]]}, {"tokens": ["7", "7", "8"], "bbox": [[357, 61], [402, 61], [402, 84], [357, 84]]}, {"tokens": ["5", "6", "6"], "bbox": [[531, 61], [576, 61], [576, 85], [531, 85]]}]}, "gt": "<html><body><table><tbody><tr><td>123</td><td colspan=2 rowspan=2>测试</td><td>554</td></tr><tr><td>585</td><td>5545</td></tr><tr><td>123</td><td>455</td><td>778</td><td>566</td></tr></tbody></table></body></html>"}
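One quick sanity check on an exported record like the one above is to count the cell openers in `structure.tokens`. The sketch below assumes every cell starts with either a `<td>` token or the split form `<td`, as in this record; the helper name `count_cell_openers` is hypothetical and not part of PPOCRLabel.

```python
def count_cell_openers(tokens):
    """Count tokens that open a table cell: '<td>' or the split form '<td'."""
    return sum(1 for tok in tokens if tok in ("<td>", "<td"))


# First row of the record above: a plain cell, the spanning cell, a plain cell.
row_tokens = [
    "<tr>", "<td>", "</td>",
    "<td", ' colspan="2"', ' rowspan="2"', ">", "</td>",
    "<td>", "</td>", "</tr>",
]
print(count_cell_openers(row_tokens))  # 3
```

In the record above, the full token list has nine such openers, which matches the nine entries in `cells`.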

2. Fix HTML tag compliance in the gt attribute of the exported gt file

GreatV

LGTM

GreatV · 2024-12-04T14:46:16Z

Hi @BotAndyGao,

Thank you so much for your amazing open-source contribution! Your work on PFCCLab/PPOCRLabel has been incredibly helpful, and we really appreciate your efforts in advancing the community.

I’d like to invite you to join our PaddleOCR Open Source Community on WeChat. It’s a group of developers and researchers collaborating to improve PaddleOCR, and your insights would be highly valuable.

If you're interested, please add me on WeChat at wx22wx and kindly mention your GitHub ID when adding me. I’ll send you an invite to the group!

Thanks again, and looking forward to connecting!

Best,
Wang Xin (GreatV)

BotAndyGao added 5 commits September 13, 2024 14:54

1. Fix the exception raised when adding colspan and rowspan while exporting table annotations

9146711

2. Fix HTML tag compliance in the gt attribute of the exported gt file

fix code style

5dfe1d0

Merge remote-tracking branch 'refs/remotes/upstream/main'

ecadc9e

Merge branch 'PFCCLab:main' into main

9e7ae19

Fix the export error for cells that span both multiple rows and multiple columns. Issue: error when exporting table annotations PFCCLab#113

8cc4ab6

GreatV approved these changes Nov 26, 2024

GreatV merged commit 72bf34f into PFCCLab:main Nov 26, 2024
1 check passed
