Merge pull request #24 from RapidAI/wired_table_optim
Wired table optim
SWHL authored Sep 18, 2024
2 parents 37b6b76 + ae3f873 commit 594b4b6
Showing 19 changed files with 2,029 additions and 335 deletions.
74 changes: 37 additions & 37 deletions .github/workflows/lineless_table_rec.yml
@@ -35,40 +35,40 @@ jobs:
pytest tests/test_lineless_table_rec.py
GenerateWHL_PushPyPi:
needs: UnitTesting
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- name: Set up Python 3.7
uses: actions/setup-python@v4
with:
python-version: '3.7'
architecture: 'x64'

- name: Run setup.py
run: |
pip install -r requirements.txt
python -m pip install --upgrade pip
pip install wheel get_pypi_latest_version
wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/lineless_table_rec_models.zip
unzip lineless_table_rec_models.zip
mv lineless_table_rec_models/*.onnx lineless_table_rec/models/
python setup_lineless.py bdist_wheel "${{ github.event.head_commit.message }}"
# - name: Publish distribution 📦 to Test PyPI
# uses: pypa/[email protected]
# with:
# password: ${{ secrets.TEST_PYPI_API_TOKEN }}
# repository_url: https://test.pypi.org/legacy/
# packages_dir: dist/

- name: Publish distribution 📦 to PyPI
uses: pypa/[email protected]
with:
password: ${{ secrets.PYPI_API_TOKEN }}
packages_dir: dist/
# GenerateWHL_PushPyPi:
# needs: UnitTesting
# runs-on: ubuntu-latest
#
# steps:
# - uses: actions/checkout@v3
#
# - name: Set up Python 3.7
# uses: actions/setup-python@v4
# with:
# python-version: '3.7'
# architecture: 'x64'
#
# - name: Run setup.py
# run: |
# pip install -r requirements.txt
# python -m pip install --upgrade pip
# pip install wheel get_pypi_latest_version
#
# wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/lineless_table_rec_models.zip
# unzip lineless_table_rec_models.zip
# mv lineless_table_rec_models/*.onnx lineless_table_rec/models/
#
# python setup_lineless.py bdist_wheel "${{ github.event.head_commit.message }}"
#
# # - name: Publish distribution 📦 to Test PyPI
# # uses: pypa/[email protected]
# # with:
# # password: ${{ secrets.TEST_PYPI_API_TOKEN }}
# # repository_url: https://test.pypi.org/legacy/
# # packages_dir: dist/
#
# - name: Publish distribution 📦 to PyPI
# uses: pypa/[email protected]
# with:
# password: ${{ secrets.PYPI_API_TOKEN }}
# packages_dir: dist/
60 changes: 30 additions & 30 deletions .github/workflows/table_cls.yml
@@ -35,33 +35,33 @@ jobs:
pytest tests/test_table_cls.py
GenerateWHL_PushPyPi:
needs: UnitTesting
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- name: Set up Python 3.10
uses: actions/setup-python@v4
with:
python-version: '3.10'
architecture: 'x64'

- name: Run setup.py
run: |
pip install -r requirements.txt
python -m pip install --upgrade pip
pip install wheel get_pypi_latest_version
wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/table_cls_models.zip
unzip table_cls_models.zip
mv table_cls_models/*.onnx table_cls/models/
python setup_table_cls.py bdist_wheel "${{ github.event.head_commit.message }}"
- name: Publish distribution 📦 to PyPI
uses: pypa/[email protected]
with:
password: ${{ secrets.TABLE_CLS }}
packages_dir: dist/
# GenerateWHL_PushPyPi:
# needs: UnitTesting
# runs-on: ubuntu-latest
#
# steps:
# - uses: actions/checkout@v3
#
# - name: Set up Python 3.10
# uses: actions/setup-python@v4
# with:
# python-version: '3.10'
# architecture: 'x64'
#
# - name: Run setup.py
# run: |
# pip install -r requirements.txt
# python -m pip install --upgrade pip
# pip install wheel get_pypi_latest_version
#
# wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/table_cls_models.zip
# unzip table_cls_models.zip
# mv table_cls_models/*.onnx table_cls/models/
#
# python setup_table_cls.py bdist_wheel "${{ github.event.head_commit.message }}"
#
# - name: Publish distribution 📦 to PyPI
# uses: pypa/[email protected]
# with:
# password: ${{ secrets.TABLE_CLS }}
# packages_dir: dist/
60 changes: 30 additions & 30 deletions .github/workflows/wired_table_rec.yml
@@ -35,33 +35,33 @@ jobs:
pytest tests/test_wired_table_rec.py
GenerateWHL_PushPyPi:
needs: UnitTesting
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3

- name: Set up Python 3.7
uses: actions/setup-python@v4
with:
python-version: '3.7'
architecture: 'x64'

- name: Run setup.py
run: |
pip install -r requirements.txt
python -m pip install --upgrade pip
pip install wheel get_pypi_latest_version
wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/wired_table_rec_models.zip
unzip wired_table_rec_models.zip
mv wired_table_rec_models/*.onnx wired_table_rec/models/
python setup_wired.py bdist_wheel "${{ github.event.head_commit.message }}"
- name: Publish distribution 📦 to PyPI
uses: pypa/[email protected]
with:
password: ${{ secrets.PYPI_API_TOKEN }}
packages_dir: dist/
# GenerateWHL_PushPyPi:
# needs: UnitTesting
# runs-on: ubuntu-latest
#
# steps:
# - uses: actions/checkout@v3
#
# - name: Set up Python 3.7
# uses: actions/setup-python@v4
# with:
# python-version: '3.7'
# architecture: 'x64'
#
# - name: Run setup.py
# run: |
# pip install -r requirements.txt
# python -m pip install --upgrade pip
# pip install wheel get_pypi_latest_version
#
# wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/wired_table_rec_models.zip
# unzip wired_table_rec_models.zip
# mv wired_table_rec_models/*.onnx wired_table_rec/models/
#
# python setup_wired.py bdist_wheel "${{ github.event.head_commit.message }}"
#
# - name: Publish distribution 📦 to PyPI
# uses: pypa/[email protected]
# with:
# password: ${{ secrets.PYPI_API_TOKEN }}
# packages_dir: dist/
128 changes: 96 additions & 32 deletions README.md
@@ -1,6 +1,6 @@
<div align="center">
<div align="center">
<h1><b>📊 Table Structure Recognition</b></h1>
<h1><b>📊 Table Structure Recognition</b></h1>
</div>
<a href=""><img src="https://img.shields.io/badge/Python->=3.6,<3.12-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Mac%2C%20Win-pink.svg"></a>
@@ -10,61 +10,125 @@
<a href="https://semver.org/"><img alt="SemVer2.0" src="https://img.shields.io/badge/SemVer-2.0-brightgreen"></a>
<a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
<a href="https://github.com/RapidAI/TableStructureRec/blob/c41bbd23898cb27a957ed962b0ffee3c74dfeff1/LICENSE"><img alt="GitHub" src="https://img.shields.io/badge/license-Apache 2.0-blue"></a>
</div>

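### Introduction

💖 This repository is an inference library for structured recognition of tables in documents. It includes the table recognition model from Paddle,
the wired and lineless table recognition models from Alibaba Duguang, a wired table model contributed by llaipython (WeChat), and the table classification model built into NetEase QAnything.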

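#### Features
**Fast**: Uses ONNXRuntime as the inference engine; single-image inference on CPU takes 1-7 s.

🎯 **Accurate**: Combines a table-type classification model to distinguish wired from lineless tables, so each task is more specialized and accuracy is higher.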

[简体中文](./docs/README_zh.md) | English
🛡️ **Stable**: Does not depend on any third-party training framework, only on essential base libraries, which avoids package conflicts.

### Demo
<div align="center">
<img src="https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/demo_img_output.gif" alt="Demo" width="100%" height="100%">
</div>

### Introduction
### Benchmark Results
[TableRecognitionMetric evaluation tool](https://github.com/SWHL/TableRecognitionMetric) [Evaluation dataset](https://huggingface.co/datasets/SWHL/table_rec_test_dataset) [Rapid OCR](https://github.com/RapidAI/RapidOCR)

This repo is an inference library used for structured recognition of tables in documents, including table structure recognition algorithm models from PaddleOCR, wired and wireless table recognition algorithm models from Alibaba Duguang, etc.
| Method                                                                                                                     |  TEDS   |
|:---------------------------------------------------------------------------------------------------------------------------|:-------:|
| lineless_table_rec                                                                                                         | 0.53561 |
| [RapidTable](https://github.com/RapidAI/RapidStructure/blob/b800b156015bf5cd6f5429295cdf48be682fd97e/docs/README_Table.md) | 0.58786 |
| wired_table_rec v1                                                                                                         | 0.70279 |
| wired_table_rec v2                                                                                                         | 0.78007 |
| table_cls + wired_table_rec v1 + lineless_table_rec                                                                        | 0.74692 |
| table_cls + wired_table_rec v2 + lineless_table_rec                                                                        | 0.80235 |
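
For reference, TEDS (Tree-Edit-Distance-based Similarity), the metric reported in the table, is commonly defined as below. Here $T_a$ and $T_b$ are the HTML trees of the predicted and ground-truth tables, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in $T$. This restates the standard definition and is not taken from this repository's evaluation script, which may differ in detail:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the two trees are identical. On that scale, the combined pipeline with wired_table_rec v2 (0.80235) scores about 0.055 higher than the same pipeline with wired_table_rec v1 (0.74692).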

The repo has improved the pre- and post-processing of form recognition and combined with OCR to ensure that the form recognition part can be used directly.
### Installation
``` python {linenos=table}
pip install wired_table_rec lineless_table_rec table_cls
```

The repo will continue to focus on the field of table recognition, integrate the latest and most useful table recognition algorithms, and strive to create the most valuable table recognition tool library.
### Quick Start
``` python {linenos=table}
import os

Welcome everyone to continue to pay attention.
from lineless_table_rec import LinelessTableRecognition
from lineless_table_rec.utils_table_recover import format_html, plot_rec_box_with_logic_info, plot_rec_box
from table_cls import TableCls
from wired_table_rec import WiredTableRecognition

### What is Table Structure Recognition?
lineless_engine = LinelessTableRecognition()
wired_engine = WiredTableRecognition()
table_cls = TableCls()
img_path = f'images/img14.jpg'

Table Structure Recognition (TSR) aims to extract the logical or physical structure of table images, thereby converting unstructured table images into machine-readable formats.
# Classify the table type first, then route to the matching engine
cls, elasp = table_cls(img_path)
if cls == 'wired':
    table_engine = wired_engine
else:
    table_engine = lineless_engine
html, elasp, polygons, logic_points, ocr_res = table_engine(img_path)
print(f"elasp: {elasp}")

Logical structure: represents the row/column relationship of cells (such as the same row, the same column) and the span information of cells.
# output_dir = 'outputs'
# complete_html = format_html(html)
# os.makedirs(os.path.dirname(f"{output_dir}/table.html"), exist_ok=True)
# with open(f"{output_dir}/table.html", "w", encoding="utf-8") as file:
#     file.write(complete_html)
# # Visualize the table recognition boxes with logical row/column information
# plot_rec_box_with_logic_info(
#     img_path, f"{output_dir}/table_rec_box.jpg", logic_points, polygons
# )
# # Visualize the OCR recognition boxes
# plot_rec_box(img_path, f"{output_dir}/ocr_box.jpg", ocr_res)
```

Physical structure: includes not only the logical structure, but also the cell's bounding box, content and other information, emphasizing the physical location of the cell.
## FAQ (Frequently Asked Questions)

<div align='center'>
<img src="https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/TSRFramework.jpg" width=70%>
</div>
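1. **Q: Can skewed images be processed?**
   - A: The project does not currently support recognition of skewed images. Please correct the image first. PRs that address this are welcome.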

Figure from: [Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling](https://openaccess.thecvf.com/content/CVPR2023/html/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper.html)
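2. **Q: The recognized boxes are missing their inner text.**
   - A: The small rapidocr model is used by default. For higher accuracy, download a more accurate OCR model from the [model list](https://rapidai.github.io/RapidOCRDocs/model_list/#_1)
     and pass `ocr_result` at call time.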

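3. **Q: Do the models support GPU acceleration?**
   - A: Inference with the table models is already fast: on the order of 100 ms for wired tables and 500 ms for lineless tables.
     Most of the time is spent in the OCR stage; see [rapidocr_paddle](https://rapidai.github.io/RapidOCRDocs/install_usage/rapidocr_paddle/usage/#_3) to speed up OCR.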
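
For FAQ item 2, a minimal sketch of reusing a precomputed OCR result might look like the following. It assumes, per the answer in that item, that the table engines accept an `ocr_result` keyword argument and return the same five values as in the quick-start example. The helper name `recognize_with_custom_ocr` and the two stand-in engines are hypothetical, included only so the snippet runs without any models installed.

```python
from typing import Any, Callable, Tuple


def recognize_with_custom_ocr(
    table_engine: Callable[..., Tuple[Any, ...]],
    ocr_engine: Callable[[str], Tuple[Any, Any]],
    img_path: str,
) -> Tuple[Any, ...]:
    """Run OCR once with a caller-chosen engine, then hand the result to the table engine."""
    # Assumption: the OCR engine returns a (results, elapsed) pair.
    ocr_result, _ = ocr_engine(img_path)
    # Assumption: the table engine takes `ocr_result` as a keyword argument (see FAQ item 2).
    return table_engine(img_path, ocr_result=ocr_result)


# Hypothetical stand-ins so the sketch is runnable on its own.
def fake_ocr(img_path: str):
    # One OCR entry in a [box, text, score] layout, plus a dummy elapsed time.
    return [[[[0, 0], [10, 0], [10, 5], [0, 5]], "cell text", 0.99]], 0.01


def fake_table_engine(img_path: str, ocr_result=None):
    # Builds a one-cell table from the first OCR entry's text.
    html = "<table><tr><td>" + ocr_result[0][1] + "</td></tr></table>"
    return html, 0.1, [], [], ocr_result


html, elasp, polygons, logic_points, ocr_res = recognize_with_custom_ocr(
    fake_table_engine, fake_ocr, "images/img14.jpg"
)
print(html)
```

With the real packages, `ocr_engine` would presumably be an OCR instance configured with the higher-accuracy model files (for example a `RapidOCR` object from `rapidocr_onnxruntime`), and `table_engine` would be `wired_engine` or `lineless_engine` from the quick-start example.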

### Documentation
### TODO List
- [ ] Correct image skew before recognition
- [ ] Enlarge the dataset and add more evaluation comparisons
- [ ] Optimize the lineless table model

Full documentation can be found on [docs](https://rapidai.github.io/TableStructureRec/docs/), in Chinese.
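### Processing Pipeline
```mermaid
flowchart TD
    A[/Table image/] --> B([Table classification])
    B --> C([Wired table recognition]) & D([Lineless table recognition]) --> E([Text recognition rapidocr_onnxruntime])
    E --> F[/Structured HTML output/]
```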

### Acknowledgements
### Acknowledgements

[PaddleOCR Table](https://github.com/PaddlePaddle/PaddleOCR/blob/4b17511491adcfd0f3e2970895d06814d1ce56cc/ppstructure/table/README_ch.md)
[PaddleOCR Table Recognition](https://github.com/PaddlePaddle/PaddleOCR/blob/4b17511491adcfd0f3e2970895d06814d1ce56cc/ppstructure/table/README_ch.md)

[Cycle CenterNet](https://www.modelscope.cn/models/damo/cv_dla34_table-structure-recognition_cycle-centernet/summary)
[Duguang Table Structure Recognition: Wired Tables](https://www.modelscope.cn/models/damo/cv_dla34_table-structure-recognition_cycle-centernet/summary)

[LORE](https://www.modelscope.cn/models/damo/cv_resnet-transformer_table-structure-recognition_lore/summary)
[Duguang Table Structure Recognition: Lineless Tables](https://www.modelscope.cn/models/damo/cv_resnet-transformer_table-structure-recognition_lore/summary)

### Contributing
[Qanything-RAG](https://github.com/netease-youdao/QAnything)

Pull requests are welcome. For major changes, please open an issue first
to discuss what you would like to change.
Many thanks to llaipython (WeChat; offers a full paid high-accuracy table extraction service) for providing the high-accuracy wired table model.

Please make sure to update tests as appropriate.
### Contributing

### [Sponsor](https://rapidai.github.io/Knowledge-QA-LLM/docs/sponsor/)
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

If you want to sponsor the project, you can directly click the **Buy me a coffee** image, please write a note (e.g. your github account name) to facilitate adding to the sponsorship list below.
Please make sure to update tests as appropriate.

<div align="left">
<a href="https://www.buymeacoffee.com/SWHL"><img src="https://raw.githubusercontent.com/RapidAI/.github/main/assets/buymeacoffe.png" width="30%" height="30%"></a>
</div>
### [Sponsor](https://rapidai.github.io/Knowledge-QA-LLM/docs/sponsor/)

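If you would like to sponsor this project, click the Sponsor button at the top of this page and include a note (**your GitHub account name**) so that you can be added to the sponsor list.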

### License
### License

This project is released under the [Apache 2.0 license](https://github.com/RapidAI/TableStructureRec/blob/c41bbd23898cb27a957ed962b0ffee3c74dfeff1/LICENSE).
This project is released under the [Apache 2.0](https://github.com/RapidAI/TableStructureRec/blob/c41bbd23898cb27a957ed962b0ffee3c74dfeff1/LICENSE) open-source license.