Commit 6e56d1a

better docs
1 parent 4b5d31d commit 6e56d1a

5 files changed

+14145
-26
lines changed

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+*.ipynb linguist-detectable=false

README.md

Lines changed: 16 additions & 13 deletions
@@ -7,23 +7,23 @@
 
 **tables**!
 
-There are many pdfs out there, and many of those pdfs have tables. But despite a plethora of table extraction options, there is still no consensus on a definitive extraction method.
+There are many pdfs out there, and many of those pdfs have tables. But despite a plethora of table extraction options, there is still no definitive extraction method.
 
 # About gmft
 
-gmft is a high-throughput toolkit for converting pdf tables to many formats, including cropped image, text + positions, plaintext, csv, and pandas dataframes.
+gmft is a lightweight, performant, high-throughput toolkit for converting pdf tables to many formats, including cropped image, text + positions, plaintext, csv, and pandas dataframes.
 
 gmft aims to "just work", offering strong performance with the default settings.
 
-gmft relies on microsoft's [Table Transformers](https://github.com/microsoft/table-transformer), which qualitatively is the most performant and reliable of many tested alternatives. See the comparison here.
+gmft relies on microsoft's [Table Transformers](https://github.com/microsoft/table-transformer), which qualitatively is the most performant and reliable of many tested alternatives. See the comparison [here](https://docs.google.com/spreadsheets/d/e/2PACX-1vSpMUb4oV7d3UwRrThKbjjfmoorjWhTm620BcX5dhQqo7MRaXmK04y8mH_hImw7JZs-NDzHui7jhAvN/pubhtml?gid=0&single=true).
 
 Install: `pip install gmft`
 
-Quickstart: [demo notebook](https;//github.com/conjuncts/gmft/blob/main/notebooks/demo.ipynb)
+Quickstart: [demo notebook](https;//github.com/conjuncts/gmft/blob/main/notebooks/demo.ipynb), [bulk extract](https://github.com/conjuncts/gmft/blob/main/notebooks/bulk_extract.ipynb).
 
 # Why use gmft?
 
-**TL;DR:** gmft is convenient, fast, lightweight, configurable, and gives great results. Check out the [demo notebook](https;//github.com/conjuncts/gmft/blob/main/notebooks/demo.ipynb) for the approximate extraction quality.
+**TL;DR:** gmft is convenient, fast, lightweight, configurable, and gives great results. Check out the [bulk extract](https://github.com/conjuncts/gmft/blob/main/notebooks/bulk_extract.ipynb) notebook for approximate extraction quality.
 
 ## Many Formats
 
@@ -47,13 +47,14 @@ Because of the relatively few dependencies and high throughput, gmft is very lig
 
 ### High throughput
 
-In most cases, OCR is not necessary; pdfs already contain text positional data. Using this existing data drastically speeds up inference. With that being said, gmft can still extract tables from images and scanned pdfs through the image output.
+Benchmark using Colab's **cpu** indicates an approximate rate of ~1.381 s/page; converting to df takes ~1.168 s/table. See the comparison here. This makes gmft about **10x faster** than alternatives like unstructured, nougat, and open-parse/unitable on cpu. ([src](https://docs.google.com/spreadsheets/d/e/2PACX-1vSpMUb4oV7d3UwRrThKbjjfmoorjWhTm620BcX5dhQqo7MRaXmK04y8mH_hImw7JZs-NDzHui7jhAvN/pubhtml?gid=0&single=true), [calculations](https://docs.google.com/spreadsheets/d/e/2PACX-1vSpMUb4oV7d3UwRrThKbjjfmoorjWhTm620BcX5dhQqo7MRaXmK04y8mH_hImw7JZs-NDzHui7jhAvN/pubhtml?gid=39227585&single=true)) How?
 
-Benchmark using Colab's **cpu** indicates an approximate rate of ~1.381 s/page; converting to df takes ~0.945 s/table. See the comparison here.
+- gmft focuses on table extraction, so figures, titles, sections, etc. are not extracted.
+- In most cases, OCR is not necessary; pdfs already contain text positional data. Using this existing data drastically speeds up inference. With that being said, gmft can still extract tables from images and scanned pdfs through the image output.
+- PyPDFium2 is chosen for its [high throughput](https://github.com/py-pdf/benchmarks) and permissive license.
+- The base model, tatr is blazing fast.
 
-Gmft focuses on table extraction, so figures, titles, sections, etc. are not extracted.
 
-PyPDFium2 is chosen for its [high throughput](https://github.com/py-pdf/benchmarks) and permissive license.
 
 ### Few dependencies
 
@@ -75,7 +76,9 @@ gmft uses Microsoft's TATR, which is trained on a diverse dataset, PubTables-1M.
 
 The authors are confident that the extraction quality is unmatched. When the model fails, it is usually an OCR issue, merged cell, or false positive. Even in these cases, the text is still highly useable. **Alignment of a value to its row/column header tends to be very accurate** because of the underlying maximization algorithm.
 
-We acknowledge UniTable, a newer model which achieves SOTA results in many datasets like PubLayNet and FinTabNet. Though we plan to support Unitable in the future, Unitable is much larger (~1.5 GB), taking almost 2 orders of magnitude (about x90) longer to run on cpu. Therefore, TATR is still used for its higher throughput. In addition, experimentation does not necessarily show a strict improvement in quality. Contrary to gmft, Unitable may fail first through misalignment because of misplaced html tags (see example.) This may impact use cases where alignment is critical.
+We acknowledge UniTable, a newer model which achieves SOTA results in many datasets like PubLayNet and FinTabNet. Though we plan to support Unitable in the future, Unitable is much larger (~1.5 GB), taking almost 2 orders of magnitude (about x90) longer to run on cpu. Therefore, TATR is still used for its higher throughput. In addition, experimentation does not necessarily show a strict improvement in quality. Contrary to gmft, Unitable may fail first through misalignment because of misplaced html tags. This may impact use cases where alignment is critical.
+
+We invite the reader to explore the [comparison notebooks](https://drive.google.com/drive/u/0/folders/114bWRj5H4aE-BA5UKH9S5ol8LC6vhqfR) to survey your own use cases and compare results.
 
 # Limitations
 
@@ -85,8 +88,6 @@ Multi-indices (multiple column headers) are not yet supported.
 
 Slightly rotated tables will probably fail, especially large tables that are not perfectly level.
 
-
-
 # Acknowledgements
 
 A tremendous thank you to the TATR authors: Brandon Smock, Rohith Pesala, and Robin Abraham, for making gmft possible. The image->csv step is highly inspired by TATR's inference.py code, but has been rewritten for performance.
@@ -97,6 +98,8 @@ Thank you to Niels Rogge for porting TATR to huggingface and writing the [visual
 
 Gmft focuses highly on pdf tables. For more general document understanding, I recommend checking out [open-parse](https://github.com/Filimoa/open-parse), [unstructured](https://github.com/Unstructured-IO/unstructured), [surya](https://github.com/VikParuchuri/surya), [deepdoctection](https://github.com/deepdoctection/deepdoctection), and [DocTR](https://github.com/mindee/doctr).
 
-In particular, open-parse and unstructured also do quite well on the same example pdfs in terms of extraction quality. Open-parse offers Unitable, a larger model which may achieve higher quality but runs much slower on cpu (see [reliability section](#Reliable) for more discussion.) Importantly, open-parse allows extraction of auxiliary information paragraphs, etc., (not just tables) useful for RAG.
+Nougat is excellent in outputting full mathpix markdown (.mmd), which includes latex formulas, bold/italics, and fully latex-typeset tables.
+
+Open-parse and unstructured also do quite well on the same example pdfs in terms of extraction quality. Open-parse offers Unitable, a larger model which may achieve higher quality but runs much slower on cpu (see [reliability section](#Reliable) for more discussion.) Importantly, open-parse allows extraction of auxiliary information paragraphs, etc., (not just tables) useful for RAG.
 
 gmft is released under MIT.

notebooks/quickstart.ipynb

Lines changed: 14127 additions & 0 deletions
Large diffs are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -33,4 +33,4 @@ requires = ["flit_core >=3.4,<4"]
 
 [tool.flit.sdist]
 include = ["gmft/**/*.py", "README.md", "LICENSE"]
-exclude = ["gmft/pdf_bindings_mu.py", "dist/**/*", "notebooks/**/*", "samples/**/*", "reading_list.md", "*.zip"]
+exclude = ["gmft/pdf_bindings_mu.py", "dist/**/*", "notebooks/**/*", "samples/**/*", "plans/**/*", "reading_list.md", "*.zip"]

reading_list.md

Lines changed: 0 additions & 12 deletions
This file was deleted.
