| 1 | +|Deep |Link |Demo |Notebook |Deep?|Reads image?|Detectron?|OCR included?|Seems to work |get pandas df? |get text?|get image?|throughput (cpu)| |
| 2 | +|----------------------------------|---------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|-----|------------|----------|-------------|------------------------------------|-----------------|---------|----------|----------------| |
| 3 | +|nougat |[github](https://github.com/facebookresearch/nougat) | |[Nougat eval](https://colab.research.google.com/drive/1B4agm6hwR-Ia-5AduEU-y7DteNAOxRhX) |✓ |✓ | |✓ |✓✓ |latex table (mmd)|✓ |✗ |~330 s/page | |
| 4 | +|gmft |[github](https://github.com/conjuncts/gmft) | |[gmft eval](https://colab.research.google.com/drive/1fEqsTdKcO5RNPV_b2v9cB4Y5We9Kv-hR) |✓ |✓ | |✗ |✓✓ |✓ |✓ |✓ |~1.867 s/page | |
| 5 | +|img2table |[github](https://github.com/xavctn/img2table) | |[img2table eval](https://colab.research.google.com/drive/1_TD2U0JsaW0SqmuCUv7iSbAyJwvRuq_C) |✗ |✓ | |✓ |✓✓ |✓ |✓ |✓ |~1.45 s/page | |
| 6 | +|unstructured |[docs.unstructured.io](https://docs.unstructured.io/examplecode/codesamples/apioss/table-extraction-from-pdf) | |[Unstructured eval](https://colab.research.google.com/drive/1k8IpVqyCW8DUZ8psRxHPCQSnE3XZBuOd) |✓ |✓ |✓ |✓ |✓ |✓ (html -> df) |✓ |? |~15.35 s/page | |
| 7 | +|open-parse (unitable) |[github](https://github.com/Filimoa/open-parse) |[openparse_quickstart.ipynb](https://colab.research.google.com/drive/1Z5B5gsnmhFKEFL-5yYIcoox7-jQao8Ep) |[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)|✓ |✓ | | |✓ |✓ (html -> df) |✓ |✓ (custom)|~126 s/page | |
| 8 | +|open-parse (tatr) |[github](https://github.com/Filimoa/open-parse) | |[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)|✓ |✓ | | |✓ |✓ (html -> df) |✓ |✓ (custom)|~4.992 s/page | |
| 9 | +|open-parse (pymupdf) |[github](https://github.com/Filimoa/open-parse) | |[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)|✗ |✗ | | |✗ | | |✓ (custom)|~0.67 s/page | |
| 10 | +|deepdoctection, tatr              |[github](https://github.com/deepdoctection/deepdoctection)                                                                 |                                                                                                                 |[deepdoctection tatr eval](https://colab.research.google.com/drive/19c7uMC0Ya2tfZw1r2itstmuX2wxun86L)                  |✓    |✓           |✓         |✓            |✗ needs config                      |                 |         |?         |~58 s/page      | |
| 11 | +|surya |[github](https://github.com/VikParuchuri/surya) | |[surya eval](https://colab.research.google.com/drive/1LUqEIiiGt0EDK3jrypWQJKrrXW3nA9ty?usp=drive_link) |✓ |✓ | |✓ |✓ |✗ |✗ |✓ |~60.679 s/page | |
| 12 | +|paddleocr |[github](https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md) | |https://medium.com/@malshanCS/automating-table-data-extraction-tools-and-techniques-for-efficiency-a29df313cbda#629d |✓ |✓ | | |? | | | | | |
| 13 | +|alibaba/omniparser |[github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/OmniParser) | | |✓ |✓ | | |? | | | | | |
| 14 | +|alibaba/DocXChain |[github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain) | | |✓ |✓ | | |? | | | | | |
| 15 | +|layoutparser (no commit in 2 yrs?)|[github](https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb)|https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb| |✓ |✓ |✓ | |unmaintained | | | | | |
| 16 | +| | | | | | | | | | | | | | |
| 17 | +|doctr (not tbl focused) |[github](https://github.com/mindee/doctr) |https://huggingface.co/spaces/mindee/doctr | |✓ |✓ | | |N/A |N/A | | | | |
| 18 | +| | | | | | | | | | | | | | |
| 19 | +|Non-deep | | | | | | | | | | | | | |
| 20 | +|camelot |[github](https://github.com/camelot-dev/camelot) | |[camelot eval](https://colab.research.google.com/drive/1ORQPURWJuLvTOeboU0-t4Xg9t6iqTIPO) |✗ | | | |✓ many false positives, needs config|✓ |✓ |possible |~1.82 s/page | |
| 21 | +|pdfplumber |[github](https://github.com/jsvine/pdfplumber) | |[pdfplumber eval](https://colab.research.google.com/drive/1DUmd_Sjzhp4ZrltxvXV0-F3fiBQhE8a6) |✗ | | | |✗ or needs config | | |possible |~0.273 s/page | |
| 22 | +|pymupdf |[github](https://github.com/pymupdf/PyMuPDF) | |[pymupdf eval](https://colab.research.google.com/drive/1ZBrAwrfOgDewXhyfDl5xN7mbGUM4idhW) |✗ | | | |✗ or needs config | | |possible |~0.250 s/page | |
| 23 | +|pdfminer |[github](https://github.com/pdfminer/pdfminer.six) | | |✗ | | | | | | | | | |
| 24 | +|Proprietary | | | | | | | | | | | | | |
| 25 | +|mathpix | | | |✓ | | | |✓ | | | | | |
| 26 | +|Adobe Sensei |[developer.adobe.com](https://developer.adobe.com/document-services/apis/pdf-extract/) | | |✓ | | | |✓ | | | | | |
| 27 | +|AWS Textract                      |                                                                                                                           |                                                                                                                 |                                                                                                                       |✓    |            |          |             |✓                                   |                 |         |          |                | |
| 28 | +|Azure Document Intelligence |[azure.microsoft.com](https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/) | | |✓ | | | |✓ | | | | | |
| 29 | +|Google Document AI |[cloud.google.com](https://cloud.google.com/document-ai?hl=en) | | |✓ | | | |✓ | | | | | |