add matplotlib

conjuncts · conjuncts · commit 7f445fac3546 · 2024-06-11T09:12:18.000-05:00
diff --git a/comparison.md b/comparison.md
@@ -0,0 +1,29 @@
+|Deep                              |Link                                                                                                                       |Demo                                                                                                             |Notebook                                                                                                               |Deep?|Reads image?|Detectron?|OCR included?|Seems to work                       |get pandas df?   |get text?|get image?|throughput (cpu)|
+|----------------------------------|---------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|-----|------------|----------|-------------|------------------------------------|-----------------|---------|----------|----------------|
+|nougat                            |[github](https://github.com/facebookresearch/nougat)                                                                       |                                                                                                                 |[Nougat eval](https://colab.research.google.com/drive/1B4agm6hwR-Ia-5AduEU-y7DteNAOxRhX)                               |✓    |✓           |          |✓            |✓✓                                  |latex table (mmd)|✓        |✗         |~330 s/page     |
+|gmft                              |[github](https://github.com/conjuncts/gmft)                                                                                |                                                                                                                 |[gmft eval](https://colab.research.google.com/drive/1fEqsTdKcO5RNPV_b2v9cB4Y5We9Kv-hR)                                 |✓    |✓           |          |✗            |✓✓                                  |✓                |✓        |✓         |~1.867 s/page   |
+|img2table                         |[github](https://github.com/xavctn/img2table)                                                                              |                                                                                                                 |[img2table eval](https://colab.research.google.com/drive/1_TD2U0JsaW0SqmuCUv7iSbAyJwvRuq_C)                            |✗    |✓           |          |✓            |✓✓                                  |✓                |✓        |✓         |~1.45 s/page    |
+|unstructured                      |[docs.unstructured.io](https://docs.unstructured.io/examplecode/codesamples/apioss/table-extraction-from-pdf)              |                                                                                                                 |[Unstructured eval](https://colab.research.google.com/drive/1k8IpVqyCW8DUZ8psRxHPCQSnE3XZBuOd)                         |✓    |✓           |✓         |✓            |✓                                   |✓ (html -> df)   |✓        |?         |~15.35 s/page   |
+|open-parse (unitable)             |[github](https://github.com/Filimoa/open-parse)                                                                            |[openparse_quickstart.ipynb](https://colab.research.google.com/drive/1Z5B5gsnmhFKEFL-5yYIcoox7-jQao8Ep)          |[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)|✓    |✓           |          |             |✓                                   |✓ (html -> df)   |✓        |✓ (custom)|~126 s/page     |
+|open-parse (tatr)                 |[github](https://github.com/Filimoa/open-parse)                                                                            |                                                                                                                 |[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)|✓    |✓           |          |             |✓                                   |✓ (html -> df)   |✓        |✓ (custom)|~4.992 s/page   |
+|open-parse (pymupdf)              |[github](https://github.com/Filimoa/open-parse)                                                                            |                                                                                                                 |[open-parse eval](https://colab.research.google.com/drive/18r_0vfxbD-RsCqIcQE3Lo_nF4Lsh_z2s?ouid=110924231912857331758)|✗    |✗           |          |             |✗                                   |                 |         |✓ (custom)|~0.67 s/page    |
+|deepdoctection, tatr              |[github](https://github.com/deepdoctection/deepdoctection)                                                                 |                                                                                                                 |[deepdoctection tatr eval](https://colab.research.google.com/drive/19c7uMC0Ya2tfZw1r2itstmuX2wxun86L)                  |✓    |✓           |✓         |✓            |✗ needs config                      |                 |         |?         |~58s per page   |
+|surya                             |[github](https://github.com/VikParuchuri/surya)                                                                            |                                                                                                                 |[surya eval](https://colab.research.google.com/drive/1LUqEIiiGt0EDK3jrypWQJKrrXW3nA9ty?usp=drive_link)                 |✓    |✓           |          |✓            |✓                                   |✗                |✗        |✓         |~60.679 s/page  |
+|paddleocr                         |[github](https://github.com/PaddlePaddle/PaddleOCR/blob/main/README_en.md)                                                 |                                                                                                                 |https://medium.com/@malshanCS/automating-table-data-extraction-tools-and-techniques-for-efficiency-a29df313cbda#629d   |✓    |✓           |          |             |?                                   |                 |         |          |                |
+|alibaba/omniparser                |[github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/OmniParser)                            |                                                                                                                 |                                                                                                                       |✓    |✓           |          |             |?                                   |                 |         |          |                |
+|alibaba/DocXChain                 |[github](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain)                    |                                                                                                                 |                                                                                                                       |✓    |✓           |          |             |?                                   |                 |         |          |                |
+|layoutparser (no commit in 2 yrs?)|[github](https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb)|https://github.com/Layout-Parser/layout-parser/blob/main/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb|                                                                                                                       |✓    |✓           |✓         |             |unmaintained                        |                 |         |          |                |
+|                                  |                                                                                                                           |                                                                                                                 |                                                                                                                       |     |            |          |             |                                    |                 |         |          |                |
+|doctr (not tbl focused)           |[github](https://github.com/mindee/doctr)                                                                                  |https://huggingface.co/spaces/mindee/doctr                                                                       |                                                                                                                       |✓    |✓           |          |             |N/A                                 |N/A              |         |          |                |
+|                                  |                                                                                                                           |                                                                                                                 |                                                                                                                       |     |            |          |             |                                    |                 |         |          |                |
+|Non-deep                          |                                                                                                                           |                                                                                                                 |                                                                                                                       |     |            |          |             |                                    |                 |         |          |                |
+|camelot                           |[github](https://github.com/camelot-dev/camelot)                                                                           |                                                                                                                 |[camelot eval](https://colab.research.google.com/drive/1ORQPURWJuLvTOeboU0-t4Xg9t6iqTIPO)                              |✗    |            |          |             |✓ many false positives, needs config|✓                |✓        |possible  |~1.82 s/page    |
+|pdfplumber                        |[github](https://github.com/jsvine/pdfplumber)                                                                             |                                                                                                                 |[pdfplumber eval](https://colab.research.google.com/drive/1DUmd_Sjzhp4ZrltxvXV0-F3fiBQhE8a6)                           |✗    |            |          |             |✗ or needs config                   |                 |         |possible  |~0.273 s/page   |
+|pymupdf                           |[github](https://github.com/pymupdf/PyMuPDF)                                                                               |                                                                                                                 |[pymupdf eval](https://colab.research.google.com/drive/1ZBrAwrfOgDewXhyfDl5xN7mbGUM4idhW)                              |✗    |            |          |             |✗ or needs config                   |                 |         |possible  |~0.250 s/page   |
+|pdfminer                          |[github](https://github.com/pdfminer/pdfminer.six)                                                                         |                                                                                                                 |                                                                                                                       |✗    |            |          |             |                                    |                 |         |          |                |
+|Proprietary                       |                                                                                                                           |                                                                                                                 |                                                                                                                       |     |            |          |             |                                    |                 |         |          |                |
+|mathpix                           |                                                                                                                           |                                                                                                                 |                                                                                                                       |✓    |            |          |             |✓                                   |                 |         |          |                |
+|Adobe Sensei                      |[developer.adobe.com](https://developer.adobe.com/document-services/apis/pdf-extract/)                                     |                                                                                                                 |                                                                                                                       |✓    |            |          |             |✓                                   |                 |         |          |                |
+|AWS TextExtract                   |                                                                                                                           |                                                                                                                 |                                                                                                                       |✓    |            |          |             |✓                                   |                 |         |          |                |
+|Azure Document Intelligence       |[azure.microsoft.com](https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/)                         |                                                                                                                 |                                                                                                                       |✓    |            |          |             |✓                                   |                 |         |          |                |
+|Google Document AI                |[cloud.google.com](https://cloud.google.com/document-ai?hl=en)                                                             |                                                                                                                 |                                                                                                                       |✓    |            |          |             |✓                                   |                 |         |          |                |
diff --git a/pyproject.toml b/pyproject.toml
@@ -20,6 +20,7 @@ dependencies = [
   "timm",
   "pillow",
   "pandas",
+  "matplotlib",
   
 ]
 
diff --git a/requirements.txt b/requirements.txt
@@ -1,5 +1,6 @@
 pypdfium2
 transformers[torch]
 timm
+matplotlib
 pillow
 pandas
