invoice-data-extraction

Uses machine learning and deep learning to extract information from documents, mostly invoices, on Windows.

Prerequisites

Python 3.4+ installed globally.

Git LFS - for downloading the model weights
Anaconda - for setting up the environment

Add conda to the PATH, or open the project from the Anaconda Prompt. Anaconda is a heavy install, so you can use Miniconda instead.

If you choose Miniconda, install it, then add conda to the PATH or open the project from the Miniconda prompt.

Installing Tesseract OCR:

Install the latest version of Tesseract OCR on the C: drive and add its path (C:\Program Files\Tesseract-OCR) to both the System and User environment variables in Windows. Download the additional eng_layer.traineddata file and add it to C:\Program Files\Tesseract-OCR\tessdata

Install poppler
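
Before moving on, it can help to confirm that the prerequisites are reachable. The following is a minimal sketch, not part of the repository: the executable names (tesseract, pdftoppm from Poppler, git-lfs, conda) and the idea that all of them must be on the PATH are assumptions, and the helper names missing_tools and has_traineddata are hypothetical.

```python
import shutil
from pathlib import Path

# Assumed executable names: "pdftoppm" is one of Poppler's utilities,
# "git-lfs" is the Git LFS binary. Adjust the list to match your setup.
REQUIRED_TOOLS = ["tesseract", "pdftoppm", "git-lfs", "conda"]

# Directory taken from the Tesseract instructions above.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_traineddata(tessdata_dir=TESSDATA_DIR, name="eng_layer.traineddata"):
    """Return True if the extra traineddata file is in the tessdata folder."""
    return (Path(tessdata_dir) / name).is_file()


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"Not found on PATH: {tool}")
    if not has_traineddata():
        print(f"eng_layer.traineddata not found in {TESSDATA_DIR}")
```

If the script prints nothing, every listed tool was found and the traineddata file is in place.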

Running the Code.

Clone the repository or download the ZIP file from GitHub:

 git clone https://github.com/abhayhk2001/document-data-extraction

Open a terminal window in the folder that contains the downloaded code. Create a conda environment from the yml file and activate it as follows:

conda env create -f env.yml
conda activate data-extraction

Add the invoice to the examples subfolder.
To run the application:

 python main.py --file [filename relative path]

Example

 python main.py --file examples\airtel_june_2012.pdf
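
To process several invoices in one go, the command above can be repeated from a small script. This is a sketch under assumptions: main.py is only documented as taking a single --file argument, so the loop calls it once per PDF, and the helper names build_command and run_all are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf_path):
    """Build the same command line shown above for one invoice."""
    return [sys.executable, "main.py", "--file", str(pdf_path)]


def run_all(folder="examples"):
    """Run main.py once per PDF in the folder, stopping at the first failure."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_command(pdf_path), check=True)
```

Call run_all() from the repository root with the data-extraction environment activated, so that sys.executable points at that environment's Python.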

Results are stored in results.txt and table.csv within runs/detect/exp* directories.
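
Because each run writes to a new exp* directory, a short helper can locate the newest one and load the table. This is a minimal sketch, assuming that the most recently modified directory is the latest run and that table.csv is a comma-separated UTF-8 file; the function names latest_run_dir and read_table are hypothetical.

```python
import csv
from pathlib import Path


def latest_run_dir(root="runs/detect"):
    """Return the most recently modified exp* directory, or None if there is none."""
    run_dirs = [p for p in Path(root).glob("exp*") if p.is_dir()]
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


def read_table(run_dir):
    """Read table.csv from a run directory as a list of rows."""
    with open(Path(run_dir) / "table.csv", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

For example, read_table(latest_run_dir()) returns the rows of the newest table.csv once at least one run has completed.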
