Segmentation and classification of OCR data based on regions of interest.
- If the browser fails to render the Jupyter notebook, please use https://nbviewer.jupyter.org/github/dupree/OCR_text_segmentation/blob/master/data_exploration.ipynb to view it.
The dataset contains invoices that have been processed by an Optical Character Recognition (OCR) system and exported in JSON format. The goal is to build a classification model that extracts accounting information from an invoice. To this end, the first step is to find the relevant items in the dataset.
The train.json and test.json files each contain a set of invoice data in JSON format.
“id” denotes the invoice ID, and “words” is the list of words (generated by OCR) in the current invoice. Each word contains a “value” (text) field; for confidentiality, the text is anonymized, which may slightly affect prediction accuracy.
The bounding box of each word is given in “region” (normalized left/top/width/height coordinates). In addition, “page” (the page number) appears within the bounding box data, as an invoice may span multiple pages. Words of relevance are labelled in “entities”.
- For example, in the sample below, the words "gqeUrQ==" and "dKGlmYCFXw==" are items, whereas "eqCW" is not.
{
"id":456561110,
"words":[
{
"value":"gqeUrQ==",
"region":{
"left":0.27226892,
"top":0.032104637,
"width":0.03529412,
"height":0.0083234245,
"page":"1"
}
},
{
"value":"dKGlmYCFXw==",
"region":{
"left":0.27226892,
"top":0.058263972,
"width":0.0605042,
"height":0.007134364,
"page":"1"
}
},
{
"value":"eqCW",
"region":{
"left":0.33613446,
"top":0.058263972,
"width":0.020168068,
"height":0.007134364,
"page":"1"
}
}
],
"entities":[
{
"metaData":{
"region":{
"page":1
}
},
"label":"item",
"indices":[
0,
1
]
}
]
}
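As an illustration, records shaped like the sample above can be flattened into (features, label) pairs for training. This is a minimal sketch, assuming train.json holds a list of such records; `extract_examples` is a hypothetical helper, not part of the repository:

```python
import json


def extract_examples(record):
    """Pair each word in one invoice record with a binary label:
    1 if its index appears in an "item" entity, else 0."""
    item_indices = set()
    for entity in record.get("entities", []):
        if entity.get("label") == "item":
            item_indices.update(entity.get("indices", []))

    examples = []
    for idx, word in enumerate(record["words"]):
        region = word["region"]
        # Use the normalized bounding-box geometry as features.
        features = [region["left"], region["top"],
                    region["width"], region["height"]]
        examples.append((features, 1 if idx in item_indices else 0))
    return examples


# Usage (assuming train.json is a list of records like the sample):
# with open("train.json") as f:
#     records = json.load(f)
# data = [ex for rec in records for ex in extract_examples(rec)]
```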
The main task is to implement an approach that detects items (words of relevance) in an input invoice.
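One possible baseline, sketched here purely for illustration (it is not the repository's model): since items on invoices often cluster in a consistent layout region, a word can be labelled by whether its bounding-box features lie closer to the centroid of known item words or of non-item words.

```python
class NearestCentroidItemDetector:
    """Hypothetical positional baseline: classify each word by the nearest
    class centroid of its bounding-box features (left, top, width, height)."""

    def fit(self, features, labels):
        def centroid(cls):
            rows = [f for f, l in zip(features, labels) if l == cls]
            # Mean of each feature column for this class.
            return [sum(col) / len(rows) for col in zip(*rows)]

        self.centroids = {0: centroid(0), 1: centroid(1)}
        return self

    def predict(self, features):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [min(self.centroids, key=lambda c: sq_dist(f, self.centroids[c]))
                for f in features]
```

A real implementation would likely add layout context (neighboring words, page number) and a stronger model, but the sketch shows the basic fit/predict shape of the task.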
Questions addressed:
- What is a good measure of classification accuracy?
- What are possible shortcomings and extensions of the implementation?
- How can a real-time system be designed to respond efficiently to a high volume of prediction requests?
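On the first question: because most words on an invoice are not items, plain accuracy is misleading (a model predicting "not item" everywhere would still score high). Per-class precision, recall, and F1 are more informative. A small self-contained sketch of these metrics for the positive ("item") class:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label == 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```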