Segmentation and classification of OCR data based on regions of interest.

dupree/OCR_text_segmentation

OCR_text_segmentation

Segmentation and classification of OCR data based on regions of interest.

Problem Description

The dataset contains invoices that have been processed by an Optical Character Recognition (OCR) system and output in JSON format. The goal is to build a classification model that extracts accounting information from an invoice. The first step toward that goal is to find the relevant items in the dataset.

About Dataset

The train.json and test.json files each contain a set of invoices in JSON format.

“id” denotes the invoice ID, and “words” is the list of words (generated by OCR) in the current invoice. Each word carries its text in “value”. For confidentiality, the text is anonymized, which might slightly affect prediction accuracy.

The bounding box of each word is given in “region” as “left”, “top”, “width”, and “height”. Because an invoice may span multiple pages, “region” also includes “page” (the page number), listed after the bounding box fields. Relevant words are labelled in “entities”, where each entity has a “label” and a list of “indices” pointing into “words”.

  • For example, in the sample below, the words "gqeUrQ==" and "dKGlmYCFXw==" are items, whereas "eqCW" is not.
{  
   "id":456561110,
   "words":[  
      {  
         "value":"gqeUrQ==",
         "region":{  
            "left":0.27226892,
            "top":0.032104637,
            "width":0.03529412,
            "height":0.0083234245,
            "page":"1"
         }
      },
      {  
         "value":"dKGlmYCFXw==",
         "region":{  
            "left":0.27226892,
            "top":0.058263972,
            "width":0.0605042,
            "height":0.007134364,
            "page":"1"
         }
      },
      {  
         "value":"eqCW",
         "region":{  
            "left":0.33613446,
            "top":0.058263972,
            "width":0.020168068,
            "height":0.007134364,
            "page":"1"
         }
      }
   ],
   "entities":[  
      {  
         "metaData":{  
            "region":{  
               "page":1
            }
         },
         "label":"item",
         "indices":[  
            0,
            1
         ]
      }
   ]
}
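
The labelling scheme in the sample can be turned into one binary label per word. The following is a minimal sketch, assuming that each entity's "indices" list holds zero-based positions in "words" (consistent with the sample above, where indices 0 and 1 correspond to the two item words). The function name `word_labels` is hypothetical and not part of this repository, and the regions are omitted from the embedded sample for brevity.

```python
import json

# Trimmed copy of the sample invoice above (regions omitted for brevity).
SAMPLE = """
{
  "id": 456561110,
  "words": [
    {"value": "gqeUrQ=="},
    {"value": "dKGlmYCFXw=="},
    {"value": "eqCW"}
  ],
  "entities": [
    {"label": "item", "indices": [0, 1]}
  ]
}
"""


def word_labels(invoice, target="item"):
    """Return one 0/1 label per word: 1 if the word is listed in a `target` entity."""
    labels = [0] * len(invoice["words"])
    for entity in invoice.get("entities", []):
        if entity.get("label") != target:
            continue
        # Assumption: indices are zero-based positions in invoice["words"].
        for i in entity.get("indices", []):
            labels[i] = 1
    return labels


invoice = json.loads(SAMPLE)
labels = word_labels(invoice)
for word, label in zip(invoice["words"], labels):
    print(word["value"], label)
```

With labels in this form, detecting items becomes a per-word binary classification problem, which is one possible framing of the task rather than the only one.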

Task

The main task is to implement an approach that detects the relevant items in an input invoice.

Questions addressed:

  1. What is a good measure of classification accuracy?
  2. What are the possible shortcomings and extensions of the implementation?
  3. How can a real-time system be designed to respond efficiently to a high volume of prediction requests?
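
For the first question, one common option is to report precision, recall, and F1 for the item class instead of plain accuracy, which can look misleadingly high if item words are a small share of all words. The sketch below is illustrative only: this README does not state which metric the implementation uses, and the function name `precision_recall_f1` and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy per-word labels: 1 = item, 0 = not an item.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

In this toy case there is one true positive, one false positive, and one false negative, so all three scores come out to 0.5.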
