Segmentation and classification of OCR data based on regions of interest.
- If the browser fails to render the Jupyter notebook, please use https://nbviewer.jupyter.org/github/dupree/OCR_text_segmentation/blob/master/data_exploration.ipynb to view it.
The dataset contains invoices that have been processed by an Optical Character Recognition (OCR) system and exported in JSON format. The goal is to build a classification model that extracts accounting information from an invoice. To this end, the first step is to find the relevant items in the dataset.
The train.json and test.json files each contain a set of invoice data in JSON format.
“id” denotes the invoice ID, and “words” is the list of words (generated by OCR) in the current invoice. Each word contains a “value” (text) field; for confidentiality, the text is anonymized, which may slightly affect prediction accuracy.
The bounding box of each word is given in “region” (normalized left/top/width/height coordinates). In addition, “page” (the page number) appears within the bounding box data, as an invoice may span multiple pages. Words of relevance are labelled in “entities”.
- For example, in the sample below, the words "gqeUrQ==" and "dKGlmYCFXw==" are items, whereas "eqCW" is not.
{
"id":456561110,
"words":[
{
"value":"gqeUrQ==",
"region":{
"left":0.27226892,
"top":0.032104637,
"width":0.03529412,
"height":0.0083234245,
"page":"1"
}
},
{
"value":"dKGlmYCFXw==",
"region":{
"left":0.27226892,
"top":0.058263972,
"width":0.0605042,
"height":0.007134364,
"page":"1"
}
},
{
"value":"eqCW",
"region":{
"left":0.33613446,
"top":0.058263972,
"width":0.020168068,
"height":0.007134364,
"page":"1"
}
}
],
"entities":[
{
"metaData":{
"region":{
"page":1
}
},
"label":"item",
"indices":[
0,
1
]
}
]
}
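As an illustration, records shaped like the sample above can be flattened into (features, label) pairs for training. This is a minimal sketch, assuming train.json holds a list of such records; `extract_examples` is a hypothetical helper, not part of the repository:

```python
import json


def extract_examples(record):
    """Pair each word in one invoice record with a binary label:
    1 if its index appears in an "item" entity, else 0."""
    item_indices = set()
    for entity in record.get("entities", []):
        if entity.get("label") == "item":
            item_indices.update(entity.get("indices", []))

    examples = []
    for idx, word in enumerate(record["words"]):
        region = word["region"]
        # Use the normalized bounding-box geometry as features.
        features = [region["left"], region["top"],
                    region["width"], region["height"]]
        examples.append((features, 1 if idx in item_indices else 0))
    return examples


# Usage (assuming train.json is a list of records like the sample):
# with open("train.json") as f:
#     records = json.load(f)
# data = [ex for rec in records for ex in extract_examples(rec)]
```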
The main task is to implement an approach that detects items (words of relevance) in an input invoice.
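One possible baseline, sketched here purely for illustration (it is not the repository's model): since items on invoices often cluster in a consistent layout region, a word can be labelled by whether its bounding-box features lie closer to the centroid of known item words or of non-item words.

```python
class NearestCentroidItemDetector:
    """Hypothetical positional baseline: classify each word by the nearest
    class centroid of its bounding-box features (left, top, width, height)."""

    def fit(self, features, labels):
        def centroid(cls):
            rows = [f for f, l in zip(features, labels) if l == cls]
            # Mean of each feature column for this class.
            return [sum(col) / len(rows) for col in zip(*rows)]

        self.centroids = {0: centroid(0), 1: centroid(1)}
        return self

    def predict(self, features):
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))

        return [min(self.centroids, key=lambda c: sq_dist(f, self.centroids[c]))
                for f in features]
```

A real implementation would likely add layout context (neighboring words, page number) and a stronger model, but the sketch shows the basic fit/predict shape of the task.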
Questions addressed:
- What is a good measure of classification accuracy?
- What are possible shortcomings and extensions of the implementation?
- How can a real-time system be designed to respond efficiently to a high volume of prediction requests?
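On the first question: because most words on an invoice are not items, plain accuracy is misleading (a model predicting "not item" everywhere would still score high). Per-class precision, recall, and F1 are more informative. A small self-contained sketch of these metrics for the positive ("item") class:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label == 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```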