No. A dataset that contains only the text values for the fields of interest, without the coordinates of those fields on the images, is not sufficient to train a Key Information Extraction (KIE) model like those supported by PaddleOCR or similar frameworks. Here's why, and how you can proceed:

Why Coordinates Are Necessary
To train a KIE model, you need labeled data that provides both the text of each field and its location (bounding-box coordinates) on the image.
For models like PaddleOCR's KIE pipeline, training involves learning not only text recognition but also how to associate specific regions of the image with the correct text (field pairing). Without the spatial information (coordinates), the model cannot learn the relationship between the layout of the document and the text fields, which is critical for structured extraction from tabular or form-based layouts.

How You Can Proceed
Here are a few approaches to make your dataset useful:
1. Manually Annotate a Subset of Data
2. Use Pre-trained Layout Detection Models
3. Synthetic Data Augmentation
4. Fine-tuning OCR Models with Weak Supervision
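One way to combine approaches 2 and 4 is to run an existing OCR model over the archived scans and fuzzy-match each recognized text line against the values already stored in the database, so that the coordinates come from the OCR output and the field names come from the database. The following is a minimal sketch of that matching step only, under stated assumptions: `ocr_results` stands in for whatever (box, text) pairs your OCR engine returns, and `match_fields`, the field names, and the 0.8 threshold are hypothetical choices for illustration, not part of PaddleOCR.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, in the range 0..1."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_fields(ocr_results, known_fields, threshold=0.8):
    """Assign each known database value to the most similar OCR text line.

    ocr_results:  list of (box, text) pairs, where box is a list of 4 [x, y] points.
    known_fields: dict mapping a field name to its known text value.
    Returns a dict mapping field name -> (box, ocr_text, score) for every field
    whose best match scores at least `threshold`.
    """
    matches = {}
    for name, value in known_fields.items():
        best = None
        for box, text in ocr_results:
            score = similarity(text, value)
            if best is None or score > best[2]:
                best = (box, text, score)
        if best is not None and best[2] >= threshold:
            matches[name] = best
    return matches


# Hypothetical OCR output for one scan (note the misread "O" in the date).
ocr_results = [
    ([[10, 10], [120, 10], [120, 30], [10, 30]], "Invoice No."),
    ([[130, 10], [220, 10], [220, 30], [130, 30]], "INV-20231"),
    ([[130, 40], [230, 40], [230, 60], [130, 60]], "2O23-05-17"),
]
# Known values for the same scan, as stored in the database.
known_fields = {
    "invoice_number": "INV-20231",
    "date": "2023-05-17",
    "total": "149.90",
}

matches = match_fields(ocr_results, known_fields)
for name, (box, text, score) in matches.items():
    print(name, box, repr(text), round(score, 2))
```

Fields with no sufficiently similar OCR line (here `total`) are simply left out, which makes them easy to route to manual annotation. Labels produced this way are noisy, so a manual spot check of a sample is still advisable before training on them.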

Additional Notes on PaddleOCR
PaddleOCR supports both detection and recognition tasks. For KIE tasks, it typically requires datasets formatted with both detection (bounding box) and recognition annotations. For example:
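To illustrate what such a combined annotation can look like, here is a small sketch that writes one label line per image: the image path, a tab, and a JSON list with one entry per text region. The key names (`transcription`, `label`, `points`, `id`, `linking`) follow the XFUND-style layout as I recall it from PaddleOCR's KIE dataset description, so treat them as an assumption and verify them against the current documentation. The helper `make_label_line`, the file path, and the label values are hypothetical.

```python
import json


def make_label_line(image_path, entries):
    """Build one tab-separated label line for a single image.

    entries: list of (text, label, points) tuples, where points is a list of
             four [x, y] corners of the text region.
    """
    annotations = []
    for idx, (text, label, points) in enumerate(entries, start=1):
        annotations.append({
            "transcription": text,  # recognized / ground-truth text
            "label": label,         # semantic class of the region (assumed names)
            "points": points,       # four corner coordinates of the box
            "id": idx,              # region id, unique within the image
            "linking": [],          # key-value links; left empty in this sketch
        })
    # ensure_ascii=False keeps non-ASCII text readable in the label file.
    return f"{image_path}\t{json.dumps(annotations, ensure_ascii=False)}"


# Hypothetical example: one key region and one value region on a scan.
line = make_label_line(
    "scans/doc_0001.jpg",
    [
        ("Invoice No.", "question", [[10, 10], [120, 10], [120, 30], [10, 30]]),
        ("INV-20231", "answer", [[130, 10], [220, 10], [220, 30], [130, 30]]),
    ],
)
print(line)
```

Writing one such line per image to a text file gives a label file that pairs every text value with its coordinates, which is exactly the information missing from a text-only database export.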
If you want to learn more about dataset preparation for PaddleOCR, check the documentation on OCR dataset preparation.

Recommendations
If you have more specific requirements or need further clarification, feel free to ask!

Response generated by feifei-bot | chatgpt-4o-latest
I need to extract the text from photos and scans of a single type of document that has a tabular layout. Only certain fields are of interest to me.
For many existing photos/scans I do have the correct text for each field of interest in a database.
Is that enough to train a model with PaddleOCR that can extract field-value pairs from new scans/photos? Or is a dataset without the coordinates of the values/fields not useful yet?
Any advice on how to proceed would be appreciated :)