Can I train a model for KIE on a large dataset with the text value but not the coordinates? #14597

belugame · 2025-01-26T21:55:16Z

I need to extract the text from photos and scans of the same kind of document, which has a tabular layout. Only certain fields are of interest to me.

For many existing photos/scans I do have the correct text for each field of interest in a database.

Is that enough to train a model with PaddleOCR that can extract field-value pairs from new scans/photos? Or is a dataset without the coordinates of the values/fields not useful yet?

Any advice on how to proceed would be appreciated :)

GreatV · 2025-01-26T22:05:19Z

Maintainer

No. A dataset that contains only the text values of the fields of interest, without the coordinates of those fields on the images, is not sufficient on its own to train a Key Information Extraction (KIE) model such as those supported by PaddleOCR or similar frameworks. Here's why, and how you can proceed:

Why Coordinates Are Necessary

To train a KIE model, it is essential to have labeled data that provides both:

Text values (field content): The actual content of the fields of interest.
Location/coordinates: The bounding boxes or areas on the image where the corresponding text appears.

PaddleOCR's KIE models do not learn text recognition themselves. They work on OCR output, that is, text regions together with their bounding boxes, and learn to classify each region (for example as a key, a value, or other) and to link keys to their values. Without the spatial information (coordinates), the model cannot learn the relationship between the layout of the document and the text fields, which is critical for structured extraction from tabular or form-based layouts.

How You Can Proceed

Here are a few approaches to make your dataset useful:

1. Manually Annotate a Subset of Data

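Use an annotation tool such as LabelImg, Roboflow, or PaddleOCR's own PPOCRLabel to manually annotate the coordinates of the fields of interest on a subset of your images.
Annotating 1,000–2,000 examples is often enough to fine-tune a pre-trained KIE model, though the number you need depends on how much the layout and image quality vary.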

2. Use Pre-trained Layout Detection Models

Use a pre-trained text detection or document layout analysis model (such as PaddleOCR's detection and layout models) to automatically detect and segment areas of interest in your documents.
Once the regions are identified, map the detected regions to the corresponding text values in your database. This step may require some heuristic rules or manual validation.
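The mapping step can be sketched with fuzzy string matching. This is only a minimal illustration under assumptions: `ocr_results` stands in for whatever (box, text) pairs your OCR pass returns, the field names are made up, and the 0.8 threshold is an arbitrary starting point you would tune on your own data.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so minor OCR noise matters less."""
    return " ".join(text.lower().split())


def match_fields(ocr_results, db_record, threshold=0.8):
    """Give each database field the OCR box whose text is most similar.

    ocr_results: list of (points, text) pairs (assumed format, not a
                 verbatim PaddleOCR result structure)
    db_record:   dict of field name -> known correct value
    Fields with no candidate at or above `threshold` are left out so
    they can be reviewed manually.
    """
    matches = {}
    for field, value in db_record.items():
        best_score, best_points, best_text = 0.0, None, None
        for points, text in ocr_results:
            score = SequenceMatcher(None, normalize(value), normalize(text)).ratio()
            if score > best_score:
                best_score, best_points, best_text = score, points, text
        if best_points is not None and best_score >= threshold:
            matches[field] = {
                "points": best_points,
                "ocr_text": best_text,
                "score": best_score,
            }
    return matches


# Hypothetical OCR output for one scan.
ocr_results = [
    ([[40, 30], [210, 30], [210, 58], [40, 58]], "Invoice No."),
    ([[220, 30], [360, 30], [360, 58], [220, 58]], "INV-2O24-0117"),  # 'O' misread for '0'
    ([[40, 70], [150, 70], [150, 98], [40, 98]], "Total"),
    ([[220, 70], [320, 70], [320, 98], [220, 98]], "1,250.00"),
]
# Hypothetical database row for the same scan.
db_record = {
    "invoice_number": "INV-2024-0117",
    "total": "1,250.00",
    "customer": "ACME GmbH",
}

matches = match_fields(ocr_results, db_record)
for field, hit in matches.items():
    print(field, hit["points"], round(hit["score"], 2))
```

Fields that find no match (here `customer`) are exactly the ones worth sending to manual validation. Because the database value is the trusted one, you would store it as the transcription and keep only the box from the OCR pass.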

3. Synthetic Data Augmentation

If you know the structure of your documents (e.g., fixed tabular layout), you can generate synthetic datasets where you overlay text on blank templates of your document structure.
For these synthetic samples, you will have full control over the text and coordinates, which can then be used for training.

4. Fine-tuning OCR Models with Weak Supervision

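Instead of training a full KIE model, you can fine-tune a text recognition model like en_PP-OCRv3_rec to improve recognition accuracy for your specific field content. Note that recognition training uses cropped text-line images paired with their transcriptions, so the crops still have to come from somewhere, for example from a pre-trained detector whose output is matched against the values in your database.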
Combine this with a rule-based approach or heuristic layout matching for field assignment.
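A rule-based field assignment for a fixed tabular layout could look roughly like the sketch below. It assumes the page fills the image (photos would need cropping and deskewing first), and the `TEMPLATE` regions, field names, and sample boxes are hypothetical values chosen for illustration.

```python
# Hypothetical template: field name -> (x_min, y_min, x_max, y_max),
# expressed as fractions of the page width and height.
TEMPLATE = {
    "invoice_number": (0.50, 0.00, 1.00, 0.15),
    "total": (0.50, 0.15, 1.00, 0.30),
}


def box_center(points, width, height):
    """Return the center of a polygon as relative (x, y) coordinates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs) / width, sum(ys) / len(ys) / height)


def assign_fields(ocr_results, width, height, template=TEMPLATE):
    """Assign each OCR text to the template region containing its center."""
    parts = {}
    for points, text in ocr_results:
        cx, cy = box_center(points, width, height)
        for name, (x0, y0, x1, y1) in template.items():
            if x0 <= cx < x1 and y0 <= cy < y1:
                parts.setdefault(name, []).append((cx, text))
    # If a cell was split into several boxes, join them left to right.
    return {
        name: " ".join(text for _, text in sorted(found))
        for name, found in parts.items()
    }


# Hypothetical OCR output for a 400 x 400 pixel scan.
ocr_results = [
    ([[40, 30], [210, 30], [210, 58], [40, 58]], "Invoice No."),
    ([[220, 30], [360, 30], [360, 58], [220, 58]], "INV-2024-0117"),
    ([[40, 70], [150, 70], [150, 98], [40, 98]], "Total"),
    ([[220, 70], [320, 70], [320, 98], [220, 98]], "1,250.00"),
]

fields = assign_fields(ocr_results, width=400, height=400)
print(fields)
```

Relative coordinates keep the template independent of scan resolution. Text whose center falls outside every region (the printed labels in the left column here) is simply ignored.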

Additional Notes on PaddleOCR

PaddleOCR supports both detection and recognition tasks. For KIE tasks, it typically requires datasets formatted with both detection (bounding box) and recognition annotations. For example:

Detection: Bounding box coordinates for specific fields.
Recognition: Text labels associated with those bounding boxes.
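To make the combination of box and text concrete, here is a small sketch that writes one annotation line. The key names (`transcription`, `label`, `points`, `id`, `linking`) follow the XFUND-style label files used in PaddleOCR's KIE examples as far as I recall them, so treat them as an assumption and check them against the current documentation. The file name, labels, and coordinates are invented.

```python
import json


def to_label_line(image_path, regions):
    """Build one label line: image path, a tab, then a JSON list of regions.

    regions: list of dicts with "text", "label" and "points" keys
             (a hypothetical intermediate format for this sketch).
    """
    annotations = []
    for idx, region in enumerate(regions, start=1):
        annotations.append({
            "transcription": region["text"],   # recognition label
            "label": region["label"],          # entity class, e.g. a field name
            "points": region["points"],        # detection label: 4 corner points
            "id": idx,
            "linking": [],                     # key-value links, unused here
        })
    return image_path + "\t" + json.dumps(annotations, ensure_ascii=False)


regions = [
    {
        "text": "INV-2024-0117",
        "label": "invoice_number",
        "points": [[220, 30], [360, 30], [360, 58], [220, 58]],
    },
    {
        "text": "1,250.00",
        "label": "total",
        "points": [[220, 70], [320, 70], [320, 98], [220, 98]],
    },
]

line = to_label_line("scan_0001.jpg", regions)
print(line)
```

Your database already supplies the `text` and `label` parts. The `points` are what is currently missing.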

If you want to learn more about dataset preparation for PaddleOCR, check the documentation on OCR dataset preparation.

Recommendations

Start by annotating a small portion of your dataset to create a prototype model.
Leverage pre-trained PaddleOCR models for text detection and recognition to bootstrap your annotations.
Investigate other document understanding models such as LayoutLM if your use case involves complex layouts. Tesseract (with layout analysis) is an OCR engine rather than a KIE tool, but it can serve as an alternative source of text and bounding boxes.

If you have more specific requirements or need further clarification, feel free to ask!

Response generated by feifei-bot | chatgpt-4o-latest
