Chinese OCR help #14507

luiliRo · 2025-01-07T02:27:32Z


Hi,

I have several images of Chinese text that I have been trying to OCR with PaddleOCR. It was working well, but I noticed that it misses quotation marks ("") and ellipses, so instead of the text below, I get:

Then I noticed that some characters get cut off, so 几 gets "cut" into 儿. I somewhat managed to fix these two issues with image preprocessing.

While proofreading, I noticed one part:

This specific character, 凪, would not get OCR'd no matter what. It is instead replaced by 正, as in 然的正音.

The text is from a Chinese story with Japanese themes, such as Japanese-style names.

So I am wondering if someone might know how to OCR this properly, with quotation marks, periods, and ellipses fully captured, and without getting the wrong Chinese characters.

For more context, the base image I use has been upscaled, and the upscale is quite good: it did not bleed or smear.

Here is the code I have been using so far:

```
from paddleocr import PaddleOCR, draw_ocr
from PIL import Image

# Initialize PaddleOCR with PP-OCRv3 models for Chinese language
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    ocr_version='PP-OCRv3',
)
# Path to your image file
img_path = "PATH"  # placeholder: replace with the path to the image

image = Image.open(img_path).convert('L')  # grayscale
threshold_value = 200
# Anything above 200 => white; everything else => black
binary_image = image.point(lambda p: 255 if p > threshold_value else 0, '1')

binary_image.save("preprocessed_binary.png")

result = ocr.ocr("preprocessed_binary.png", cls=True)

# Extract recognized text and save to a .txt file
with open('recognized_text.txt', 'w', encoding='utf-8') as f:
    for res in result:
        for line in res:
            text = line[1][0]
            confidence = line[1][1]
            f.write(f"{text}\n")

# Draw OCR results on the image
image = Image.open("preprocessed_binary.png").convert('RGB')
boxes = [line[0] for line in result[0]]
txts = [line[1][0] for line in result[0]]
scores = [line[1][1] for line in result[0]]
im_show = draw_ocr(image, boxes, txts, scores, font_path='./simfang.ttf')
im_show = Image.fromarray(im_show)

# Save the result image
im_show.save('result.jpg')
```
I am wondering if anyone knows how to reliably capture quotations and

GreatV · 2025-01-07T02:37:49Z

Maintainer

To address the issues you're encountering with Chinese OCR using PaddleOCR, let's break down the problems and discuss possible solutions:

1. Issue with Missing Quotations ("") and Ellipses (...):

Cause: PaddleOCR may not handle certain punctuation marks like quotations and ellipses effectively due to font rendering, input resolution, or preprocessing inconsistencies.
Solution:
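- Recognition Dictionary: Check that the recognition dictionary used by the model (e.g., PP-OCRv3) includes the Chinese punctuation you need. The dictionary file is set through rec_char_dict_path. Keep in mind that a pretrained recognition model can only output the characters in the dictionary it was trained with, so a modified dictionary needs a model that was trained or fine-tuned against it; pointing a pretrained model at a different dictionary does not add characters and can misalign the existing ones.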
  - Example:
```
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',
    rec_char_dict_path='./path_to_custom_dict.txt',
)
```
    In the custom dictionary file (path_to_custom_dict.txt), list one character per line and include additional characters such as “, ”, and ….
- Preprocessing Adjustments:
  - Ensure that the image resolution is high enough for small punctuation marks to be clearly visible.
  - Choose binarization thresholds and contrast enhancement carefully so that small marks such as ellipsis dots are not merged or erased.
- Alternative OCR Tools: If PaddleOCR struggles, consider combining it with another OCR engine (e.g., Tesseract) to cross-check results for punctuation marks.
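
Before changing anything else, it can help to confirm whether the missing marks are in the dictionary at all. The following is a minimal sketch, assuming the dictionary is a plain text file with one character per line (the layout of PaddleOCR's bundled dictionaries); `missing_from_dict` is a hypothetical helper, not part of PaddleOCR.

```python
def missing_from_dict(dict_path: str, needed: str) -> list[str]:
    """Return the characters in `needed` that the dictionary file does not list."""
    # Assumption: one dictionary entry per line, UTF-8 encoded.
    with open(dict_path, encoding="utf-8") as f:
        entries = {line.rstrip("\n") for line in f}
    return [ch for ch in needed if ch not in entries]
```

For example, `missing_from_dict("ppocr_keys_v1.txt", "“”…凪")` (the path is an assumption about where the Chinese dictionary lives locally) lists which of those four characters the model could never output.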

2. Incorrect Recognition of Characters (e.g., 凪 misrecognized as 正):

Cause:
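- Characters like 凪 are rare in Chinese text but appear in Japanese-themed content (凪 is a kanji coined in Japan). PaddleOCR's Chinese model (lang='ch') is trained primarily on common Chinese characters, so its dictionary may not include rare characters or Japan-specific kanji. In that case the model can only output a character it does know, which is likely why a visually similar one such as 正 appears instead.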
Solution:
- Use Multi-Language Model:
  - PaddleOCR supports Japanese (lang='japan') and mixed-language models. For texts with Japanese themes, try switching to the Japanese recognition model or a mixed dataset.
  - Example:
```
ocr = PaddleOCR(use_angle_cls=True, lang='japan')
```
  - Alternatively, set lang='chinese_cht' to include traditional Chinese characters, which may overlap with Japanese Kanji.
- Fine-Tuning the Model:
  - Fine-tune PaddleOCR's recognition model with additional data that includes less common characters like 凪. This requires creating a dataset with such characters and retraining the recognition model.
  - Refer to PaddleOCR's fine-tuning guide.
- Cross-Validate with Tesseract: Tesseract OCR with the chi_sim or jpn language packs may better recognize rare or Japanese characters.
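
One way to combine two engines or two language models is to keep the Chinese model's text as the base and borrow only specific rare characters from a second reading of the same line. The following is a rough sketch using Python's standard difflib; `patch_rare_chars` and the watch-list are hypothetical, and the second string in the usage note is an illustrative stand-in for what a Japanese model might return.

```python
from difflib import SequenceMatcher


def patch_rare_chars(base: str, alt: str, rare_chars: set[str]) -> str:
    """Copy watch-listed characters from `alt` into `base` where the two readings differ."""
    out = []
    matcher = SequenceMatcher(None, base, alt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same-length disagreement: compare position by position.
            for b, a in zip(base[i1:i2], alt[j1:j2]):
                out.append(a if a in rare_chars else b)
        else:
            # Equal spans, deletions, and insertions: keep the base reading.
            out.append(base[i1:i2])
    return "".join(out)
```

With the example from this thread, `patch_rare_chars("然的正音", "然の凪音", {"凪"})` keeps 的 from the Chinese reading and takes only 凪 from the other one. The watch-list keeps the second model's mistakes on ordinary Chinese characters out of the result.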

3. Characters Breaking into Subcomponents (e.g., 几 → 儿):

Cause:
- This issue often arises due to incorrect image scaling, binarization thresholds, or model limitations.
Solution:
- Image Preprocessing:
  - Use higher-quality upscaling methods (e.g., ESRGAN or waifu2x) to preserve character structure.
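  - Avoid aggressive binarization thresholds. For instance, try skipping the fixed-value threshold and normalizing contrast on the grayscale image instead. Note that ImageOps.autocontrast stretches the contrast range; it is not adaptive thresholding, and the saved image stays grayscale:
```
from PIL import Image, ImageOps

img_path = 'path_to_image.png'
image = Image.open(img_path).convert('L')  # Grayscale
binary_image = ImageOps.autocontrast(image)  # Contrast stretched, not thresholded
binary_image.save("preprocessed_binary.png")
```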
- Model Adjustment:
  - Ensure that the OCR model version is up-to-date (e.g., PP-OCRv4).
  - Test with different detection algorithms (e.g., DB, EAST) to see if they better detect character boundaries.
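
If a hard black-and-white image is still wanted, the fixed cutoff of 200 from the original script can be replaced by one derived from the image itself. The sketch below implements Otsu's method on a 256-bin histogram. Note that this is a global, data-driven threshold, swapped in here for illustration, and not local adaptive thresholding; `otsu_threshold` is a hypothetical helper name.

```python
def otsu_threshold(hist: list[int]) -> int:
    """Return the gray level t that best separates pixels <= t from pixels > t."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of gray levels of those pixels
    for t, count in enumerate(hist):
        w_dark += count
        sum_dark += t * count
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t
```

With Pillow this could be used as `t = otsu_threshold(image.histogram())` followed by `image.point(lambda p: 255 if p > t else 0)`, assuming `image` is in mode 'L' so that the histogram has 256 bins.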

4. General Recommendations for Improved OCR Accuracy:

Drop Confidence Threshold (drop_score):
- Lower the drop_score parameter (default 0.5) so that low-confidence predictions, which often include punctuation and rare characters, are kept instead of discarded:
```
ocr = PaddleOCR(use_angle_cls=True, lang='ch', drop_score=0.3)
```
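
Keeping low-confidence results makes it more important to know which lines to proofread. The following is a small sketch that lists them, assuming the result layout used in this thread's scripts (one list per image, each entry being `[box, (text, confidence)]`); `low_confidence_lines` and the 0.8 cutoff are illustrative choices.

```python
def low_confidence_lines(result, threshold=0.8):
    """Collect (text, confidence) pairs whose confidence is below `threshold`."""
    flagged = []
    for page in result:
        # A page may be None or empty when no text was detected.
        for _box, (text, score) in page or []:
            if score < threshold:
                flagged.append((text, score))
    return flagged
```

Calling `low_confidence_lines(result)` right after `ocr.ocr(...)` gives a short review list instead of requiring a full re-read of the output.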

Post-Processing Corrections:

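Use a post-processing script to replace commonly misrecognized strings. Key each replacement on enough surrounding context that correct text is left alone: a blanket 正 → 凪 rule would also rewrite every legitimate 正, and mapping every straight quote to “ would lose the closing quotes. For example:

```
recognized_text = "然的正音..."  # example OCR output
corrections = {"然的正音": "然的凪音", "...": "…"}
for key, value in corrections.items():
    recognized_text = recognized_text.replace(key, value)
```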
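
Punctuation can be normalized in the same pass. The following is a minimal sketch, assuming the OCR output uses straight quotes and ASCII dots where the page has curly quotes and ellipses, and that quotes in a line are properly paired; `normalize_punctuation` is a hypothetical helper.

```python
import re


def normalize_punctuation(text: str) -> str:
    """Turn runs of three ASCII dots into … and pair up straight double quotes."""
    text = re.sub(r"\.{3}", "…", text)  # six dots become ……
    out = []
    opening = True  # the next straight quote opens a quotation
    for ch in text:
        if ch == '"':
            out.append("“" if opening else "”")
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)
```

Alternating between opening and closing quotes is a simplification: an unmatched quote in the OCR output flips every quote after it, so the function is best applied per line or per paragraph.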

Debugging Output:
- Visualize OCR detection results (draw_ocr) to verify if the bounding boxes accurately cover the text area. This helps diagnose issues with detection vs. recognition.

5. Relevant PaddleOCR Discussions:

The issue of missed detections and misrecognitions has been raised in PaddleOCR Discussion #13849. While no direct solution is provided, upgrading the OCR version (PP-OCRv4) and fine-tuning the model are suggested approaches.

Final Code Example:

Here’s an updated script incorporating the suggestions:

```
from paddleocr import PaddleOCR, draw_ocr
from PIL import Image, ImageOps

# Initialize PaddleOCR with language-specific model for Chinese
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # Try 'japan' or 'chinese_cht' for mixed content
    ocr_version='PP-OCRv4',
    drop_score=0.3,  # Include low-confidence results
    # Only set this for a model trained or fine-tuned with the same dictionary:
    # rec_char_dict_path='./custom_dict.txt',
)

# Preprocess image
img_path = 'path_to_image.png'
image = Image.open(img_path).convert('L')  # Grayscale
binary_image = ImageOps.autocontrast(image)  # Contrast stretch (not a threshold)
binary_image.save("preprocessed_binary.png")

# Perform OCR
result = ocr.ocr("preprocessed_binary.png", cls=True)
lines = result[0] or []  # result[0] may be None when no text is detected

# Extract and save recognized text
with open('recognized_text.txt', 'w', encoding='utf-8') as f:
    for line in lines:
        text, confidence = line[1][0], line[1][1]
        f.write(f"{text}\n")

# Visualize OCR results
image = Image.open("preprocessed_binary.png").convert('RGB')
boxes = [line[0] for line in lines]
txts = [line[1][0] for line in lines]
scores = [line[1][1] for line in lines]
im_show = draw_ocr(image, boxes, txts, scores, font_path='./simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
```

By making these adjustments, you should see improved recognition performance for Chinese text, punctuation, and rare characters.

Response generated by feifei-bot | chatgpt-4o-latest

1 reply

luiliRo Jan 7, 2025
Author

So after testing it again with

ocr = PaddleOCR(use_angle_cls=True, lang='japan', ocr_version='PP-OCRv4')

the missing characters I needed were there, although the rest of the Chinese text was recognized incorrectly. So the Japanese dictionary has them.

Looking at past thread #12747, would it be possible to take the existing Chinese dictionary and combine it with the Japanese dictionary?

Would this be feasible? It looks like I would need to retrain; would that cause any issues? I am pretty new to this, so I don't yet know how to train a model or what the hurdles are.

Right now the Chinese OCR handles about 95% of the text really well, with the last 5% being the Japanese characters and the "" marks.

jingsongliujing · 2025-01-07T06:00:39Z

Collaborator

You can refer to this document, which walks through custom training of the related models: https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html
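
As a starting point for the combined dictionary asked about above, the two files can be merged before training. The following is a sketch only, assuming both dictionaries are one-character-per-line UTF-8 text files; `merge_dicts` and all file names are hypothetical. A recognition model still has to be trained or fine-tuned against the merged file before it can output the added characters.

```python
def merge_dicts(base_path: str, extra_path: str, out_path: str) -> int:
    """Append entries from `extra_path` that `base_path` lacks; return how many were added."""
    with open(base_path, encoding="utf-8") as f:
        merged = [line.rstrip("\n") for line in f]
    seen = set(merged)
    added = 0
    with open(extra_path, encoding="utf-8") as f:
        for line in f:
            ch = line.rstrip("\n")
            if ch and ch not in seen:
                merged.append(ch)
                seen.add(ch)
                added += 1
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
    return added
```

A call such as `merge_dicts("ppocr_keys_v1.txt", "japan_dict.txt", "ch_plus_japan_dict.txt")` (paths assumed) keeps the Chinese entries first and in their original order, so only new characters are appended at the end.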
