Replies: 2 comments 1 reply
-
To address the issues you're encountering with Chinese OCR using PaddleOCR, let's break down the problems and discuss possible solutions:
1. Issue with Missing Quotations ("") and Ellipses (...):
2. Incorrect Recognition of Characters (e.g., 凪 misrecognized as 正):
3. Characters Breaking into Subcomponents (e.g., 几 → 儿):
4. General Recommendations for Improved OCR Accuracy:
5. Relevant PaddleOCR Discussions:
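For item 2, one thing worth ruling out is whether the character is in the recognition model's character dictionary at all: a character that is missing from the dictionary can never appear in the output, whatever preprocessing is applied. Whether that is what happens with 凪 here is an assumption, since the thread does not confirm it. A minimal sketch, assuming the dictionary is a UTF-8 text file with one character per line; `load_dict_chars`, `missing_chars`, and the toy dictionary are hypothetical names for illustration:

```python
from pathlib import Path


def load_dict_chars(dict_path):
    """Read a recognition dictionary: one character per line, UTF-8."""
    lines = Path(dict_path).read_text(encoding="utf-8").splitlines()
    return {line for line in lines if line}


def missing_chars(text, dict_chars):
    """Return the distinct non-space characters of `text` not in the dictionary."""
    missing = []
    for ch in text:
        if ch.isspace() or ch in dict_chars or ch in missing:
            continue
        missing.append(ch)
    return missing


# Toy dictionary standing in for the real file (assumption: a real check
# would use load_dict_chars() on the dictionary your model was trained with).
toy_dict = {"然", "的", "正", "音", "几", "儿"}
print(missing_chars("然的凪音……", toy_dict))
```

If the character and the punctuation marks you expect are all present in the dictionary, the cause lies elsewhere (for example in the training data or the image), and this check only narrows things down.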
Final Code Example: Here's an updated script incorporating the suggestions:
# Written against the PaddleOCR 2.x Python API (ocr.ocr(..., cls=True), draw_ocr)
from paddleocr import PaddleOCR, draw_ocr
from PIL import Image, ImageOps

# Initialize PaddleOCR with the Chinese model
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='ch',  # Try 'japan' or 'chinese_cht' for mixed content
    ocr_version='PP-OCRv4',
    drop_score=0.3,  # Include low-confidence results
    # Optional custom dictionary; it must match the recognition model in use
    # rec_char_dict_path='./custom_dict.txt',
)

# Preprocess image
img_path = 'path_to_image.png'
image = Image.open(img_path).convert('L')  # Grayscale
binary_image = ImageOps.autocontrast(image)  # Contrast stretch (not a true threshold)
binary_image.save("preprocessed_binary.png")

# Perform OCR
result = ocr.ocr("preprocessed_binary.png", cls=True)
lines = result[0] or []  # result[0] can be None when no text is detected

# Extract and save recognized text
with open('recognized_text.txt', 'w', encoding='utf-8') as f:
    for line in lines:
        text, confidence = line[1][0], line[1][1]
        f.write(f"{text}\n")

# Visualize OCR results (font_path must point to a font with CJK glyphs)
image = Image.open("preprocessed_binary.png").convert('RGB')
boxes = [line[0] for line in lines]
txts = [line[1][0] for line in lines]
scores = [line[1][1] for line in lines]
im_show = draw_ocr(image, boxes, txts, scores, font_path='./simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

These adjustments may improve recognition of Chinese text, punctuation, and rare characters.
Response generated by feifei-bot | chatgpt-4o-latest
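Since the output is being proofread by hand anyway, the per-line score that the script reads as `line[1][1]` can help decide which lines to look at first. A minimal sketch, assuming the same `[box, (text, score)]` layout the script indexes into; `flag_low_confidence`, the 0.8 threshold, and the sample data are made up for illustration and are not real OCR output:

```python
def flag_low_confidence(page_result, threshold=0.8):
    """Return (text, score) pairs whose score is below `threshold`.

    `page_result` is one page of OCR output: a list of [box, (text, score)]
    entries, or None when nothing was detected.
    """
    flagged = []
    for _box, (text, score) in page_result or []:
        if score < threshold:
            flagged.append((text, score))
    return flagged


# Hand-written sample in the assumed layout (not real OCR output).
sample_page = [
    [[[0, 0], [90, 0], [90, 20], [0, 20]], ("然的正音", 0.62)],
    [[[0, 30], [90, 30], [90, 50], [0, 50]], ("几天后", 0.97)],
]

for text, score in flag_low_confidence(sample_page):
    print(f"check by hand: {text} (score {score:.2f})")
```

This is only triage: a character that is confidently misread can still score high, so a clean report does not mean the page is free of errors.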
-
You can refer to this document, which can guide you through custom training of the related models: https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html
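If you go the custom-training route, the first step is usually collecting cropped text-line images with their correct transcriptions. A rough sketch of preparing that data, assuming the recognition label format of one `image_path<TAB>text` pair per line (please check this against the linked document before relying on it); the crop paths, sample texts, and both helper functions are hypothetical:

```python
def make_label_lines(samples):
    """Format (image_path, text) pairs as tab-separated label lines."""
    lines = []
    for image_path, text in samples:
        if "\t" in image_path or "\t" in text or "\n" in text:
            raise ValueError(f"tab/newline not allowed in sample: {image_path!r}")
        lines.append(f"{image_path}\t{text}")
    return lines


def build_char_dict(samples):
    """Collect every distinct character used in the labels, sorted."""
    return sorted({ch for _, text in samples for ch in text})


# Hypothetical crops containing the characters that were misread.
samples = [
    ("crops/line_001.png", "然的凪音"),
    ("crops/line_002.png", "“几天后……”"),
]

label_lines = make_label_lines(samples)
char_dict = build_char_dict(samples)
print("\n".join(label_lines))
print("凪" in char_dict)
```

The character list built here only covers the sample labels. In practice it would be merged with the dictionary of the model being fine-tuned, so that characters such as 凪 and the punctuation marks are added without dropping existing ones.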
-
Hi,
I have several images of Chinese text and have been trying to OCR them with PaddleOCR. It was working well, but I noticed that it misses quotation marks ("") and ellipses, so instead of the text below I get:
![image](https://private-user-images.githubusercontent.com/194041774/400611658-d5b7859d-78ad-40ac-8739-b3bbbc78cc8c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NTk3NjAsIm5iZiI6MTczODk1OTQ2MCwicGF0aCI6Ii8xOTQwNDE3NzQvNDAwNjExNjU4LWQ1Yjc4NTlkLTc4YWQtNDBhYy04NzM5LWIzYmJiYzc4Y2M4Yy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwN1QyMDE3NDBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT03NzkzN2E5ZWRlMGMzNjJiZjc4ZWJjN2YwZjg1NWMzODcyOTExMTBmOThkZWI0MGQxMjdmNjJjYjEwM2Y3N2Y5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.U-zeUVjF5IRHrP7RZ5r3dQLx4Kn8vTvcSLmpRWPcfXI)
![image](https://private-user-images.githubusercontent.com/194041774/400611790-76da80eb-dbcb-401b-8483-1034cfdbf600.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NTk3NjAsIm5iZiI6MTczODk1OTQ2MCwicGF0aCI6Ii8xOTQwNDE3NzQvNDAwNjExNzkwLTc2ZGE4MGViLWRiY2ItNDAxYi04NDgzLTEwMzRjZmRiZjYwMC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwN1QyMDE3NDBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1lNjMxNDE2ZmE2N2Q5NjJlZjkwMTdkYTE4NDNlNTFjMDQ0ZmFjZTVlNjczMTA4NmI1ODZlMjZlOTY5Yjg2NGM4JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.hNq5yATkaLSdHkQfPbburcZCoGGDgCKTpFQVA12Z91U)
Then I noticed that some of the characters would get cut off, so 几 would get "cut" into 儿. I somewhat managed to fix these two issues with image preprocessing.
While proofreading, I noticed one part:
![image](https://private-user-images.githubusercontent.com/194041774/400610235-b47be04b-0c24-4bea-9c25-70104774e414.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NTk3NjAsIm5iZiI6MTczODk1OTQ2MCwicGF0aCI6Ii8xOTQwNDE3NzQvNDAwNjEwMjM1LWI0N2JlMDRiLTBjMjQtNGJlYS05YzI1LTcwMTA0Nzc0ZTQxNC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwN1QyMDE3NDBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0xNTIyNDk4YjliMGZjYzNmNDQ2NDI0MTE1MjIwM2E4NmUzMTk0ZDU1NGRhMzlmYTU0M2ZhNDljMzgwMjRmN2RmJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.aZ4aXEEUtpidXvxTCUg2as0oFhK6uIqe3MV8ycOO8jc)
![image](https://private-user-images.githubusercontent.com/194041774/400610599-abddec6d-8bb2-4f80-9d52-7211cd911ef4.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NTk3NjAsIm5iZiI6MTczODk1OTQ2MCwicGF0aCI6Ii8xOTQwNDE3NzQvNDAwNjEwNTk5LWFiZGRlYzZkLThiYjItNGY4MC05ZDUyLTcyMTFjZDkxMWVmNC5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA3JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwN1QyMDE3NDBaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1iZTFhNGNkN2QwNGVhMDQ4MTA1ZGU2MWNhY2MyM2ZlNjM1MDhiMTIyMTc0NGVjMDE0YzZhMjQyZWVjMjJmYjdmJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.XyTlISDJUan4o86UJbhJMupXNG0EjhI4tFoIvoP1q4I)
This specific character, 凪, would not get OCR'd no matter what. It would instead be replaced by 正, as in 然的正音.
This text is from a Chinese story with Japanese themes, such as Japanese-style names.
So I am wondering if someone might know how to OCR this properly, with quotation marks, periods, and ellipses fully captured, and without getting the wrong Chinese characters.
For more context, the base image I use has been upscaled, and the upscale is quite good, as it did not bleed or mush.
Here is the code I have been using so far: