First Impressions of the OCR Tool tesseract #1

Open
leon0625 opened this issue Jan 8, 2018 · 0 comments
First Impressions of the OCR Tool tesseract

@(Tool usage)[tool usage, python]

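OCR (optical character recognition) means recognizing the text in an image.

Installing tesseract

GitHub repository
tesseract is a command-line program. pytesseract, installed later, is only a thin wrapper around it and still invokes the command line under the hood.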
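
To make the "thin wrapper" idea concrete, here is a minimal sketch of calling the tesseract command line from Python directly. It is not how pytesseract is implemented internally; the helper names `build_tesseract_command` and `run_tesseract` are hypothetical and exist only for illustration. It assumes `tesseract` is on PATH and relies on the CLI form `tesseract <image> stdout -l <langs>`.

```python
import subprocess


def build_tesseract_command(image_path, langs=("eng",)):
    """Build the argument list for a tesseract call that prints to stdout."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing <outputbase>.txt; -l takes languages joined by "+".
    return ["tesseract", image_path, "stdout", "-l", "+".join(langs)]


def run_tesseract(image_path, langs=("eng",)):
    """Run tesseract and return the recognized text.

    Assumes the tesseract executable is on PATH; raises
    subprocess.CalledProcessError if tesseract exits with an error.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, langs),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_tesseract(...) once tesseract is installed.
    print(build_tesseract_command("en_cn_test.png", ["chi_sim", "eng"]))
```

Keeping the command construction separate from the call makes it easy to inspect exactly what would be executed before tesseract is installed.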

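Download
Windows download link

Installation
After downloading, do not rush through the "Next" buttons in the installer: one of the steps lets you download the Chinese language packs.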
image

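Setting environment variables
After installation, two environment variables need to be set:

  1. Add the installation path to the PATH environment variable.
  2. Set the TESSDATA_PREFIX environment variable; otherwise tesseract cannot find the language packs.
    TESSDATA_PREFIX=D:\Program Files (x86)\Tesseract-OCR\tessdata

At this point the command-line tesseract is ready to use.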
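
The two settings above can be sanity-checked from Python before moving on. The following is a rough sketch using only the standard library; the function name `check_tesseract_setup` and its messages are made up for this example, and it only verifies that the variables look plausible, not that the language packs themselves are present.

```python
import os
import shutil


def check_tesseract_setup(environ=None):
    """Return a list of human-readable problems with the tesseract setup."""
    # Allow passing a custom mapping so the check can be tried without
    # touching the real environment (an assumption made for this sketch).
    environ = os.environ if environ is None else environ
    problems = []

    # Step 1: the install directory must be on PATH so "tesseract" resolves.
    if shutil.which("tesseract", path=environ.get("PATH", "")) is None:
        problems.append("tesseract executable not found on PATH")

    # Step 2: TESSDATA_PREFIX must be set and point at an existing directory.
    tessdata = environ.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(tessdata):
        problems.append("TESSDATA_PREFIX is not a directory: " + tessdata)

    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup():
        print("problem:", problem)
```

An empty list means both checks passed. Remember to open a new terminal after changing environment variables on Windows, since already-running shells keep the old values.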

Installing pytesseract

pip install pytesseract

Test program:

import pytesseract
from PIL import Image


# English is the default language
image = Image.open('en.png')
text = pytesseract.image_to_string(image)
print(text)

print("====================")

# Recognize Chinese; extremely slow
image = Image.open('cn.png')
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)

print("====================")

# Chinese and English together; extremely slow and error-prone
image = Image.open('en_cn_test.png')
text = pytesseract.image_to_string(image, lang='chi_sim+eng')
print(text)

Test results:


English test
en

Recognized text
enr


Chinese test
cn

Recognized text
cnr


Mixed Chinese and English test
en_cn_test

Recognized text
rrr


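Conclusion:
English recognition is decent and also fast. Chinese recognition struggles: it is slow and the accuracy is low, so the output is almost unusable as-is.

Reference:
Python--文字识别--Tesseract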
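
To put numbers on "fast" and "slow", each recognition call can be timed. The helper below is a small sketch using only the standard library; `timed_call` is a hypothetical name, and the commented usage assumes the `pytesseract` and `Image` imports and the image files from the test program above.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical usage with the test program's images:
#   text, seconds = timed_call(pytesseract.image_to_string,
#                              Image.open('cn.png'), lang='chi_sim')
#   print(f"chi_sim took {seconds:.2f}s")
```

Running it once per image and language setting gives a direct comparison between the English, Chinese, and mixed cases.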
