Starting conversion of ./Testing Double Text.pdf
WARNING: Empty pdf, cannot determine dpi using pdfimages
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.12.6 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
pdf2txt.py Testing\ Double\ Text_ocr.pdf
Testing Double Text
Testing Double Text
Testing Double Text
Testing Double Text
Testing Double Text
Testing Double Text
To me it looks like it may be adding an OCR layer to a file that already has a text layer, thus doubling the text?
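One way to test that guess is to run the text extractor on the original PDF, before pypdfocr touches it, and see whether any text comes back. Below is a minimal sketch of that check. It assumes `pdf2txt.py` (the pdfminer script used above) is on the PATH; the helper names `extract_text` and `has_text_layer` are made up for illustration and are not part of pypdfocr or pdfminer.

```python
import subprocess


def extract_text(pdf_path):
    """Return whatever text pdf2txt.py extracts from pdf_path.

    Assumes the pdf2txt.py script from pdfminer is installed and on the PATH.
    """
    result = subprocess.run(
        ["pdf2txt.py", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(pdf_path, extractor=extract_text):
    """Return True if the extractor finds any non-whitespace text.

    The extractor is a parameter so the check can be exercised
    without a real PDF or a pdfminer install.
    """
    return bool(extractor(pdf_path).strip())
```

If `has_text_layer("Testing Double Text.pdf")` is already true for the input file, the doubled output would be consistent with a second text layer being added on top of the existing one, though this check alone does not prove that is what pypdfocr does.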
Hi Virantha,
I'm in the process of OCRing PDFs of newspaper articles, but the module seems to be doubling the text of each document.
For example, if the document reads:
"XXXXXX
YYYYY
ZZZZZZZZZ"
The output of pypdfocr will read:
"XXXXXX
XXXXXX
YYYYY
YYYYY
ZZZZZZZZZ
ZZZZZZZZZ"
Any idea how to fix this problem? Is there a way to increase/decrease the resolution that pypdfocr (Tesseract) employs?
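Until the cause is found, the extracted text can be cleaned up after the fact. Here is a rough sketch of a workaround that collapses runs of identical consecutive lines. It does not change anything in pypdfocr or Tesseract, and `collapse_repeated_lines` is a hypothetical helper name, not an existing function in either tool.

```python
from itertools import groupby


def collapse_repeated_lines(text):
    """Collapse each run of identical consecutive lines into one line.

    Caveat: a line that is legitimately repeated back to back in the
    source document is collapsed too, as are runs of blank lines.
    """
    lines = text.splitlines()
    # groupby yields one key per run of equal adjacent items.
    return "\n".join(key for key, _ in groupby(lines))
```

Applied to the doubled output in the example above, this returns each of the three lines once. Lines that repeat with other text in between are left alone, since only adjacent duplicates are merged.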