Unwrapping paragraphs #23
Comments
Tesseract will probably be able to do this in the future: tesseract-ocr/tesseract#728. If Tesseract's recognition process can pick the right "dehyphenation" rules on a per-language basis, that's all we need. Otherwise, removing hyphens on the dpScreenOCR side will require either a library for natural language processing or at least a spell checking library. In either case, the task is not trivial, since the recognized text can contain fragments in different languages. It will also require users to install extra data in addition to Tesseract languages. Processing on the Tesseract side would definitely be the best solution, so I'd rather wait for tesseract-ocr/tesseract#728 for a while (although the issue is more than 5.5 years old :) |
Unfortunately, the "naive" algorithm will not work in most cases, removing hyphens when they should be kept, e.g., "twentieth-century music". If you don't mind this kind of de-hyphenation, you can do it in a script executed via the "Run executable" action. In fact, this way it's easy to implement the proper algorithm, which removes a hyphen only if the de-hyphenated word is in a list of valid words stored in a file. For French, you can download such a list here: https://salsa.debian.org/gpernot/wfrench/-/blob/master/french On Unix-like systems, you can also install this file (as |
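The word-list approach described above could be sketched roughly as follows. This is only an illustration, not the author's script: the function names `load_words` and `join_hyphenated` are made up here, only the ASCII hyphen is handled, and empty lines are simply dropped, so all paragraphs are joined into one.

```python
import io


def load_words(lines):
    """Build a lowercase set of valid words from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def join_hyphenated(text, valid_words):
    """Unwrap lines, dropping an end-of-line hyphen only when the
    joined word is found in valid_words."""
    out = ''
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Simplification: empty lines are ignored entirely.
            continue
        if not out:
            out = line
            continue
        if not out.endswith('-'):
            out += ' ' + line
            continue
        head, sep, last = out.rpartition(' ')
        first, sep2, rest = line.partition(' ')
        joined = last[:-1] + first
        # Ignore surrounding punctuation and case for the lookup.
        if joined.strip('.,;:!?"\'()').lower() in valid_words:
            out = head + sep + joined + sep2 + rest
        else:
            # Unknown word: keep the hyphen ("Etats-" + "Unis").
            out += line
    return out


# A real list would be read from a file; a tiny in-memory one is
# used here so the example is self-contained.
words = load_words(io.StringIO("beaucoup\nmusique\n"))
print(join_hyphenated("je m'entraîne beau-\ncoup pour les Etats-\nUnis.",
                      words))
```

With a real word list, `load_words` would be given an open file instead of `io.StringIO`; the file's encoding is something to check before relying on it.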
Thank you. But it looks like the argument from dpScreenOCR has no '\n' in $1. This means I cannot replace
This is what my script looks like:
Content of the file ScreenOCR.txt:
Confused: it looks like no '\n' has been replaced; they are all still there. Very strange. |
I'm not skilled enough in Bash, so here is a simple Python script that unwraps paragraphs using Aspell for spell checking. You will need to install the needed Aspell language (e.g. The script works not only with the ASCII hyphen, but also with other kinds of dashes (en dash, em dash, etc.). You may want to remove the second call to

    #!/usr/bin/env python3

    import datetime
    import os
    import subprocess
    import sys
    import unicodedata


    ASPELL_LANG = 'fr'
    APPEND_TO_FILE = os.path.expanduser("~/ocr_history.txt")


    def is_dash(c):
        # 'Pd' is the Unicode "dash punctuation" category.
        return unicodedata.category(c) == 'Pd'


    def is_valid_word(word):
        with subprocess.Popen(
                ('aspell',
                 '-a',
                 '--lang=' + ASPELL_LANG,
                 '--dont-suggest'),
                stdout=subprocess.PIPE,
                stdin=subprocess.PIPE,
                universal_newlines=True) as p:
            # ! to enter the terse mode (don't print * for correct words).
            # ^ to spell check the rest of the line.
            aspell_out = p.communicate(input='!\n^' + word)[0]

        # We use this function to check words both with and without
        # dashes. For a word with dashes, Aspell checks each
        # dash-separated part as an individual word.
        #
        # If all words are correct in the terse mode, the output will be
        # a version info and an empty line.
        return aspell_out.count('\n') == 2


    def unwrap_paragraphs(text, out_f):
        para = ''

        for line in text.splitlines():
            if not line:
                # Empty line is a paragraph separator
                if para:
                    out_f.write(para)
                    out_f.write('\n')
                    para = ''

                out_f.write('\n')
                continue

            if not para:
                para = line
                continue

            if not is_dash(para[-1]):
                para += ' '
                para += line
                continue

            para_rpartition = para.rpartition(' ')
            para_last_word = para_rpartition[2]

            line_lpartition = line.partition(' ')
            line_first_word = line_lpartition[0]

            word_with_dash = para_last_word + line_first_word
            word_without_dash = para_last_word[:-1] + line_first_word

            if (is_valid_word(word_without_dash)
                    # If the word is valid both with and without the
                    # dash, keep the dashed variant.
                    and not is_valid_word(word_with_dash)):
                para = (para_rpartition[0]
                        + para_rpartition[1]
                        + word_without_dash
                        + line_lpartition[1]
                        + line_lpartition[2])
            else:
                para += line

        if para:
            out_f.write(para)


    if __name__ == '__main__':
        with open(APPEND_TO_FILE, 'a', encoding='utf-8') as out_f:
            out_f.write(
                '=== {} ===\n\n'.format(
                    datetime.datetime.now().strftime(
                        "%Y-%m-%d %H:%M:%S")))
            unwrap_paragraphs(sys.argv[1], out_f)
            out_f.write('\n\n')

|
Thank you very much for your script. 👍 :-) I created the file "dpScreenOCRPython.py" with the content of this script and added its path in the "action" tab. Why only "more or less"? Result from Tesseract:
And of course your Python script converts this into:
So it looks like it is not enough for the Python script to look only at '\n'. It should also convert '\n\n'. Second: I will not use the action options ... |
To copy text to the clipboard, you can use

    if __name__ == '__main__':
        unwrap_paragraphs(sys.argv[1], sys.stdout)

This way, the script will print to standard output instead of a file, so you will be able to invoke it in a Bash script and then call

    #!/bin/bash
    TEXT=$(~/dpScreenOCRPython.py "$1")
    xsel --clipboard <<< "$TEXT"

Unfortunately, removing empty lines will unconditionally join all paragraphs. This is something that should be done on the Tesseract side; they already have an issue on the tracker: tesseract-ocr/tesseract#2155. If you don't mind removing all empty lines, you can do it with

    #!/usr/bin/env python3
    import sys

    lines = sys.argv[1].splitlines()
    for i, line in enumerate(lines):
        # Skip an empty line when the next line is also empty or
        # starts with a lowercase letter.
        if (not line
                and i + 1 < len(lines)
                and (not lines[i + 1]
                     or lines[i + 1][0].islower())):
            continue
        print(line)

You can combine both scripts like:

    #!/bin/bash
    TEXT=$(~/remove_empty_lines.py "$1")
    TEXT=$(~/dpScreenOCRPython.py "$TEXT")
    xsel --clipboard <<< "$TEXT"

|
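For anyone who would rather not maintain a separate Bash wrapper, the clipboard step could probably be done from Python as well. The sketch below is an assumption-laden illustration: `capture_output` and `copy_to_clipboard` are hypothetical helper names, `transform` stands for any function with the same `(text, out_f)` signature as `unwrap_paragraphs`, and `xsel` must be installed.

```python
import io
import subprocess


def capture_output(transform, text):
    """Run transform(text, out_f) and return what it wrote as a string."""
    buf = io.StringIO()
    transform(text, buf)
    return buf.getvalue()


def copy_to_clipboard(text, run=subprocess.run):
    """Feed text to `xsel --clipboard --input` on its standard input.

    The `run` parameter exists only so the call can be replaced in tests.
    """
    run(['xsel', '--clipboard', '--input'],
        input=text, universal_newlines=True, check=True)
```

In the main block of the unwrapping script this would be used as `copy_to_clipboard(capture_output(unwrap_paragraphs, sys.argv[1]))`, so that dpScreenOCR only has to run a single executable.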
Thank you very much. That's great stuff. I think this is good enough for my purpose (translating from French into German with DeepL). |
Follow-up: your original Python script (#23 (comment)) does two things:
into
into
Actually, in the meantime I would prefer a script that only makes the
replacement. Thank you. |
In the block that starts with |
Thank you. |
Is there any way to use it without aspell? |
You can replace aspell with another spell checker (e.g. hunspell), but without a spell checker the script will be useless since there will be no way to tell if a word without a hyphen is correct. |
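A hunspell-based replacement for `is_valid_word` might look roughly like the sketch below. It is untested against the original script and rests on several assumptions: the function name and the `run` parameter are invented for illustration, the dictionary name `fr_FR` depends on which hunspell dictionaries are installed, and how hunspell splits hyphenated words may differ from Aspell.

```python
import subprocess


def is_valid_word_hunspell(word, lang='fr_FR', run=subprocess.run):
    """Return True if hunspell reports no misspelling in word.

    With -l, hunspell prints only the misspelled words it finds on
    standard input, so empty output means everything was accepted.
    """
    result = run(['hunspell', '-d', lang, '-l'],
                 input=word + '\n',
                 stdout=subprocess.PIPE,
                 universal_newlines=True,
                 check=True)
    return result.stdout.strip() == ''
```

The rest of the unwrapping logic would stay the same; only the spell checker call changes.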
@danpla Okay, there are too many scripts. Which one should dpScreenOCR execute, the Bash one?

    Traceback (most recent call last):
      File "/home/tbb/dpScreenOCRPython.py", line 94, in <module>
        unwrap_paragraphs(sys.argv[1], out_f)
    IndexError: list index out of range

|
It looks like you called the script without an argument. |
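One way to make the script friendlier when run by hand is to fall back to standard input if no argument is given. This is a small sketch, not part of the original script; `get_input_text` is a hypothetical helper name.

```python
import sys


def get_input_text(argv, stdin=None):
    """Return argv[1] if present, otherwise everything read from stdin."""
    if len(argv) > 1:
        return argv[1]
    if stdin is None:
        stdin = sys.stdin
    return stdin.read()
```

Replacing `sys.argv[1]` with `get_input_text(sys.argv)` in the main block would avoid the `IndexError` and allow testing in a terminal by piping text into the script, while dpScreenOCR would still pass the text as the first argument.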
I already did it that way, but it doesn't work. I want the text fixed by dpScreenOCRPython.py to end up on my clipboard, but I guess I have to use the Bash script to achieve that. I don't know what to do; can you give instructions for someone who doesn't know any coding? |
@danpla I made it work somehow. Is there any way to turn "slee-py" into "sleepy", I mean when the hyphen is in the middle of a word? |
It should work automatically if you set English by changing |
@danpla You should add it as a feature to dpScreenOCR, though. Sometimes it does not work at all, which is weird. Thanks anyway. |
@danpla It works in the terminal (dpscreenocrpy) but not with the "Run executable" option. Should I enable other options (copy text to clipboard, add text to history)? |
By default, the Python script appends text to the |
@danpla But I don't understand: should I execute the Bash script or the Python script to get this working in dpScreenOCR, since dpScreenOCR can't execute multiple things? |
You should use the bash script (the piece of code that starts with You will need to disable the "Copy text to clipboard" action, since otherwise it will overwrite the clipboard text set by |
@danpla Thanks for the help. Can you make it work for examples like these? |
@danpla |
This option has no effect on how the script works. But if you're capturing several columns of text at once, then it probably makes sense to enable "Split text blocks," regardless of whether you're using this script. |
@danpla How can I make it work on Windows? |
It should work on Windows if you install Python and Aspell (http://aspell.net/win32/), but I haven't tested it. You should also make sure that the directory with the Aspell executable is in your PATH environment variable. |
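Whether Aspell is actually reachable through PATH can be checked from Python before starting any spell checking. The helper below is a sketch with a made-up name (`find_aspell`); it relies only on the standard `shutil.which`, which also resolves executable extensions on Windows.

```python
import shutil


def find_aspell(which=shutil.which):
    """Return the full path of the aspell executable or raise an error."""
    path = which('aspell')
    if path is None:
        raise RuntimeError(
            'aspell was not found; add its directory to PATH')
    return path
```

Calling `find_aspell()` once at the start of the script gives a clearer message than the `FileNotFoundError` that `subprocess.Popen` would otherwise raise.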
Thank you for the link. I suppose you wrote the script at https://github.com/Green0wl/ocrf, which does what was asked for in this thread. But now my question follows: Suggestion: maybe you could add options to your program so that the user can choose between different formatting modes. |
@Golddouble The script does what you asked for in points A and B (and C if there is no hyphen at the end of the line). When scanning your first picture, I get
French-specific characters are omitted because I don't have the package for that language installed. Here you can clearly see that the hyphen between
When scanning the second picture:
Looking at the result, I thought that maybe I should loop through each line again to remove the
Honestly, reading this thread, I can only see one implementation that you have been improving. You can try the script I sent you earlier and suggest specific improvements, because, unfortunately, I don't understand which formatting options you are talking about. I also use it for the same purpose as you: for DeepL. Already with this result it doesn't go crazy like it used to. Speaking of this comment, my script does all the same things you described in it. Thanks for the reply! Happy to make corrections! |
Thank you for creating this cool little app. But there is something important I really miss.
When I try to use dpScreenOCR with this picture
Then I get this output:
But what I would like to have is this:
Eh bien, je m’entraîne beaurcoup, je me prépare pour les championnats suisses, après je voudrais partir pour les Etats-Unis. Mais ma mère préfère rester en Europe.
What is the difference?
A) I prefer to have a mode that produces no line breaks.
B) Words split across a line break, like "beaur-
coup", should be written together: "beaurcoup".
C) Hyphenated words in which the hyphen is not a line-break hyphen (like "Etats-Unis"): the hyphen should be retained here.
If you cannot implement C), it would be good enough to have A) and B).
What do you think?