Add spellchecking #2

Theelx · 2021-01-01T19:24:38Z

I love this library already, I've been looking for something like this for a project of mine for months now! However, I saw the README said this about the project:
"and without all the extra stuff LanguageTool does such as spellchecking, n-gram based error detection, etc."

It would be super nice to have the spellchecking part of LanguageTool in this library, as spellchecking is one of the most used features in many, if not all, general-purpose NLP libraries. I'm only good at Python though, so I personally can't help until I focus more on improving my Rust :(

The text was updated successfully, but these errors were encountered:

bminixhofer · 2021-01-01T21:38:37Z

Hi, thanks! That's great to hear.

I thought LanguageTool used HunSpell which is why I did not consider porting it since HunSpell already has good Python ports.

However, reading the docs they say:

Another speller, based on finite-state automata (FSA), was included because of the speed problems with hunspell. The dictionary for the speller is built in the way similar to Developing a tagger dictionary. All it requires is a list of valid words in a language.

so porting that functionality to Rust might actually be interesting. Thanks for bringing that up, and I'll mark it as an enhancement for now.

I'll take a closer look once some more important things w.r.t speed and rule coverage are done.

bminixhofer · 2021-01-01T21:43:12Z

In the meantime you could use one of the Python wrappers for HunSpell before applying NLPRule. For separation of concerns I do not want to depend on HunSpell in NLPRule.

Theelx · 2021-01-01T22:36:39Z

Sounds good! I'm looking for speedy libraries though, so while yours is appealing, HunSpell is less so. I'll wait however long it takes for you to do your Rust magic ;)

bminixhofer · 2021-01-19T16:08:03Z

Just tracking some info here. There is some information at https://dev.languagetool.org/hunspell-support#morfologik. Specifically, LT uses a dictionary file that looks like this:

(base) ➜  ~/Downloads/lt/LanguageTool-5.2 java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i org/languagetool/resource/en/hunspell/en_GB.dict -info org/languagetool/resource/en/hunspell/en_GB.info -o /tmp/out.txt
Running Morfologik FSADecompile.main with these options: [--exit, false, -i, org/languagetool/resource/en/hunspell/en_GB.dict, -o, /var/folders/9j/zjq9wc317vdgzkrn6rlf9jdc0000gn/T/DictionaryExporter_separator10930816984953458335.txt]
Done. The dictionary export has been written to /tmp/out.txt
(base) ➜  ~/Downloads/lt/LanguageTool-5.2 head -n30 /tmp/out.txt
A
A	"
A
A	#
A
A	&
A
A	'
A
A	(
A
A	)
A
A	+
A
A	,
A
A	-lantern
A
A	-lanterns
A
A	-war
A
A	.
A
A	0
A
A	0th
A
A	1024x768
(base) ➜  ~/Downloads/lt/LanguageTool-5.2 wc -l /tmp/out.txt 
  508876 /tmp/out.txt

As far as I can tell this is the only resource used. It is parsed by the Morfologik library: https://github.com/morfologik/morfologik-stemming/tree/master/morfologik-speller.

bminixhofer · 2021-01-27T21:48:25Z

I am now certain that this is functionality nlprule should have. Some rules suppress misspellings so an integration of spellchecking with NLPRule is definitely better than spellchecking as a separate step in the pipeline. I'm still not sure when I'll get around to implementing this but it will be done.

Theelx · 2021-01-27T22:03:49Z

Thanks!

drahnr · 2021-02-26T09:02:04Z

Imho this is mixing concerns, making srx a plugable module, and just operate on tokenized streams, might be a better idea. Then one can use nlprule-spellcheck as an optional separate entity rather than mixing concerns in one library (while closely related).

bminixhofer · 2021-02-26T10:29:39Z

First off, for context, I think a custom implementation of the SymSpell algorithm for spellchecking will be best.

The hard constraint is that using the Python bindings enabling spellchecking has to be as simple as:

from nlprule import Tokenizer, Rules

# not necessarily exactly like this but equivalently easy
tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer, spellcheck=True)

Additionally the APIs of the Python lib and Rust crate are the same at the moment, it would be good to keep it that way.

Mixing of concerns is a valid point but in my opinion it is not an issue if it is (a) cleanly separated within the crate (easy) and (b) the effect of including spellchecking if disabled is negligible (not so easy).

Regarding (b) I think the key issue is size of the binary but it should be possible to store the data in the rules.bin without too much overhead. There is a list of valid words in the tokenizer anyway so the only thing that needs to be stored are the precomputed deletes - I hope it's possible to store these in a sufficiently compact way.

It would of course be possible to split nlprule into one high-level crate (equiv. to the Python API) and many subcrates for spellchecking, rule-based checking, sentence segmentation etc. but that would open up a bunch of other issues (public API of each crate (e. g. if the compilation step happens in another crate lots of things that should be private have to be public), how to separate the binaries, ...) and I don't think it is necessary if (b) can be done.

bminixhofer · 2021-02-27T13:33:08Z

split nlprule into one high-level crate (equiv. to the Python API) and many subcrates for spellchecking, rule-based checking, sentence segmentation etc.

I though about this a bit more and I do now think that something like this could make sense and be worth doing.

It would certainly need a significant amount of work though. I'll open another issue for modularizing the crate.

I'll still implement spellchecking the way I mentioned above though because I want the functionality as soon as possible. I have a mostly working version locally where the Rules load slightly longer (only if spellchecking is enabled) but there is almost zero impact on the binary size which adresses (b) from above.

bminixhofer · 2021-03-06T18:00:33Z

There is now a draft PR with an implementation of this: #51.

bminixhofer added the enhancement New feature or request label Jan 1, 2021

bminixhofer changed the title ~~Amazing library; Feature request~~ Investigate porting spellchecker from LanguageTool Jan 3, 2021

neoscopio mentioned this issue Jan 9, 2021

Make rules more inspectable #5

Closed

bminixhofer changed the title ~~Investigate porting spellchecker from LanguageTool~~ Add spellchecking Jan 27, 2021

bminixhofer added the P1 High priority label Feb 3, 2021

bminixhofer mentioned this issue Feb 15, 2021

Add a user config #29

Closed

bminixhofer mentioned this issue Feb 22, 2021

Roadmap #42

Open

5 tasks

bminixhofer mentioned this issue Feb 26, 2021

Support more languages #46

Open

bminixhofer mentioned this issue Feb 26, 2021

Configurability #47

Closed

bminixhofer mentioned this issue Mar 1, 2021

Modularizing the crate #50

Open

bminixhofer mentioned this issue Apr 15, 2021

Grammar check fails #67

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add spellchecking #2

Add spellchecking #2

Theelx commented Jan 1, 2021

bminixhofer commented Jan 1, 2021

bminixhofer commented Jan 1, 2021

Theelx commented Jan 1, 2021

bminixhofer commented Jan 19, 2021

bminixhofer commented Jan 27, 2021

Theelx commented Jan 27, 2021

drahnr commented Feb 26, 2021 •

edited

Loading

bminixhofer commented Feb 26, 2021

bminixhofer commented Feb 27, 2021

bminixhofer commented Mar 6, 2021

Add spellchecking #2

Add spellchecking #2

Comments

Theelx commented Jan 1, 2021

bminixhofer commented Jan 1, 2021

bminixhofer commented Jan 1, 2021

Theelx commented Jan 1, 2021

bminixhofer commented Jan 19, 2021

bminixhofer commented Jan 27, 2021

Theelx commented Jan 27, 2021

drahnr commented Feb 26, 2021 • edited Loading

bminixhofer commented Feb 26, 2021

bminixhofer commented Feb 27, 2021

bminixhofer commented Mar 6, 2021

drahnr commented Feb 26, 2021 •

edited

Loading