Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spellchecking #2

Open
Theelx opened this issue Jan 1, 2021 · 10 comments
Open

Add spellchecking #2

Theelx opened this issue Jan 1, 2021 · 10 comments
Labels
enhancement New feature or request P1 High priority

Comments

@Theelx
Copy link

Theelx commented Jan 1, 2021

I love this library already, I've been looking for something like this for a project of mine for months now! However, I saw the README said this about the project:
"and without all the extra stuff LanguageTool does such as spellchecking, n-gram based error detection, etc."

It would be super nice to have the spellchecking part of LanguageTool in this library, as spellchecking is one of the most used features in many, if not all, general-purpose NLP libraries. I'm only good at Python though, so I personally can't help until I focus more on improving my Rust :(

@bminixhofer
Copy link
Owner

Hi, thanks! That's great to hear.

I thought LanguageTool used HunSpell which is why I did not consider porting it since HunSpell already has good Python ports.

However, reading the docs they say:

Another speller, based on finite-state automata (FSA), was included because of the speed problems with hunspell. The dictionary for the speller is built in the way similar to Developing a tagger dictionary. All it requires is a list of valid words in a language.

so porting that functionality to Rust might actually be interesting. Thanks for bringing that up, and I'll mark it as an enhancement for now.

I'll take a closer look once some more important things w.r.t speed and rule coverage are done.

@bminixhofer bminixhofer added the enhancement New feature or request label Jan 1, 2021
@bminixhofer
Copy link
Owner

In the meantime you could use one of the Python wrappers for HunSpell before applying NLPRule. For separation of concerns I do not want to depend on HunSpell in NLPRule.

@Theelx
Copy link
Author

Theelx commented Jan 1, 2021

Sounds good! I'm looking for speedy libraries though, so while yours is appealing, HunSpell is less so. I'll wait however long it takes for you to do your Rust magic ;)

@bminixhofer bminixhofer changed the title Amazing library; Feature request Investigate porting spellchecker from LanguageTool Jan 3, 2021
@bminixhofer
Copy link
Owner

Just tracking some info here. There is some information at https://dev.languagetool.org/hunspell-support#morfologik. Specifically, LT uses a dictionary file that looks like this:

(base) ➜  ~/Downloads/lt/LanguageTool-5.2 java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i org/languagetool/resource/en/hunspell/en_GB.dict -info org/languagetool/resource/en/hunspell/en_GB.info -o /tmp/out.txt
Running Morfologik FSADecompile.main with these options: [--exit, false, -i, org/languagetool/resource/en/hunspell/en_GB.dict, -o, /var/folders/9j/zjq9wc317vdgzkrn6rlf9jdc0000gn/T/DictionaryExporter_separator10930816984953458335.txt]
Done. The dictionary export has been written to /tmp/out.txt
(base) ➜  ~/Downloads/lt/LanguageTool-5.2 head -n30 /tmp/out.txt
A
A	"
A
A	#
A
A	&
A
A	'
A
A	(
A
A	)
A
A	+
A
A	,
A
A	-lantern
A
A	-lanterns
A
A	-war
A
A	.
A
A	0
A
A	0th
A
A	1024x768
(base) ➜  ~/Downloads/lt/LanguageTool-5.2 wc -l /tmp/out.txt 
  508876 /tmp/out.txt

As far as I can tell this is the only resource used. It is parsed by the Morfologik library: https://github.com/morfologik/morfologik-stemming/tree/master/morfologik-speller.

@bminixhofer
Copy link
Owner

I am now certain that this is functionality nlprule should have. Some rules suppress misspellings so an integration of spellchecking with NLPRule is definitely better than spellchecking as a separate step in the pipeline. I'm still not sure when I'll get around to implementing this but it will be done.

@bminixhofer bminixhofer changed the title Investigate porting spellchecker from LanguageTool Add spellchecking Jan 27, 2021
@Theelx
Copy link
Author

Theelx commented Jan 27, 2021

Thanks!

@bminixhofer bminixhofer added the P1 High priority label Feb 3, 2021
@bminixhofer bminixhofer mentioned this issue Feb 22, 2021
5 tasks
@drahnr
Copy link
Contributor

drahnr commented Feb 26, 2021

Imho this is mixing concerns, making srx a plugable module, and just operate on tokenized streams, might be a better idea. Then one can use nlprule-spellcheck as an optional separate entity rather than mixing concerns in one library (while closely related).

@bminixhofer
Copy link
Owner

First off, for context, I think a custom implementation of the SymSpell algorithm for spellchecking will be best.

The hard constraint is that using the Python bindings enabling spellchecking has to be as simple as:

from nlprule import Tokenizer, Rules

# not necessarily exactly like this but equivalently easy
tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer, spellcheck=True)

Additionally the APIs of the Python lib and Rust crate are the same at the moment, it would be good to keep it that way.

Mixing of concerns is a valid point but in my opinion it is not an issue if it is (a) cleanly separated within the crate (easy) and (b) the effect of including spellchecking if disabled is negligible (not so easy).

Regarding (b) I think the key issue is size of the binary but it should be possible to store the data in the rules.bin without too much overhead. There is a list of valid words in the tokenizer anyway so the only thing that needs to be stored are the precomputed deletes - I hope it's possible to store these in a sufficiently compact way.

It would of course be possible to split nlprule into one high-level crate (equiv. to the Python API) and many subcrates for spellchecking, rule-based checking, sentence segmentation etc. but that would open up a bunch of other issues (public API of each crate (e. g. if the compilation step happens in another crate lots of things that should be private have to be public), how to separate the binaries, ...) and I don't think it is necessary if (b) can be done.

@bminixhofer
Copy link
Owner

split nlprule into one high-level crate (equiv. to the Python API) and many subcrates for spellchecking, rule-based checking, sentence segmentation etc.

I though about this a bit more and I do now think that something like this could make sense and be worth doing.

It would certainly need a significant amount of work though. I'll open another issue for modularizing the crate.

I'll still implement spellchecking the way I mentioned above though because I want the functionality as soon as possible. I have a mostly working version locally where the Rules load slightly longer (only if spellchecking is enabled) but there is almost zero impact on the binary size which adresses (b) from above.

@bminixhofer
Copy link
Owner

There is now a draft PR with an implementation of this: #51.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request P1 High priority
Projects
None yet
Development

No branches or pull requests

3 participants