
Benchmark against LanguageTool #6

Open
bminixhofer opened this issue Jan 7, 2021 · 17 comments
@bminixhofer
Owner

bminixhofer commented Jan 7, 2021

There is now a benchmark in bench/__init__.py. It computes suggestions from LanguageTool via language-tool-python and NLPRule on 10k sentences from Tatoeba and compares the times.
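As a rough illustration of what the benchmark does (names here are illustrative, not the actual code in bench/__init__.py): time both checkers on the same sentences and count how many suggestions agree.

```python
import time

def compare(check_lt, check_nlprule, sentences):
    """Time two checker callables on the same sentences and count agreement."""
    times = {"lt": 0.0, "nlprule": 0.0}
    same = 0
    for s in sentences:
        start = time.perf_counter()
        lt_sugg = check_lt(s)
        times["lt"] += time.perf_counter() - start

        start = time.perf_counter()
        np_sugg = check_nlprule(s)
        times["nlprule"] += time.perf_counter() - start

        # count NLPRule suggestions that also appear in the LT output
        same += sum(1 for x in np_sugg if x in lt_sugg)
    return times, same
```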

Here's the output for German:

(base) bminixhofer@pop-os:~/Documents/Projects/nlprule/bench$ python __init__.py --lang=de
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [01:24<00:00, 118.30it/s]
LanguageTool time: 63.019s
NLPRule time: 21.348s

n LanguageTool suggestions: 368
n NLPRule suggestions: 314
n same suggestions: 304

and for English:

(base) bminixhofer@pop-os:~/Documents/Projects/nlprule/bench$ python __init__.py --lang=en
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [05:57<00:00, 27.98it/s]
LanguageTool time: 305.641s
NLPRule time: 51.267s

n LanguageTool suggestions: 282
n NLPRule suggestions: 247
n same suggestions: 235

I disabled spellchecking in LanguageTool.
LT gives more suggestions because NLPRule does not support all rules.
Not all NLPRule suggestions are the same as LT's, likely because of differences in rule priority; I'll look into that a bit more closely.

Correcting for the Java rules in LT and for the fact that NLPRule only supports 85-90% of LT rules (by dividing the NLPRule time by 0.8), then normalizing, gives the following table:

         NLPRule time   LanguageTool time
English  1              4.77
German   1              2.36
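The correction works out as follows, using the raw timings above (a quick sanity check of the table, not part of the benchmark code):

```python
# Divide the NLPRule time by 0.8 (it supports only ~85-90% of LT rules),
# then take the ratio of the LanguageTool time to the corrected time.
lt_en, nlprule_en = 305.641, 51.267
lt_de, nlprule_de = 63.019, 21.348

ratio_en = lt_en / (nlprule_en / 0.8)
ratio_de = lt_de / (nlprule_de / 0.8)
print(round(ratio_en, 2))  # 4.77
print(round(ratio_de, 2))  # 2.36
```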

These numbers are of course not 100% accurate but should at least give a ballpark estimate of performance.
I'll keep this issue open for discussion / improving the benchmark.

@danielnaber

Interesting! I guess my task is now to prove that LT can be faster :-) What settings did you use, i.e. how is the Java LT process started?

@danielnaber

Also, the JVM takes quite some time to become fast, so the first maybe 20-100 calls per language should be ignored for the performance test (unless you want to test start-up speed, but we know Java is slow in that case).
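The warm-up suggestion above could be sketched like this (illustrative Python, not part of the actual benchmark): discard the first N timed calls so JIT compilation does not skew the average.

```python
import time

WARMUP = 100  # number of initial calls to ignore while the JVM warms up

def timed_check(check_fn, sentences, warmup=WARMUP):
    """Run check_fn over all sentences, but only accumulate time after warm-up."""
    total = 0.0
    for i, s in enumerate(sentences):
        start = time.perf_counter()
        check_fn(s)
        elapsed = time.perf_counter() - start
        if i >= warmup:  # skip the first `warmup` measurements
            total += elapsed
    return total
```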

@bminixhofer
Owner Author

Great. The benchmark code is here: https://github.com/bminixhofer/nlprule/blob/master/bench/__init__.py I use the language-tool-python package and manually disabled spellchecking for English and German. I'd like to keep using the Python bindings for LT so the suggestion content is easy to compare. I think it's reasonably fair; there shouldn't be much overhead from the Python side.

the JVM takes quite some time to become fast, so the first maybe 20-100 calls per language should be ignored

Right, I'll check if that makes an impact.

@danielnaber

I wanted to give it a try, but I get:

$ python3 __init__.py --lang=en
Traceback (most recent call last):
  File "__init__.py", line 99, in <module>
    nlprule_instance = NLPRule(args.lang)
  File "__init__.py", line 53, in __init__
    self.tokenizer = nlprule.Tokenizer.load(lang_code)
ValueError: error decoding response body: operation timed out

Any idea?

@bminixhofer
Owner Author

Do you have an internet connection in your environment? .load downloads the binary from the Releases the first time and then caches it.

Alternatively you can try manually downloading the binaries and replacing the .load parts with:

self.tokenizer = nlprule.Tokenizer(f"{lang_code}_tokenizer.bin")
self.rules = nlprule.Rules(f"{lang_code}_rules.bin", self.tokenizer)

(assuming you downloaded the binaries to the same directory you are running the script from)

@danielnaber

danielnaber commented Jan 9, 2021

Indeed, downloading from GitHub was very slow, and your workaround helped. This is what I get with some changes to the setup and LT 5.2:

  • I've deactivated all Java rules (see below)
  • I've skipped the first 100 requests for timing to give the JVM time to warm up
  • I've used a local LT server with proper caching, most importantly: pipelineCaching=true, maxPipelinePoolSize=500, pipelineExpireTimeInSeconds=3600 in the properties file specified with --config

EN

LanguageTool time: 44.190s
NLPRule time: 136.175s

n LanguageTool suggestions: 273
n NLPRule suggestions: 247
n same suggestions: 237

DE

LanguageTool time: 46.352s
NLPRule time: 62.294s

n LanguageTool suggestions: 347
n NLPRule suggestions: 314
n same suggestions: 307

EN + DE Java Rules

        self.tool.disabled_rules = {
          "MORFOLOGIK_RULE_EN_US", "GERMAN_SPELLER_RULE","COMMA_PARENTHESIS_WHITESPACE", "DOUBLE_PUNCTUATION", "UPPERCASE_SENTENCE_START", "WHITESPACE_RULE", "SENTENCE_WHITESPACE", "WHITESPACE_PARAGRAPH", "WHITESPACE_PARAGRAPH_BEGIN", "EMPTY_LINE", "TOO_LONG_SENTENCE", "TOO_LONG_PARAGRAPH", "PARAGRAPH_REPEAT_BEGINNING_RULE", "PUNCTUATION_PARAGRAPH_END", "PUNCTUATION_PARAGRAPH_END2", "EN_SPECIFIC_CASE", "EN_UNPAIRED_BRACKETS", "ENGLISH_WORD_REPEAT_RULE", "EN_A_VS_AN", "ENGLISH_WORD_REPEAT_BEGINNING_RULE", "EN_COMPOUNDS", "EN_CONTRACTION_SPELLING", "ENGLISH_WRONG_WORD_IN_CONTEXT", "EN_DASH_RULE", "EN_WORD_COHERENCY", "EN_DIACRITICS_REPLACE", "EN_PLAIN_ENGLISH_REPLACE", "EN_REDUNDANCY_REPLACE", "EN_SIMPLE_REPLACE", "READABILITY_RULE_SIMPLE", "READABILITY_RULE_DIFFICULT",
          "DE_SIMPLE_REPLACE", "OLD_SPELLING", "DE_SENTENCE_WHITESPACE", "DE_DOUBLE_PUNCTUATION", "MISSING_VERB", "GERMAN_WORD_REPEAT_RULE", "GERMAN_WORD_REPEAT_BEGINNING_RULE", "GERMAN_WRONG_WORD_IN_CONTEXT", "DE_AGREEMENT", "DE_AGREEMENT2", "DE_CASE", "DE_DASH", "DE_VERBAGREEMENT", "DE_SUBJECT_VERB_AGREEMENT", "DE_WORD_COHERENCY", "DE_SIMILAR_NAMES", "DE_WIEDER_VS_WIDER", "STYLE_REPEATED_WORD_RULE_DE", "DE_COMPOUND_COHERENCY", "TOO_LONG_SENTENCE_DE", "FILLER_WORDS_DE", "GERMAN_PARAGRAPH_REPEAT_BEGINNING_RULE", "DE_DU_UPPER_LOWER", "EINHEITEN_METRISCH", "COMMA_BEHIND_RELATIVE_CLAUSE", "COMMA_IN_FRONT_RELATIVE_CLAUSE", "READABILITY_RULE_SIMPLE_DE", "READABILITY_RULE_DIFFICULT_DE", "COMPOUND_INFINITIV_RULE", "STYLE_REPEATED_SHORT_SENTENCES", "STYLE_REPEATED_SENTENCE_BEGINNING" 
          }

@bminixhofer
Owner Author

Exciting! Thanks for the adjustments. I cannot quite reproduce these numbers for German:

(base) bminixhofer@pop-os:~/Documents/Projects/nlprule$ python bench/__init__.py --lang de
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:47<00:00, 210.58it/s]
LanguageTool time: 26.881s
NLPRule time: 19.982s

n LanguageTool suggestions: 347
n NLPRule suggestions: 314
n same suggestions: 307

But I can for English:

100%|██████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [01:18<00:00, 127.95it/s]
LanguageTool time: 26.802s
NLPRule time: 50.398s

n LanguageTool suggestions: 273
n NLPRule suggestions: 247
n same suggestions: 237

I:

  • skipped the first 100 measurements
  • deactivated the rules as pasted by you
  • used the following config file:

pipelineCaching=true
maxPipelinePoolSize=500
pipelineExpireTimeInSeconds=3600

and this command to start the server (LT 5.2):
java -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8081 --allow_origin "*" --config config.cfg

So it seems there's still a lot of potential for improvement in NLPRule; I'll see if I can add some optimizations, especially related to caching :)

Are you using Python 3.8? I noticed that pyo3 got quite a bit faster with 3.8, maybe that is related to the mismatch for German.

@danielnaber

I'm using Python 3.8.3. Here's a slightly more complete config file, maybe the other cache settings are more important than I thought (the file I actually used has even more settings, but these require more setup and I don't think they improve performance):

cacheSize=10000
cacheTTLSeconds=600
maxCheckThreads=6
maxWorkQueueSize=100
maxErrorsPerWordRate=0.6
pipelineCaching=true
maxPipelinePoolSize=500
pipelineExpireTimeInSeconds=3600

@bminixhofer
Owner Author

Hmm, those additional settings don't visibly change anything. But the English speed is enough to go off of, so it's not that important; I'll try it on my Mac a bit later to get a third number.

@bminixhofer
Owner Author

Alright, it turns out there really was a lot of potential for improvement: a number of easy fixes and some better precomputation lead to these results on my PC:

(base) bminixhofer@pop-os:~/Documents/Projects/nlprule$ python bench/__init__.py --lang de
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:44<00:00, 225.07it/s]
LanguageTool time: 30.531s
NLPRule time: 9.745s

n LanguageTool suggestions: 347
n NLPRule suggestions: 316
n same suggestions: 307
(base) bminixhofer@pop-os:~/Documents/Projects/nlprule$ python bench/__init__.py --lang en
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:42<00:00, 237.17it/s]
LanguageTool time: 27.771s
NLPRule time: 12.542s

n LanguageTool suggestions: 273
n NLPRule suggestions: 247
n same suggestions: 237

and on my Mac:

(base) ➜  ~/Documents/Projects/nlprule git:(master) ✗ python bench/__init__.py --lang de
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [01:21<00:00, 122.16it/s]
LanguageTool time: 55.353s
NLPRule time: 20.382s

n LanguageTool suggestions: 347
n NLPRule suggestions: 316
n same suggestions: 307
(base) ➜  ~/Documents/Projects/nlprule git:(master) ✗ python bench/__init__.py --lang en
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [01:22<00:00, 121.82it/s]
LanguageTool time: 51.841s
NLPRule time: 27.182s

n LanguageTool suggestions: 273
n NLPRule suggestions: 247
n same suggestions: 237

With a still pessimistic (because there is overhead from disambiguation / chunking too) but more realistic correction factor of 0.9, I get this table:

         NLPRule time   LanguageTool time
English  1              1.7 - 2.0
German   1              2.4 - 2.8
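The ranges can be reproduced from the PC and Mac timings in this comment (a quick check of the arithmetic, not part of the benchmark code):

```python
# Correct the NLPRule time by the 0.9 factor, then take the ratio of the
# LanguageTool time to the corrected time, for both machines.
def ratio(lt, nlprule, factor=0.9):
    return lt / (nlprule / factor)

# (LanguageTool time, NLPRule time) pairs: PC first, then Mac
en = sorted(round(ratio(*t), 1) for t in [(27.771, 12.542), (51.841, 27.182)])
de = sorted(round(ratio(*t), 1) for t in [(30.531, 9.745), (55.353, 20.382)])
print(en)  # [1.7, 2.0]
print(de)  # [2.4, 2.8]
```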

These changes are not yet released but I'll release v0.3.0 with them today. I also updated the benchmark code with your adjustments from above.

I hope this is a fair comparison now.

@danielnaber

Do you think some of these optimizations can be ported back to LT, or are they specific to your implementation? I didn't look at your code yet, so: did you basically port the existing Java algorithm or did you start from scratch?

@bminixhofer
Owner Author

I started from scratch with a bottom-up approach: incrementally adding more features and always checking that the rules using those features pass the tests. That was possible because the LT rules are really well tested and there are a lot of them, so "overfitting" to the tests was not an issue. I barely looked at your code in LT.

So it is possible that some optimizations could be ported. One cool thing I've done is treating the POS tags as a closed set and precomputing all of the matches at build time (so I never have to evaluate a POS regex at runtime), and something similar for the word regexes. Besides that, there are not many specific optimizations that stand out.
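The closed-set idea can be sketched in a few lines (illustrative Python with a made-up tagset; NLPRule itself is Rust, and its internals may differ): every POS regex is evaluated once against all known tags at build time, so runtime matching becomes a set lookup.

```python
import re

# Illustrative closed tagset; real taggers have a fixed, known set of tags.
TAGSET = ["NN", "NNS", "VB", "VBD", "JJ"]

def precompute(pattern):
    """Evaluate a POS regex once against the whole tagset at build time."""
    regex = re.compile(pattern)
    return frozenset(t for t in TAGSET if regex.fullmatch(t))

matches_noun = precompute("NN.?")  # built once, ahead of time

# At runtime, a rule only needs a set membership test, never the regex:
print("NNS" in matches_noun)  # True
print("VB" in matches_noun)   # False
```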

After looking a bit more into the differences between C++ / Rust and Java / C#, I wouldn't want to be quoted saying "Rust >> Java" in terms of speed, but I think in this case not having a GC, and Rust generally being more low-level, could play a role.

@danielnaber

I'll see if I or my colleagues find time to confirm your performance evaluation. BTW, did you run the unit tests for all languages yet?

@bminixhofer
Owner Author

Which unit tests? The tests I am running are the tags in the XML files. Others would be difficult to run because they are written in Java, right?

Also v0.3.0 is now released so feel free to try rerunning the benchmark.

@danielnaber

Which unit tests? The tests I am running are the tags in the XML files.

Yes, I meant the <example> sentences in the XML. Are you checking them for all languages?

@bminixhofer
Owner Author

Oh, I'm only checking them for German and English.

It would be good to check the other languages, but so far I haven't got around to it. It's not that easy because there is some language-specific code in LT, like compound splitting for German, which I have to take into account.

@mishushakov

You should also compare CPU and RAM usage while the task runs. LanguageTool runs comfortably on my MacBook, but hugs my 2GB + 1vCPU virtual machine to death due to high CPU usage.

I'd like to ask @danielnaber: what server specs do you have on LanguageTool.org, and how do you scale it? Do your customers share resources on a dedicated machine, or do you deploy a new machine for each customer?

In any case, I'd expect NLPRule to be more efficient on that front, but we need to do some profiling to prove it.
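One simple way to start such profiling (a sketch using only the Python standard library; Unix-only, and it measures the benchmark process itself, not a separate LT server): record the process's peak resident set size around the check loop.

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of this process in MiB (Unix only)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux but in bytes on macOS
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024

# Allocate some data to stand in for running the checkers, then report.
data = [bytearray(1024) for _ in range(1024)]  # ~1 MiB of buffers
print(f"peak RSS: {peak_rss_mb():.1f} MiB")
```

Note that this only captures the Python process; profiling a Java LT server would require measuring its separate process (e.g. via ps or /proc).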
