Implementation on bigdata #132
Unanswered · St-mlengineer asked this question in Q&A · 1 comment, 1 reply

I am trying to implement SymSpell on a large dataset consisting of text strings. The UDF runs indefinitely on a Databricks cluster. Is there a way to make it faster?

-
SymSpell can process about 10,000 words per second on a single processor core, and up to 100,000 words per second with multithreading. Of course, throughput depends on which SymSpell port/language you are using: Python is known for its slow execution times as an interpreted language, so for big data you should use C#, C++, Java, or Rust instead of Python. Also, make sure that LoadDictionary() is called only once, when you are initializing SymSpell, and NOT every single time before you do a Lookup() spelling correction.
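Concretely, that means constructing the SymSpell object and loading the dictionary a single time at startup, then reusing that one instance for every lookup. A minimal sketch, assuming the symspellpy Python port and the English frequency dictionary it bundles (substitute your own dictionary file as needed):

```python
import pkg_resources
from symspellpy import SymSpell, Verbosity

# Build the index ONCE at startup: load_dictionary() is the expensive step,
# since it constructs the delete-candidate index for the whole dictionary.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# Lookups against the prebuilt index are cheap and can be repeated freely.
for word in ("memebers", "helo", "quikc"):
    suggestions = sym_spell.lookup(
        word, Verbosity.CLOSEST, max_edit_distance=2, include_unknown=True
    )
    print(word, "->", suggestions[0].term)
```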
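For the Databricks case in the question, the same rule has to hold inside the UDF: the dictionary must be built once per executor process, not once per row, or the job will spend essentially all of its time reloading the dictionary. One way to do that is a lazily initialized singleton inside a Spark pandas UDF; the sketch below assumes symspellpy and Spark 3.x, and the column and function names are illustrative, not from the thread:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from symspellpy import SymSpell, Verbosity

_sym_spell = None  # one instance per executor process, not one per row


def _get_sym_spell():
    # Lazy singleton: build the index the first time a worker needs it,
    # then reuse it for every subsequent batch on that worker.
    global _sym_spell
    if _sym_spell is None:
        import pkg_resources
        _sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
        path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt"
        )
        _sym_spell.load_dictionary(path, term_index=0, count_index=1)
    return _sym_spell


@pandas_udf("string")
def correct_udf(words: pd.Series) -> pd.Series:
    sym_spell = _get_sym_spell()

    def fix(word: str) -> str:
        hits = sym_spell.lookup(
            word, Verbosity.CLOSEST, max_edit_distance=2, include_unknown=True
        )
        return hits[0].term

    return words.map(fix)


# Hypothetical usage on a DataFrame with a "text" column:
# df = df.withColumn("corrected", correct_udf(df["text"]))
```

A pandas (vectorized) UDF also moves data in Arrow batches rather than row by row, which cuts the per-row Python serialization overhead that makes plain udf() calls slow on large datasets.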