Update dictionaries (#78)
* first cut at a build dictionary script
* additional english dictionary cleanup; add exclude file support
* update dictionaries
* update readme
* document build_dictionary.py
barrust authored Dec 30, 2020
1 parent 5341170 commit c855714
Showing 23 changed files with 653 additions and 42 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,9 @@
# pyspellchecker

## Version 0.5.6
* ***NOTE:*** Last planned support for **Python 2.7**
* All dictionaries updated using the `scripts/build_dictionary.py` script

## Version 0.5.5
* Remove `encode` from the call to `json.loads()`

@@ -19,7 +23,7 @@ Deterministic order to corrections [#47](https://github.com/barrust/pyspellcheck
## Version 0.5.0
* Add tokenizer to the Spell object
* Add Support for local dictionaries to be case sensitive
[see PR #44](https://github.com/barrust/pyspellchecker/pull/44) Thanks [@davido-brainlabs ](https://github.com/davido-brainlabs)
[see PR #44](https://github.com/barrust/pyspellchecker/pull/44) Thanks [@davido-brainlabs](https://github.com/davido-brainlabs)
* Better python 2.7 support for reading gzipped files

## Version 0.4.0
25 changes: 14 additions & 11 deletions README.rst
@@ -33,13 +33,12 @@ list. Those words that are found more often in the frequency list are
**more likely** the correct results.

``pyspellchecker`` supports multiple languages including English, Spanish,
German, French, and Portuguese. Dictionaries were generated using
the `WordFrequency project <https://github.com/hermitdave/FrequencyWords>`__ on GitHub.
German, French, and Portuguese. For information on how the dictionaries were created and how they can be updated and improved, please see the **Dictionary Creation and Updating** section of the readme!

``pyspellchecker`` supports **Python 3** and Python 2.7 but, as always, Python 3
``pyspellchecker`` supports **Python 3** and **Python 2.7** but, as always, Python 3
is the preferred version!

``pyspellchecker`` allows for the setting of the Levenshtein Distance to check.
``pyspellchecker`` allows for the setting of the Levenshtein Distance (up to two) to check.
For longer words, it is highly recommended to use a distance of 1 and not the
default 2. See the quickstart to find how one can change the distance parameter.
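For illustration, a minimal example of setting the distance up front and changing it afterwards, using the ``distance`` parameter and attribute referenced in the quickstart:

```python
from spellchecker import SpellChecker

# Use an edit distance of 1 when checking longer words; the default of 2
# generates far more candidate edits and is noticeably slower.
spell = SpellChecker(distance=1)
print(spell.correction("acheive"))

# The distance can also be changed on an existing instance.
spell.distance = 2
```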

@@ -61,10 +60,6 @@ To install from source:
cd pyspellchecker
python setup.py install
As always, I highly recommend using the
`Pipenv <https://github.com/pypa/pipenv>`__ package to help manage
dependencies!

Quickstart
-------------------------------------------------------------------------------
@@ -89,7 +84,6 @@ forward:
print(spell.candidates(word))
If the Word Frequency list is not to your liking, you can add additional
text to generate a more appropriate list for your use case.
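As a small sketch of that, using the ``word_frequency`` loaders (the file path below is only an example):

```python
from spellchecker import SpellChecker

spell = SpellChecker()

# Pull additional words from a domain-specific text document.
spell.word_frequency.load_text_file('./my_free_text_doc.txt')

# Or make sure a handful of specific terms are never flagged as misspelled.
spell.word_frequency.load_words(['microsoft', 'apple', 'google'])
print(spell.known(['microsoft', 'google']))  # both are now known words
```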

@@ -120,6 +114,16 @@ check class or after the fact.
spell.distance = 2 # set the distance parameter back to the default
Dictionary Creation and Updating
-------------------------------------------------------------------------------

The creation of the dictionaries is, unfortunately, not an exact science. I have provided a script that, given a text file of sentences (in this case from
`OpenSubtitles <http://opus.nlpl.eu/OpenSubtitles2018.php>`__), will generate a word frequency list based on the words found within the text. The script then attempts to **clean up** the word frequency list by, for example, removing words with invalid characters (usually from other languages), removing low-count terms (likely misspellings), and enforcing language-specific rules where available (no more than one accent per word in Spanish). Finally, it removes any words found in the language's *exclude* file.

The script can be found here: ``scripts/build_dictionary.py``. The original word frequency list parsed from OpenSubtitles can be found in the ``scripts/data/`` folder along with each language's *exclude* text file.

Any help in updating and maintaining the dictionaries would be greatly appreciated. To help, start a discussion on GitHub or open a pull request that updates a language's exclude file. Ideas for adding missing words, along with a sensible relative frequency, are in the works for future versions of the dictionaries.
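To make the cleanup steps above concrete, here is a rough, self-contained sketch of that kind of pipeline; the thresholds, alphabet, and file names are illustrative assumptions and are not taken from ``scripts/build_dictionary.py`` itself:

```python
# Rough sketch of the cleanup described above; MIN_COUNT, the character
# set, and the file names are illustrative assumptions only.
import json
import string
from collections import Counter

MIN_COUNT = 15                             # assumed cutoff for rare (likely misspelled) terms
VALID_CHARS = set(string.ascii_lowercase)  # English-only example alphabet


def clean_frequency(word_counts, exclude_words):
    """Apply the cleanup rules to a raw word/count mapping."""
    cleaned = {}
    for word, count in word_counts.items():
        if count < MIN_COUNT:
            continue  # drop low-count terms
        if set(word) - VALID_CHARS:
            continue  # drop words with characters outside the language
        if word in exclude_words:
            continue  # drop words listed in the exclude file
        cleaned[word] = count
    return cleaned


# Hypothetical input files: a sentence dump and a per-language exclude list.
counts = Counter()
with open("sentences_en.txt") as fobj:
    for line in fobj:
        counts.update(line.lower().split())

with open("exclude_en.txt") as fobj:
    exclude = {line.strip() for line in fobj if line.strip()}

with open("en_frequency.json", "w") as fobj:
    json.dump(clean_frequency(counts, exclude), fobj)
```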


Additional Methods
-------------------------------------------------------------------------------
@@ -156,5 +160,4 @@ Credits
-------------------------------------------------------------------------------

* `Peter Norvig <https://norvig.com/spell-correct.html>`__ blog post on setting up a simple spell checking algorithm

* `hermitdave's WordFrequency project <https://github.com/hermitdave/FrequencyWords>`__ for providing the basis for the non-English dictionaries
* P Lison and J Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
1 change: 1 addition & 0 deletions codecov.yml
@@ -34,3 +34,4 @@ comment:
ignore:
- "./tests/"
- "setup.py"
- "./scripts/"