Add preprocess script to use words as tokens with typo and rare word reduction #132
base: master
Conversation
Script adds the following arguments:
input_folder - Use a folder of documents as input (.txt only) rather than a single document. Overrides the input_txt value.
case_sensitive - Treat case differences as separate tokens. Default true
min_occurrences - Replace a token that occurs fewer than this many times with a wildcard token. Default 20
min_documents - Replace a token that occurs in fewer than this many files with a wildcard token. Ignored if input_folder is not used. Default 1
use_ascii - Ignore all non-ASCII characters when generating tokens
Update: this feature was requested in an issue by someone else.
It fails for me:
even if there are only 42 tokens in the JSON file. Files generated by the usual
… and made more robust.
Failed.
Same as vi,
@troubledjoe, try the updated version from master: dgcrouse/torch-rnn@53caddd4fdf5f8374c8dd54deaabfcac46f7fbcd
…o fixed Unicode support
New fix pushed; it should fix all of these issues and others discovered since then.
…capped rather than floored.
…ing, use the python script scripts/tokenizeWords.py to generate a JSON and then feed that into sample.lua as the start_tokens option
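For context, the idea is roughly the following. This is a minimal sketch, not the actual scripts/tokenizeWords.py; the vocabulary file path, the "token_to_idx" field, and the wildcard key are assumptions about the preprocessor output.

```python
# Sketch of the tokenize-then-sample flow (assumed file format, not the real script):
# look up each word of a priming string in the vocabulary written by the preprocessor
# and dump the resulting token indices as JSON for sample.lua's start_tokens option.
import json

with open('data/my_dataset.json') as f:              # vocabulary file produced by preprocessing (example path)
    token_to_idx = json.load(f)['token_to_idx']      # field name is an assumption

prime = "the quick brown fox"
wildcard = token_to_idx.get('<UNK>')                 # wildcard token key is an assumption
start_tokens = [token_to_idx.get(w, wildcard) for w in prime.split()]

with open('start_tokens.json', 'w') as f:
    json.dump(start_tokens, f)
```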
As I use the script further, the script and the PR evolve. I believe it is now time to rewrite much of the original preprocessor to bring it up to speed with my improved word-based preprocessor. To this end, in the next few days I will push an update that unifies preprocessWords and preprocess into the same file. They should work identically to their predecessors with the same options, except that word mode is now activated using the --use_words flag. Using this flag unlocks minimum occurrence/document culling and case-sensitivity adjustment (always on for characters; off by default for words but can be toggled on). This will also bring non-ASCII character suppression and folder input to character-based preprocessing. The legacy preprocess script will be moved to preprocessLegacy.py.
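As a rough illustration of what that unified interface could look like: the flag names below are taken from this PR, but the defaults and boolean handling are assumptions, not the final code.

```python
# Sketch of a unified preprocess argument set (illustrative only).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_txt', default='')
parser.add_argument('--input_folder', default='')             # if supplied, overrides --input_txt
parser.add_argument('--use_words', action='store_true')       # word tokens instead of characters
parser.add_argument('--case_sensitive', action='store_true')  # only meaningful in word mode
parser.add_argument('--min_occurrences', type=int, default=20)
parser.add_argument('--min_documents', type=int, default=1)
parser.add_argument('--use_ascii', action='store_true')       # drop non-ASCII characters
args = parser.parse_args()
print(args)
```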
…rocess script to preprocessLegacy.py. Rewrote tokenization script entirely to support new preprocessor output. Updated Readme files.
Resubmit of previous pull request because I used the wrong branch last time.
Added a preprocess script that uses words as tokens rather than characters. It can process folders of documents and cull words based on term or document frequency to limit typos. It was created as a separate script because of its added options and complexity; while it borrows a lot from the original preprocessor, it makes some fundamental changes. The script automatically manages the output dictionary to produce proper output text directly from the sampler. Not sure how to add this to the Readme, so I'm leaving that to others who are better at documentation.
The script adds the following arguments:
input_folder - Use a folder of documents as input (.txt file extension) rather than a single document. Overrides the input_txt value if supplied. Default blank
case_sensitive - Treat case differences as separate tokens. Default true
min_occurrences - Replace a token that occurs fewer than this many times with a wildcard token (see the sketch below). Default 20
min_documents - Replace a token that occurs in fewer than this many files with a wildcard token. Ignored if input_folder is not used. Default 1
use_ascii - Ignore all non-ASCII characters when generating tokens. Default false
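The culling behind min_occurrences and min_documents boils down to something like the sketch below. This is a minimal illustration under assumed names (such as the <UNK> wildcard); the PR's actual implementation may differ.

```python
# Minimal sketch of rare-word culling: tokens below the term-frequency or
# document-frequency thresholds are replaced with a single wildcard token.
from collections import Counter

WILDCARD = '<UNK>'  # assumed wildcard name

def cull_tokens(documents, min_occurrences=20, min_documents=1):
    term_freq = Counter()   # how many times each token occurs overall
    doc_freq = Counter()    # how many documents each token occurs in
    for doc in documents:   # each document is a list of word tokens
        term_freq.update(doc)
        doc_freq.update(set(doc))
    keep = {t for t in term_freq
            if term_freq[t] >= min_occurrences and doc_freq[t] >= min_documents}
    return [[t if t in keep else WILDCARD for t in doc] for doc in documents]
```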