Add preprocess script to use words as tokens with typo and rare word reduction #132

Open: dgcrouse wants to merge 12 commits into master.
Conversation

dgcrouse

Resubmit of the previous pull request; I used the wrong branch last time.

Added a preprocess script that uses words as tokens rather than characters. It can process folders of documents and cull words based on term or document frequency to limit the impact of typos and rare words. It was created as a separate script because of its added options and complexity; while it borrows heavily from the original preprocessor, it makes some fundamental changes. The script manages the output dictionary automatically so the sampler produces proper output text directly. I'm not sure how to add this to the README, so I'm leaving that to others who are better at documentation.

The script adds the following arguments:

input_folder - Use a folder of documents (.txt files) as input rather than a single document. Overrides input_txt if supplied. Default: blank
case_sensitive - Treat case differences as separate tokens. Default: true
min_occurrences - Replace any token that occurs fewer than this many times with a wildcard token. Default: 20
min_documents - Replace any token that occurs in fewer than this many files with a wildcard token. Ignored if input_folder is not used. Default: 1
use_ascii - Ignore all non-ASCII characters when generating tokens. Default: false

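A minimal sketch of the culling idea behind these options, assuming hypothetical helper names (tokenize, cull_rare_words) and an illustrative `<UNK>` wildcard rather than the actual preprocessWords.py internals:

```python
# Sketch only: illustrates term/document-frequency culling, not the
# actual preprocessWords.py implementation.
from collections import Counter

WILDCARD = '<UNK>'  # illustrative stand-in token for culled words

def tokenize(text, case_sensitive=True, use_ascii=False):
    if not case_sensitive:
        text = text.lower()
    if use_ascii:
        text = text.encode('ascii', 'ignore').decode('ascii')
    return text.split()

def cull_rare_words(documents, min_occurrences=20, min_documents=1):
    """Replace words below either frequency threshold with WILDCARD."""
    tokenized = [tokenize(doc) for doc in documents]
    term_freq = Counter()  # total occurrences across all documents
    doc_freq = Counter()   # number of documents each word appears in
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    keep = {w for w in term_freq
            if term_freq[w] >= min_occurrences and doc_freq[w] >= min_documents}
    return [[w if w in keep else WILDCARD for w in tokens]
            for tokens in tokenized]
```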

dgcrouse commented Aug 16, 2016

Update: the feature was requested in an issue by someone else.


vi commented Aug 18, 2016

It fails for me:

Running in CPU mode
/root/torch-cl/install/bin/luajit: /root/torch-cl/install/share/lua/5.1/nn/LookupTable.lua:56: index out of range at /root/torch-cl/pkg/torch/lib/TH/generic/THTensorMath.c:156
stack traceback:
    [C]: in function 'index'
    /root/torch-cl/install/share/lua/5.1/nn/LookupTable.lua:56: in function 'updateOutput'
    /root/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:130: in function 'opfunc'
    /root/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:187: in main chunk
    [C]: in function 'dofile'
    ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

This happens even though there are only 42 tokens in the json file.

Files generated by the usual scripts/preprocess.py work.
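For reference, a LookupTable "index out of range" error usually means the .h5 file contains token indices outside the vocabulary stored in the .json. A hedged diagnostic sketch; the key names ('idx_to_token', 'train'/'val'/'test') and 1-based indices are assumed to match the standard torch-rnn preprocess.py output:

```python
# Diagnostic sketch (assumed file layout): verify that every index in the
# .h5 splits falls inside the vocabulary defined in the .json file.
import json
import h5py

def check_vocab_coverage(h5_path, json_path):
    with open(json_path) as f:
        vocab = json.load(f)['idx_to_token']  # assumed key name
    vocab_size = len(vocab)
    with h5py.File(h5_path, 'r') as f:
        for split in ('train', 'val', 'test'):  # assumed dataset names
            data = f[split][:]
            lo, hi = int(data.min()), int(data.max())
            # indices are assumed 1-based (Lua convention)
            assert 1 <= lo and hi <= vocab_size, \
                '%s has indices [%d, %d] outside 1..%d' % (split, lo, hi, vocab_size)
    print('all indices fit a vocabulary of size %d' % vocab_size)

check_vocab_coverage('data/my_data.h5', 'data/my_data.json')  # hypothetical paths
```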

@troubledjoe

It fails for me as well:

Running in CPU mode 
~/torch/install/bin/luajit: bad argument #2 to '?' (end index out of bound at ~/torch/pkg/torch/generic/Tensor.c:969)
stack traceback:
    [C]: at 0x0ddba220
    [C]: in function '__index'
    ./util/DataLoader.lua:35: in function '__init'
    ~/torch/install/share/lua/5.1/torch/init.lua:91: in function <~/torch/install/share/lua/5.1/torch/init.lua:87>
    [C]: in function 'DataLoader'
    train.lua:76: in main chunk
    [C]: in function 'dofile'
    ...~/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010da8bd10

Same as vi: files generated by the usual scripts/preprocess.py work.


vi commented Aug 18, 2016

@troubledjoe, try the updated version from master: dgcrouse/torch-rnn@53caddd4fdf5f8374c8dd54deaabfcac46f7fbcd

@dgcrouse (Author)

A new fix has been pushed that should resolve all of these issues, as well as others discovered since then.

@dgcrouse (Author)

As I use the script further, both the script and the PR continue to evolve. I believe it is now time to rewrite much of the original preprocessor to bring it up to speed with the improved word-based preprocessor.

To this end, in the next few days I will push an update that unifies preprocessWords and preprocess into a single file. It should behave identically to its predecessors with the same options, except that word mode is now activated with the --use_words flag. This flag unlocks minimum occurrence/document culling and case-sensitivity adjustment (case sensitivity is always on for characters; for words it defaults to off but can be toggled on). The update also brings non-ASCII character suppression and folder input to character-based preprocessing.

Legacy preprocess script will be moved to preprocessLegacy.py.
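A rough argparse sketch of how the unified flags described above might be wired; only --use_words and the options listed earlier come from this PR, while the exact defaults and wiring here are assumptions:

```python
# Sketch of the unified command-line interface; not the actual
# preprocess.py code, just the flags described in this PR.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_txt', default='',
                    help='single input document')
parser.add_argument('--input_folder', default='',
                    help='folder of .txt files; overrides --input_txt')
parser.add_argument('--use_words', action='store_true',
                    help='tokenize by words instead of characters')
parser.add_argument('--case_sensitive', action='store_true',
                    help='treat case differences as distinct word tokens '
                         '(always on for characters, off by default for words)')
parser.add_argument('--min_occurrences', type=int, default=20,
                    help='cull words seen fewer than this many times (word mode)')
parser.add_argument('--min_documents', type=int, default=1,
                    help='cull words appearing in fewer than this many files (word mode)')
parser.add_argument('--use_ascii', action='store_true',
                    help='drop non-ASCII characters before tokenizing')
args = parser.parse_args()
```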

dgcrouse and others added 3 commits August 22, 2016 22:55
…rocess script to preprocessLegacy.py. Rewrote tokenization script entirely to support new preprocessor output. Updated Readme files.