Add preprocess script to use words as tokens with typo and rare word reduction #132

Open: dgcrouse wants to merge 12 commits into master.
Conversation

dgcrouse

Resubmit of the previous pull request; I used the wrong branch last time.

Added a preprocess script that uses words as tokens rather than characters. It can process folders of documents and cull words based on term or document frequency to limit the impact of typos and rare words. It was created as a separate script because of its added options and complexity; while it borrows heavily from the original preprocessor, it makes some fundamental changes. The script manages the output dictionary automatically so the sampler produces proper output text directly. I'm not sure how to add this to the README, so I'm leaving that to others who are better at documentation.

The script adds the following arguments:

input_folder - Use a folder of documents (.txt files) as input rather than a single document. Overrides input_txt if supplied. Default: blank
case_sensitive - Treat case differences as separate tokens. Default: true
min_occurrences - Replace any token that occurs fewer than this many times with a wildcard token. Default: 20
min_documents - Replace any token that occurs in fewer than this many files with a wildcard token. Ignored if input_folder is not used. Default: 1
use_ascii - Ignore all non-ASCII characters when generating tokens. Default: false

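A minimal sketch of the culling idea behind these options, assuming hypothetical helper names (tokenize, cull_rare_words) and an illustrative `<UNK>` wildcard rather than the actual preprocessWords.py internals:

```python
# Sketch only: illustrates term/document-frequency culling, not the
# actual preprocessWords.py implementation.
from collections import Counter

WILDCARD = '<UNK>'  # illustrative stand-in token for culled words

def tokenize(text, case_sensitive=True, use_ascii=False):
    if not case_sensitive:
        text = text.lower()
    if use_ascii:
        text = text.encode('ascii', 'ignore').decode('ascii')
    return text.split()

def cull_rare_words(documents, min_occurrences=20, min_documents=1):
    """Replace words below either frequency threshold with WILDCARD."""
    tokenized = [tokenize(doc) for doc in documents]
    term_freq = Counter()  # total occurrences across all documents
    doc_freq = Counter()   # number of documents each word appears in
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    keep = {w for w in term_freq
            if term_freq[w] >= min_occurrences and doc_freq[w] >= min_documents}
    return [[w if w in keep else WILDCARD for w in tokens]
            for tokens in tokenized]
```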

dgcrouse commented Aug 16, 2016

Update: the feature was requested in an issue by someone else.


vi commented Aug 18, 2016

It fails for me:

Running in CPU mode
/root/torch-cl/install/bin/luajit: /root/torch-cl/install/share/lua/5.1/nn/LookupTable.lua:56: index out of range at /root/torch-cl/pkg/torch/lib/TH/generic/THTensorMath.c:156
stack traceback:
    [C]: in function 'index'
    /root/torch-cl/install/share/lua/5.1/nn/LookupTable.lua:56: in function 'updateOutput'
    /root/torch-cl/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    train.lua:130: in function 'opfunc'
    /root/torch-cl/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
    train.lua:187: in main chunk
    [C]: in function 'dofile'
    ...t/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

This happens even though there are only 42 tokens in the json file.

Files generated by the usual scripts/preprocess.py work.
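For reference, a LookupTable "index out of range" error usually means the .h5 file contains token indices outside the vocabulary stored in the .json. A hedged diagnostic sketch; the key names ('idx_to_token', 'train'/'val'/'test') and 1-based indices are assumed to match the standard torch-rnn preprocess.py output:

```python
# Diagnostic sketch (assumed file layout): verify that every index in the
# .h5 splits falls inside the vocabulary defined in the .json file.
import json
import h5py

def check_vocab_coverage(h5_path, json_path):
    with open(json_path) as f:
        vocab = json.load(f)['idx_to_token']  # assumed key name
    vocab_size = len(vocab)
    with h5py.File(h5_path, 'r') as f:
        for split in ('train', 'val', 'test'):  # assumed dataset names
            data = f[split][:]
            lo, hi = int(data.min()), int(data.max())
            # indices are assumed 1-based (Lua convention)
            assert 1 <= lo and hi <= vocab_size, \
                '%s has indices [%d, %d] outside 1..%d' % (split, lo, hi, vocab_size)
    print('all indices fit a vocabulary of size %d' % vocab_size)

check_vocab_coverage('data/my_data.h5', 'data/my_data.json')  # hypothetical paths
```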

@troubledjoe

It fails for me as well:

Running in CPU mode 
~/torch/install/bin/luajit: bad argument #2 to '?' (end index out of bound at ~/torch/pkg/torch/generic/Tensor.c:969)
stack traceback:
    [C]: at 0x0ddba220
    [C]: in function '__index'
    ./util/DataLoader.lua:35: in function '__init'
    ~/torch/install/share/lua/5.1/torch/init.lua:91: in function <~/torch/install/share/lua/5.1/torch/init.lua:87>
    [C]: in function 'DataLoader'
    train.lua:76: in main chunk
    [C]: in function 'dofile'
    ...~/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x010da8bd10

Same as vi: files generated by the usual scripts/preprocess.py work.


vi commented Aug 18, 2016

@troubledjoe, try the updated version from master: dgcrouse/torch-rnn@53caddd4fdf5f8374c8dd54deaabfcac46f7fbcd

@dgcrouse (Author)

A new fix has been pushed that should resolve all of these issues, as well as others discovered since then.

@dgcrouse (Author)

As I use the script further, both the script and the PR continue to evolve. I believe it is now time to rewrite much of the original preprocessor to bring it up to speed with the improved word-based preprocessor.

To this end, in the next few days I will push an update that unifies preprocessWords and preprocess into a single file. It should behave identically to its predecessors with the same options, except that word mode is now activated with the --use_words flag. This flag unlocks minimum occurrence/document culling and case-sensitivity adjustment (case sensitivity is always on for characters; for words it defaults to off but can be toggled on). The update also brings non-ASCII character suppression and folder input to character-based preprocessing.

Legacy preprocess script will be moved to preprocessLegacy.py.
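A rough argparse sketch of how the unified flags described above might be wired; only --use_words and the options listed earlier come from this PR, while the exact defaults and wiring here are assumptions:

```python
# Sketch of the unified command-line interface; not the actual
# preprocess.py code, just the flags described in this PR.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_txt', default='',
                    help='single input document')
parser.add_argument('--input_folder', default='',
                    help='folder of .txt files; overrides --input_txt')
parser.add_argument('--use_words', action='store_true',
                    help='tokenize by words instead of characters')
parser.add_argument('--case_sensitive', action='store_true',
                    help='treat case differences as distinct word tokens '
                         '(always on for characters, off by default for words)')
parser.add_argument('--min_occurrences', type=int, default=20,
                    help='cull words seen fewer than this many times (word mode)')
parser.add_argument('--min_documents', type=int, default=1,
                    help='cull words appearing in fewer than this many files (word mode)')
parser.add_argument('--use_ascii', action='store_true',
                    help='drop non-ASCII characters before tokenizing')
args = parser.parse_args()
```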

dgcrouse and others added 3 commits August 22, 2016 22:55
…rocess script to preprocessLegacy.py. Rewrote tokenization script entirely to support new preprocessor output. Updated Readme files.