Data: Data pre-processing related scripts and utilities.
The easiest way to set up your environment:
$ cd ~; mkdir codes; cd codes
$ git clone https://github.com/RobertMarton/Seq2SeqProject
$ cd Seq2SeqProject/data
$ ./setup_local_env.sh
This first clones the repository under ~/codes/Seq2SeqProject and then calls the setup_local_env.sh script, which retrieves the example data and preprocesses it.
The following steps are executed by setup_local_env.sh:
- Clone Seq2SeqProject repository (if not cloned already)
- Download europarl-v7.fr-en (training) and newstest2011 (development)
- Preprocess training and development sets
  - Tokenize using moses tokenizer
  - Shuffle training set for SGD
  - Build source and target dictionaries
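The shuffle and dictionary steps can be illustrated on a toy corpus. This is only a sketch of the idea, not the code in setup_local_env.sh: the file names, the `build_dict` helper, and the tab-separated `token<TAB>count` dictionary format are assumptions made for illustration, and `shuf` assumes GNU coreutils. The key point is that source and target are shuffled together so sentence pairs stay aligned.

```shell
set -eu

# Hypothetical toy corpus; real file names and sizes will differ.
workdir=$(mktemp -d)
printf 'le chat dort\nle chien mange\nun chat mange\n' > "$workdir/train.fr"
printf 'the cat sleeps\nthe dog eats\na cat eats\n' > "$workdir/train.en"

# Shuffle both sides together: join each pair with a tab,
# shuffle whole lines, then split the columns back into two files.
paste "$workdir/train.fr" "$workdir/train.en" | shuf > "$workdir/train.shuf.tsv"
cut -f1 "$workdir/train.shuf.tsv" > "$workdir/train.shuf.fr"
cut -f2 "$workdir/train.shuf.tsv" > "$workdir/train.shuf.en"

# Build a dictionary: one token per line, count occurrences,
# sort by descending count (ties broken alphabetically).
build_dict() {
    tr -s ' ' '\n' < "$1" | sort | uniq -c \
        | LC_ALL=C sort -k1,1nr -k2,2 \
        | awk '{print $2 "\t" $1}'
}
build_dict "$workdir/train.shuf.fr" > "$workdir/dict.fr"
build_dict "$workdir/train.shuf.en" > "$workdir/dict.en"
```

Shuffling each side independently would break the sentence alignment, which is why the two files are pasted into one before calling `shuf`.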
If you want to use subword units (e.g., Byte Pair Encoding) for source and target tokens, simply call:
$ ./setup_local_env.sh -b
This replaces the third step above and executes the following steps:
- Clone Seq2SeqProject repository (if not cloned already)
- Download europarl-v7.fr-en (training) and newstest2011 (development)
- Preprocess training and development sets (preprocess.sh)
  - Tokenize source and target side of all bitext
  - Learn BPE-codes for both source and target side using training sets
  - Encode source and target side using the learned codes
  - Shuffle training set for SGD
  - Build source and target dictionaries
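The learn and encode steps might look roughly like the sketch below. It assumes the `subword-nmt` command-line tool (with its `learn-bpe -s` and `apply-bpe -c` subcommands) is installed; preprocess.sh may use a different tool or different settings. The `BPE_TOOL` variable, the `bpe_encode_side` helper, and the `train.tok.LANG` / `dev.tok.LANG` file naming are hypothetical and exist only for this illustration.

```shell
# Assumption: subword-nmt is on PATH; override BPE_TOOL to use another wrapper.
BPE_TOOL=${BPE_TOOL:-subword-nmt}

# bpe_encode_side DIR LANG NUM_MERGES  (hypothetical helper)
# Expects DIR/train.tok.LANG and DIR/dev.tok.LANG to exist already.
bpe_encode_side() {
    local dir=$1 lang=$2 merges=$3 split

    # Learn the merge operations from the training side only.
    "$BPE_TOOL" learn-bpe -s "$merges" \
        < "$dir/train.tok.$lang" > "$dir/codes.$lang" || return 1

    # Apply the same learned codes to both training and development data.
    for split in train dev; do
        "$BPE_TOOL" apply-bpe -c "$dir/codes.$lang" \
            < "$dir/$split.tok.$lang" > "$dir/$split.bpe.$lang" || return 1
    done
}
```

A call such as `bpe_encode_side data fr 20000` followed by `bpe_encode_side data en 20000` would then produce separate codes for each side; the number of merge operations here is an arbitrary example value. Codes are learned from the training set only, so the development set never influences the segmentation.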
To preprocess your own data using BPE, you can use the preprocess.sh script directly.
For usage and further details, please check the comments in the scripts.