Seq2SeqAttentionProject

Data: scripts and utilities for data pre-processing.

Setup

The easiest way to set up your environment:

$ cd ~; mkdir codes; cd codes
$ git clone https://github.com/RobertMarton/Seq2SeqProject
$ cd Seq2SeqProject/data
$ ./setup_local_env.sh

This first clones the repository under ~/codes/Seq2SeqProject and then runs the setup_local_env.sh script, which retrieves the example data and preprocesses it.

Pre-processing

The following steps are executed by setup_local_env.sh:

  1. Clone Seq2SeqProject repository (if not cloned already)
  2. Download europarl-v7.fr-en (training) and newstest2011 (development)
  3. Preprocess training and development sets
     • Tokenize using the Moses tokenizer
     • Shuffle the training set for SGD
     • Build source and target dictionaries
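
The shuffling and dictionary-building steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code that setup_local_env.sh actually runs: the function names, the fixed seed, and the reserved symbols `<eos>` and `<unk>` are hypothetical. The key point is that source and target lines must be shuffled with the same permutation so that sentence pairs stay aligned.

```python
import random
from collections import Counter


def shuffle_parallel(src_lines, tgt_lines, seed=1234):
    """Shuffle two aligned lists of sentences with one shared permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)  # seed is an arbitrary illustrative value
    return [src_lines[i] for i in order], [tgt_lines[i] for i in order]


def build_dictionary(lines, specials=("<eos>", "<unk>")):
    """Map tokens to integer ids: reserved symbols first, then by frequency."""
    counts = Counter(tok for line in lines for tok in line.split())
    # Descending frequency; ties are broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in ordered:
        vocab.setdefault(tok, len(vocab))
    return vocab


# Toy, already tokenized bitext (hypothetical data).
src = ["the house is small", "the cat sleeps", "a small cat"]
tgt = ["la maison est petite", "le chat dort", "un petit chat"]

shuf_src, shuf_tgt = shuffle_parallel(src, tgt)
src_vocab = build_dictionary(shuf_src)
tgt_vocab = build_dictionary(shuf_tgt)
```

Each side gets its own dictionary, matching the "source and target dictionaries" step; whether the real scripts reserve special symbols or cap the vocabulary size is not stated here.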

Pre-processing with subword-units

If you want to use subword units (e.g. Byte Pair Encoding) for source and target tokens, simply call:

$ ./setup_local_env.sh -b

This replaces the third step above, so the script executes the following steps:

  1. Clone Seq2SeqProject repository (if not cloned already)
  2. Download europarl-v7.fr-en (training) and newstest2011 (development)
  3. Preprocess training and development sets (preprocess.sh)
     • Tokenize the source and target sides of all bitext
     • Learn BPE codes for both the source and target sides using the training sets
     • Encode the source and target sides using the learned codes
     • Shuffle the training set for SGD
     • Build source and target dictionaries
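
To make the "learn BPE codes" and "encode" steps concrete, here is a minimal sketch of Byte Pair Encoding in Python. It is not taken from preprocess.sh, which may rely on a different implementation; the function names and the `</w>` end-of-word marker are assumptions made for illustration. Learning repeatedly merges the most frequent adjacent symbol pair in the training vocabulary, and encoding replays those merges, in order, on each word.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(lines, num_merges):
    """Learn up to `num_merges` merge operations from whitespace-tokenized lines."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    words = Counter()
    for line in lines:
        for word in line.split():
            words[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode_word(word, merges):
    """Segment one word by applying the learned merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training text (hypothetical data).
merges = learn_bpe(["low low lower"], num_merges=2)
segments = encode_word("lowest", merges)
```

With this toy input the two learned merges are `l o` and `lo w`, so the unseen word "lowest" is segmented into the known unit `low` followed by single characters. The codes are learned on the training sets only and then applied to every set.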

If you want to preprocess your own data using BPE, you can use the preprocess.sh script directly.

For usage and more details, please check the comments in the scripts.

About

Seq2SeqAttentionProject is a machine translation development project.
