Seq2SeqAttentionProject

Data: Scripts and utilities for data pre-processing.

Setup

The easiest way to set up your environment:

$ cd ~; mkdir codes; cd codes
$ git clone https://github.com/RobertMarton/Seq2SeqProject
$ cd Seq2SeqProject/data
$ ./setup_local_env.sh

This first clones the repository under ~/codes/Seq2SeqProject and then runs the setup_local_env.sh script, which retrieves the example data and preprocesses it.

Pre-processing

The following steps are executed by setup_local_env.sh:

1. Clone the Seq2SeqProject repository (if not cloned already)
2. Download europarl-v7.fr-en (training) and newstest2011 (development)
3. Preprocess the training and development sets:
   - Tokenize using the Moses tokenizer
   - Shuffle the training set for SGD
   - Build source and target dictionaries
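
The shuffling and dictionary steps can be sketched with standard command-line tools. This is a minimal illustration, not the code in setup_local_env.sh: the file names, the build_dict helper, and the tab-separated dictionary format are assumptions made for the example, and shuf is assumed to be available (GNU coreutils). The key point is that the source and target sides must be shuffled with the same permutation so that sentence pairs stay aligned.

```shell
#!/bin/sh
# Sketch only: all file names below are hypothetical, not the ones the
# repository scripts use.
set -eu

workdir=$(mktemp -d)
src="$workdir/train.tok.fr"   # hypothetical tokenized source side
trg="$workdir/train.tok.en"   # hypothetical tokenized target side

# Tiny stand-in corpus so the sketch runs on its own.
printf '%s\n' 'le chat dort' 'le chien mange' 'un chat mange' > "$src"
printf '%s\n' 'the cat sleeps' 'the dog eats' 'a cat eats' > "$trg"

# Shuffle both sides with the same permutation: join each sentence pair
# with a tab, shuffle the joined lines, then split them apart again.
paste "$src" "$trg" | shuf > "$workdir/train.shuf.tsv"
cut -f1 "$workdir/train.shuf.tsv" > "$src.shuf"
cut -f2 "$workdir/train.shuf.tsv" > "$trg.shuf"

# Build a dictionary: one token per line, count occurrences, then sort by
# descending frequency. Output format (assumed): token<TAB>count.
build_dict() {
    tr -s ' ' '\n' < "$1" | sort | uniq -c | sort -k1,1nr -k2,2 |
        awk '{ print $2 "\t" $1 }'
}
build_dict "$src.shuf" > "$src.dict"
build_dict "$trg.shuf" > "$trg.dict"
```

Joining the two sides with paste before shuffling avoids having to share a random seed between two separate shuffles. It assumes that no sentence contains a tab character.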

Pre-processing with subword-units

If you want to use subword units (e.g. Byte Pair Encoding) for the source and target tokens, simply call:

$ ./setup_local_env.sh -b

This replaces the third step above, so that the following steps are executed:

1. Clone the Seq2SeqProject repository (if not cloned already)
2. Download europarl-v7.fr-en (training) and newstest2011 (development)
3. Preprocess the training and development sets (preprocess.sh):
   - Tokenize the source and target sides of all bitext
   - Learn BPE codes for both the source and target sides using the training sets
   - Encode the source and target sides using the learned codes
   - Shuffle the training set for SGD
   - Build source and target dictionaries
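
To make the step of learning BPE codes concrete, here is a toy learner written with awk. It is a sketch of the general byte pair encoding idea (repeatedly merge the most frequent adjacent symbol pair), not the implementation that preprocess.sh calls. The learn_bpe function name, the </w> end-of-word marker, and the alphabetical tie-breaking rule are assumptions made for illustration.

```shell
#!/bin/sh
# Toy BPE learner (illustrative sketch, not the repository code).
set -eu

# Usage (hypothetical helper): learn_bpe NUM_MERGES < tokenized_text
# Prints one learned merge per line, as "left right".
learn_bpe() {
    tr -s ' ' '\n' | sort | uniq -c | awk -v merges="$1" '
    NF == 2 {
        # Store each word as space-separated characters plus an end marker,
        # weighted by how often the word occurs.
        word = ""
        for (i = 1; i <= length($2); i++) word = word substr($2, i, 1) " "
        vocab[word "</w>"] += $1
    }
    END {
        for (m = 1; m <= merges; m++) {
            # Count every adjacent symbol pair across the vocabulary.
            split("", pairs)
            for (w in vocab) {
                n = split(w, sym, " ")
                for (i = 1; i < n; i++) pairs[sym[i] " " sym[i + 1]] += vocab[w]
            }
            # Pick the most frequent pair; break ties alphabetically.
            best = ""; bestcount = 0
            for (p in pairs)
                if (pairs[p] > bestcount || (pairs[p] == bestcount && p < best)) {
                    best = p; bestcount = pairs[p]
                }
            if (bestcount == 0) break
            print best
            # Apply the merge to every word, left to right.
            split(best, ab, " ")
            split("", merged)
            for (w in vocab) {
                n = split(w, sym, " ")
                out = ""
                for (i = 1; i <= n; i++) {
                    if (i < n && sym[i] == ab[1] && sym[i + 1] == ab[2]) {
                        piece = ab[1] ab[2]; i++
                    } else {
                        piece = sym[i]
                    }
                    out = (out == "" ? piece : out " " piece)
                }
                merged[out] += vocab[w]
            }
            split("", vocab)
            for (w in merged) vocab[w] = merged[w]
        }
    }'
}

# Example: learn three merges from a one-line corpus.
printf 'low low low lower\n' | learn_bpe 3
```

On this input the sketch prints the merges "l o", "lo w", and "low </w>", in that order. Encoding then amounts to applying the learned merges to new text in the same order. The character splitting relies on substr, so text outside ASCII needs an awk that handles multibyte characters in the current locale.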

If you want to preprocess your own data using BPE, you can use the preprocess.sh script directly.

For usage and more details, please check the comments in the scripts.
