Follow the instructions here to obtain the original Extreme Summarization dataset. You should have six files, {train,validation,test}.{document,summary}.
vocab_size=10000
position_markers=1000
export LC_ALL=C
cat train.document train.summary |
tr -s '[:space:]' '\n' |
sort |
uniq -c |
sort -k1,1bnr -k2 |
head -n "$((vocab_size - 4))" |
awk '{ print $2 " " $1 }' >dict.pg.txt
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt
This creates the file dict.pg.txt that contains the 10k most frequent words, followed by 1k source position markers:
the 4954867
. 4157552
, 3439668
to 2212159
a 1916857
of 1916820
and 1823350
...
<unk-0> 0
<unk-1> 0
<unk-2> 0
<unk-3> 0
<unk-4> 0
...
./preprocess.py --source train.document --target train.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out train.pg.src --target-out train.pg.tgt
./preprocess.py --source validation.document --target validation.summary --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out valid.pg.src --target-out valid.pg.tgt
./preprocess.py --source test.document --vocab <(cut -d' ' -f1 dict.pg.txt) --source-out test.pg.src
The data should now contain <unk-N>
tokens in place of out-of-vocabulary words.
fairseq-preprocess \
--source-lang src \
--target-lang tgt \
--trainpref train.pg \
--validpref valid.pg \
--destdir bin \
--workers 60 \
--srcdict dict.pg.txt \
--joined-dictionary
total_updates=20000
warmup_updates=500
lr=0.001
max_tokens=4096
update_freq=4
pointer_layer=-2
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train bin \
--user-dir examples/pointer_generator/src \
--max-tokens "$max_tokens" \
--task translation \
--source-lang src --target-lang tgt \
--truncate-source \
--layernorm-embedding \
--share-all-embeddings \
--encoder-normalize-before \
--decoder-normalize-before \
--required-batch-size-multiple 1 \
--arch transformer_pointer_generator \
--alignment-layer "$pointer_layer" \
--alignment-heads 1 \
--source-position-markers 1000 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
--clip-norm 0.1 \
--lr-scheduler inverse_sqrt --lr "$lr" --max-update "$total_updates" --warmup-updates "$warmup_updates" \
--update-freq "$update_freq" \
--skip-invalid-size-inputs-valid-test
Above we specify that our dictionary contains 1000 source position markers, and
that we want to use one attention head from the penultimate decoder layer for
pointing. It should run in 5.5 hours on one node with eight 32GB V100 GPUs. The
logged messages confirm that dictionary indices above 10000 will be mapped to
the <unk>
embedding:
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [src] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | [tgt] dictionary: 11000 types
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.src
2020-09-24 20:43:53 | INFO | fairseq.data.data_utils | loaded 11332 examples from: bin/valid.src-tgt.tgt
2020-09-24 20:43:53 | INFO | fairseq.tasks.translation | bin valid src-tgt 11332 examples
2020-09-24 20:43:53 | INFO | fairseq.models.transformer_pg | dictionary indices from 10000 to 10999 will be mapped to 3
batch_size=32
beam_size=6
max_length=60
length_penalty=1.0
fairseq-interactive bin \
--user-dir examples/pointer_generator/src \
--batch-size "$batch_size" \
--task translation \
--source-lang src --target-lang tgt \
--path checkpoints/checkpoint_last.pt \
--input test.pg.src \
--buffer-size 200 \
--max-len-a 0 \
--max-len-b "$max_length" \
--lenpen "$length_penalty" \
--beam "$beam_size" \
--skip-invalid-size-inputs-valid-test |
tee generate.out
grep ^H generate.out | cut -f 3- >generate.hyp
Now you should have the generated sequences in generate.hyp
. They contain
<unk-N>
tokens that the model has copied from the source sequence. In order to
retrieve the original words, we need the unprocessed source sequences from
test.document
.
Since we skipped too long inputs when producing generate.hyp
, we also have to
skip too long sequences now that we read test.document
.
./postprocess.py \
--source <(awk 'NF<1024' test.document) \
--target generate.hyp \
--target-out generate.hyp.processed
Now you'll find the final sequences from generate.hyp.processed
, with
<unk-N>
replaced with the original word from the source sequence.
The original source document in test.document
:
de roon moved to teesside in june 2016 for an initial # 8.8 m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .
The preprocessed source document in test.src.pg
:
de <unk-1> moved to <unk-4> in june 2016 for an initial # <unk-12> m fee and played 33 premier league games last term . the netherlands international , 26 , scored five goals in 36 league and cup games during his spell at boro . meanwhile , manager garry monk confirmed the championship club 's interest in signing chelsea midfielder lewis baker . `` he 's a target and one of many that we 've had throughout the summer months , '' said monk . find all the latest football transfers on our dedicated page .
The generated summary in generate.hyp
:
middlesbrough striker <unk> de <unk-1> has joined spanish side <unk> on a season-long loan .
The generated summary after postprocessing in generate.hyp.processed
:
middlesbrough striker <unk> de roon has joined spanish side <unk> on a season-long loan .