Partially replicating 'A Fast and Accurate Dependency Parser Using Neural Networks' by Danqi Chen and Christopher Manning, and conducting a few experiments.
Converts CoNLL data (train and dev) into features of the parser configuration paired with parser decisions: it takes in a dependency tree and, using shift-reduce parsing, determines the parser actions that alter the parser configuration, from which the feature set is extracted (a sketch of the oracle follows the flag list below).
- `-f` data files (default: `train.orig.conll dev.orig.conll`)
- `-trans` transition system (default: `std` for arc-standard; other option: `eager`)
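For intuition, here is a minimal sketch of an arc-standard oracle, i.e. how gold transitions can be derived from a gold tree during conversion. The function and variable names are illustrative assumptions, not the repo's actual API:

```python
def oracle_step(stack, buffer, heads, n_attached):
    """Return the next gold arc-standard transition (illustrative sketch).

    stack and buffer hold token ids; heads[i] is the gold head of token i;
    n_attached[i] counts dependents of i already attached by earlier steps.
    """
    if len(stack) >= 2:
        s1, s2 = stack[-1], stack[-2]   # top and second-from-top of the stack
        if heads[s2] == s1:
            return "LEFT-ARC"           # attach s2 <- s1, pop s2
        # RIGHT-ARC only once s1 has collected all of its own dependents;
        # otherwise they could never be attached afterwards.
        if heads[s1] == s2 and n_attached[s1] == sum(1 for h in heads if h == s1):
            return "RIGHT-ARC"          # attach s2 -> s1, pop s1
    return "SHIFT"                      # move the next buffer token onto the stack
```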
prepare_data.py writes the data into a CSV file named `<input filename before the dot>.converted`, with 49 columns of information based on the following tokens:
[ 's_1', 's_2', 's_3', 'b_1', 'b_2', 'b_3', 'lc_1(s_1)', 'rc_1(s_1)', 'lc_2(s_1)', 'rc_2(s_1)', 'lc_1(s_2)', 'rc_1(s_2)', 'lc_2(s_2)', 'rc_2(s_2)', 'lc_1(lc_1(s_1))', 'rc_1(rc_1(s_1))', 'lc_1(lc_1(s_2))', 'rc_1(rc_1(s_2))' ]
where, given a sentence:

- `s_i` corresponds to the *i*-th element (token) on the stack
- `b_i` corresponds to the *i*-th element (token) on the buffer
- `lc_i(x)` corresponds to the *i*-th left child of element `x`
- `rc_i(x)` corresponds to the *i*-th right child of element `x`
- if any of these tokens is missing, a `NULL` token is placed instead
The 49 columns consist, accordingly, of: 18 columns titled exactly as in the notation above, containing the tokens' words themselves; another 18 titled similarly but prefixed with `pos`, containing the POS tags of those selected tokens; 12 containing the arc labels of the selected tokens, excluding the first 6 parent tokens (on top of the stack and the buffer); and finally 1 column containing the label of the configuration, formatted as `TRANSITION_TYPE(ARC_DEPENDENCY)`.
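As an illustration of the column layout, a hypothetical sketch of assembling one such 49-column row; the `slots` mapping and helper are assumptions, not the repo's code:

```python
# Illustrative sketch of building one 49-column feature row.
TOKEN_SLOTS = ['s_1', 's_2', 's_3', 'b_1', 'b_2', 'b_3',
               'lc_1(s_1)', 'rc_1(s_1)', 'lc_2(s_1)', 'rc_2(s_1)',
               'lc_1(s_2)', 'rc_1(s_2)', 'lc_2(s_2)', 'rc_2(s_2)',
               'lc_1(lc_1(s_1))', 'rc_1(rc_1(s_1))',
               'lc_1(lc_1(s_2))', 'rc_1(rc_1(s_2))']

def feature_row(slots, transition, arc_label):
    """slots: hypothetical dict mapping a slot name to a (word, pos, label)
    tuple resolved from the current configuration, or None when empty."""
    def col(slot, field):
        tok = slots.get(slot)
        return tok[field] if tok else 'NULL'          # NULL for empty slots
    words  = [col(s, 0) for s in TOKEN_SLOTS]         # 18 word columns
    pos    = [col(s, 1) for s in TOKEN_SLOTS]         # 18 POS columns (`pos`-prefixed in the CSV)
    labels = [col(s, 2) for s in TOKEN_SLOTS[6:]]     # 12 arc-label columns
    return words + pos + labels + [f"{transition}({arc_label})"]  # 49 columns total
```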
train.py trains a model on data preprocessed by prepare_data.py and writes a model file `train.model`, including vocab data (a sketch of the network follows the flag list below).
- `-t` training file (default: `train.converted`)
- `-d` validation (dev) file (default: `dev.converted`)
- `-E` word embedding dimension (default: `50`)
- `-e` number of epochs (default: `10`)
- `-u` number of hidden units (default: `200`)
- `-lr` learning rate (default: `0.01`)
- `-reg` regularization amount (default: `1e-5`)
- `-batch` mini-batch size (default: `256`)
- `-o` model filepath to be written (default: `train.model`)
- `-emb_w_init` embedding weights random normal scaling (default: `0.01`)
- `-gpu` use GPU (default: `True`)
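The classifier follows Chen and Manning's feed-forward architecture: embeddings of the 48 input features (18 words + 18 POS tags + 12 arc labels) are concatenated, passed through a hidden layer with a cube activation, and scored over transitions. A minimal PyTorch sketch under those assumptions (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class ParserMLP(nn.Module):
    """Minimal sketch of the Chen & Manning feed-forward parser."""

    def __init__(self, vocab_size, n_classes, emb_dim=50, n_feats=48, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(n_feats * emb_dim, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, feats):                 # feats: (batch, 48) feature ids
        x = self.embed(feats).flatten(1)      # (batch, 48 * emb_dim)
        h = self.hidden(x) ** 3               # cube activation from the paper
        return self.out(h)                    # unnormalized transition scores
```

In training this would typically be paired with cross-entropy loss (which applies the softmax) and L2 weight decay corresponding to the `-reg` flag.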
Given a trained model file (and possibly a vocabulary file), reads in CoNLL data and writes CoNLL data where fields 7 and 8 contain the dependency tree info (a sketch of the decoding loop follows the flag list below).
- `-m` model filepath (default: `train.model`)
- `-i` input CoNLL filepath (default: `parse.in`)
- `-o` output CoNLL filepath (default: `parse.out`)
- `-verbose` show progress bar (default: `False`)
- `-dropb` whether to drop blocking elements while transiting (default: `True`)
- `-trans` transition system (default: `std` for arc-standard; other option: `eager`)
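At parse time the model is applied greedily: at each step the highest-scoring transition that is legal in the current configuration is executed, until the configuration is terminal. A sketch, assuming a hypothetical `config` interface rather than the repo's actual classes:

```python
import torch

def greedy_parse(model, config, id2transition):
    """Sketch of greedy decoding; the `config` methods are hypothetical."""
    model.eval()
    with torch.no_grad():
        while not config.is_terminal():
            feats = torch.tensor([config.features()])   # (1, 48) feature ids
            scores = model(feats).squeeze(0)
            # Try transitions best-first, skipping ones that are illegal
            # in the current configuration (e.g. SHIFT with an empty buffer).
            for idx in scores.argsort(descending=True):
                trans = id2transition[int(idx)]
                if config.is_legal(trans):
                    config.apply(trans)
                    break
    return config.arcs()
```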
```
EXEC_FILE=train.py   # or EXEC_FILE=train-torch.py
python $EXEC_FILE -u $HIDDEN_UNITS -l $LEARNING_RATE -f $MAX_SEQUENCE_LENGTH -b $MINI_BATCH_SIZE -e $NUM_EPOCHS -E $EMBEDDING_FILE -i $DATASET -o $OUT_MODEL_FILE -w $WEIGHTS_INIT -d $DEBUG_FILE
```
- `-m` model filename (either starting with `pytorch` or without)
- `-i` test dataset relative filepath
- `-o` output (inference) desired relative filepath
```
EXEC_FILE=train.py   # or EXEC_FILE=train-torch.py
python $EXEC_FILE -m nb.4dim.model -i 4dim.sample.txt -o 4dim.out.txt
```