New or Old? Exploring How Pre-Trained Language Models Represent Discourse Entities
The corpus can be acquired through the LDC. The experiments are based on the news sections of the corpus, meaning that the pears (narratives) and trains (spoken dialog) sections are discarded. Some parts of ARRAU are also available through the ARRAU corpus GitHub. We kept the official split used in the CRAC 2018 shared task.
Extract data from the ARRAU corpus
python3 extractor.py PATH/TO/CORPUS/ [--min_words]
- `PATH/TO/CORPUS/` needs to contain the annotation directories `[train|dev|test]/MMAX`.
- `--min_words`: only head annotations are extracted; otherwise, full spans are extracted.
The script creates output files under `data/arrau_processed/[heads|spans]/[train|dev|test]/PATH_TO_CORPUS_filename`. Output files are in CoNLL format with the columns:

`ID TOKEN POS POS-baseline-label BIO-label entity-ID BIO-old/new-label`
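For reference, the extracted files can be read back in with a few lines of Python. The following is a minimal sketch (not part of the repository) that assumes whitespace-separated columns in the order above and blank lines between sentences, as is usual for CoNLL-style files:

```python
# Minimal sketch (assumptions: whitespace-separated columns in the order
# above, blank lines between sentences as is usual for CoNLL-style files).
COLUMNS = ["id", "token", "pos", "pos_baseline", "bio", "entity_id", "old_new"]

def read_conll(path):
    """Return a list of sentences, each a list of per-token dicts."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:              # blank line closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            current.append(dict(zip(COLUMNS, line.split())))
    if current:
        sentences.append(current)
    return sentences
```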
`extract_pretrained_hidden.py` extracts the contextualized hidden representations from a pre-trained model. Transformer-XL defines a Moses-based tokenizer, while GPT-2 uses a BPE tokenizer, so the data also has to be pre-processed to align the labels with the tokenized sequence.
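As an illustration of this alignment step, the sketch below copies each word's label onto all of its BPE sub-tokens. It assumes the HuggingFace transformers library and GPT-2's tokenizer, uses made-up example labels, and is not the repository's exact code:

```python
# Minimal sketch (assumptions: HuggingFace transformers installed, GPT-2's
# BPE tokenizer; not the repository's exact code). Each word's label is
# copied onto every sub-token it is split into.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def align_labels(words, labels):
    sub_tokens, sub_labels = [], []
    for i, (word, label) in enumerate(zip(words, labels)):
        # GPT-2's BPE is whitespace-sensitive: prefix a space to all words
        # except the first so the pieces match tokenization of running text.
        pieces = tokenizer.tokenize(word if i == 0 else " " + word)
        sub_tokens.extend(pieces)
        sub_labels.extend([label] * len(pieces))
    return sub_tokens, sub_labels

# hypothetical example labels, for illustration only
print(align_labels(["The", "president", "visited", "Washington"],
                   ["B-new", "I-new", "O", "B-old"]))
```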
python3 extract_pretrained_hidden.py
--data_in data/arrau_processed/[heads|spans]
--model [transfo-xl-wt103|gpt2]
--data_out_hidden data/hidden_states/[heads|spans]
--data_out_tokens data/tokenized/[heads|spans]
It expects the subdirectories `train`, `dev`, and `test` to live under the path specified for `data_in`, and it creates analogous model directories under `data_out_hidden` and `data_out_tokens`.
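For reference, extracting per-token hidden states from a pre-trained model typically looks like the sketch below, assuming the HuggingFace transformers library and PyTorch (shown for GPT-2; Transformer-XL works analogously in transformers releases that still ship it). This is not the repository's exact code:

```python
# Minimal sketch (assumptions: HuggingFace transformers + PyTorch installed;
# not the repository's exact code). Extract per-token hidden states from GPT-2.
import torch
from transformers import GPT2TokenizerFast, GPT2Model

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

inputs = tokenizer("The president visited Washington .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: embedding output plus one tensor per layer,
# each of shape (batch, sequence_length, hidden_size)
last_layer = outputs.hidden_states[-1]
print(last_layer.shape)
```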
The experiments were tracked using comet.ml:
python3 entity_classifier.py configs/CONFIG_FILE &> logs/CONFIG_FILE.log
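`entity_classifier.py` takes a single config file as its argument. A minimal sketch of reading such a file, assuming it is plain YAML and PyYAML is installed (not necessarily the repository's own loader):

```python
# Minimal sketch (assumptions: configs are plain YAML, PyYAML installed;
# not necessarily the repository's own loader).
import sys
import yaml

with open(sys.argv[1]) as f:
    config = yaml.safe_load(f)

# dispatch on the "mode" field, as the example configs below suggest
if config["mode"] == "train":
    print("training with", config.get("pre_trained_model"))
else:
    print("testing model at", config.get("model"))
```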
Create training and testing configs similar to the ones below:
#### Example training config file
#comet data
# set ignore comet to false and add credentials for using comet.ml
ignore_comet: true
comet_key: ""
comet_project: ""
comet_workspace: ""
#training params
mode: train
pre_trained_model: transfo-xl-wt103
train_data: ../data/tokenized/arrau_processed/heads/train
val_data: ../data/tokenized/arrau_processed/heads/dev
train_hidden: ../data/hidden_states/arrau_processed/heads/train
val_hidden: ../data/hidden_states/arrau_processed/heads/dev
model: ../entity_classifier/heads/
batch_size: 64
epochs: 20
learning_rate: 0.001
patience: 7
ablation: false
- `mode`: determines whether train or test mode is started
- `pre_trained_model`: specifies whether to use pre-trained embeddings (`transfo-xl-wt103`, `gpt2`, or `fasttext`) or to use the baseline (`null`)
- `train_data`/`val_data`: path to the training and validation data, respectively
- `train_hidden`/`val_hidden`: path to the pre-extracted contextualized representations corresponding to the training and validation data, respectively (if `transfo-xl-wt103`, else `null`)
- `model`: specifies the folder where the model will be stored. The best-performing model will be stored as `model` in this directory
- `ablation`: specifies whether to use attention over the context (`false`) or to classify based only on the entity representation (`true`)
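To make the ablation setting concrete, the PyTorch sketch below contrasts the two variants: attending over the context representations given the entity representation versus classifying from the entity representation alone. It is an illustration under assumed dimensions, not the repository's architecture:

```python
# Illustrative sketch only (PyTorch, assumed dimensions): the two settings
# behind the "ablation" flag -- attention over context vs. entity-only.
import torch
import torch.nn as nn

class EntityClassifier(nn.Module):
    def __init__(self, dim=1024, num_classes=2, ablation=False):
        super().__init__()
        self.ablation = ablation
        self.attn = nn.Linear(dim, dim)          # scores context against the entity
        self.out = nn.Linear(dim if ablation else 2 * dim, num_classes)

    def forward(self, entity, context):
        # entity: (batch, dim); context: (batch, ctx_len, dim)
        if self.ablation:                         # ablation: true -> entity only
            return self.out(entity)
        scores = torch.bmm(context, self.attn(entity).unsqueeze(2))  # (batch, ctx_len, 1)
        weights = torch.softmax(scores, dim=1)
        pooled = (weights * context).sum(dim=1)   # attention-weighted context
        return self.out(torch.cat([entity, pooled], dim=1))

# toy usage: 4 entities, 10 context tokens, 1024-dim representations
clf = EntityClassifier(ablation=False)
logits = clf(torch.randn(4, 1024), torch.randn(4, 10, 1024))
print(logits.shape)  # (4, 2)
```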
#### Example testing config file
#comet data
# set ignore comet to false and add credentials for using comet.ml
ignore_comet: true
comet_key: ""
comet_project: ""
comet_workspace: ""
#test params
mode: test
pre_trained_model: transfo-xl-wt103
test_data: ../data/tokenized/arrau_processed/heads/test
test_hidden: ../data/hidden_states/arrau_processed/heads/test
model: ../entity_classifier/heads/model
- `mode`: determines whether train or test mode is started
- `pre_trained_model`: specifies whether to use pre-trained embeddings (`transfo-xl-wt103` or `fasttext`) or to use the baseline (`null`)
- `test_data`: path to the test data
- `test_hidden`: path to the pre-extracted contextualized representations corresponding to the test data (if `transfo-xl-wt103`, else `null`)
- `model`: specifies the path to the model to be tested
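If `ignore_comet` is set to `false`, the `comet_key`, `comet_project`, and `comet_workspace` fields correspond to the arguments of comet.ml's `Experiment` object; a minimal sketch of how these credentials could be used (the repository's exact logging calls may differ):

```python
# Minimal sketch (assumptions: the comet_ml package is installed and the
# config fields above hold valid credentials).
from comet_ml import Experiment

experiment = Experiment(
    api_key="YOUR_COMET_KEY",        # comet_key
    project_name="YOUR_PROJECT",     # comet_project
    workspace="YOUR_WORKSPACE",      # comet_workspace
)
experiment.log_parameters({"batch_size": 64, "learning_rate": 0.001})
experiment.log_metric("val_accuracy", 0.0, step=1)
```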
python3 run_seq_labelling.py configs/CONFIG_FILE &> logs/CONFIG_FILE.log
Create training and testing configs similar to the ones below:
#### Example training config file
mode: train
seed: 456889
pre_trained: true
contextualized: true
lstm: true
gold_column: -1
data: /data/tokenized/arrau_pos/heads/
model: /arrau_pos/heads/456889/contextualized/
hidden_vectors: /data/hidden_states/heads/
fixed_vectors: /data/fixed_vectors/
ft_model: cc.en.300.bin
dropout: 0.2
batch_size: 64
hidden: 256
embed_size: 1024
epochs: 50
lr: 0.01
- `mode`: determines whether train or test mode is started
- `seed`: fixed value to use as the random seed
- `pre_trained`: use `true` if training with pre-trained Transformer embeddings or fastText; `false` if training embeddings from scratch
- `contextualized`: use `true` if training with Transformer embeddings, otherwise `false`
- `lstm`: specifies whether to use a uni-directional LSTM layer between the embeddings and the CRF
- `gold_column`: number of the column containing the gold label
- `data`: expects the subdirectories `train` and `dev` to live under the specified path and contain the respective data
- `model`: in the training config, specifies the folder where the model will be stored. The actual model file for the testing config can be extracted from the script output once training is done (last line in the .log file)
- `hidden_vectors`: specifies the path to the hidden vectors previously extracted from a pre-trained model with `extract_pretrained_hidden.py`
- `fixed_vectors`: path to static fastText embeddings
- `ft_model`: specific fastText model to use
- `dropout`: dropout value
- `batch_size`: batch size value
- `hidden`: hidden layer size value
- `embed_size`: embedding size; use 1024 for Transformer-XL contextualized embeddings and 300 for fastText embeddings
- `epochs`: maximum number of epochs (early stopping is applied in all cases)
- `lr`: learning rate value
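To give a concrete picture of the labeller these options configure (embeddings, an optional uni-directional LSTM, and a CRF on top), here is an illustrative sketch using PyTorch and the pytorch-crf package; it is not the repository's exact model:

```python
# Illustrative sketch only (assumptions: PyTorch + the pytorch-crf package);
# embeddings -> optional uni-directional LSTM -> CRF, not the repo's code.
import torch
import torch.nn as nn
from torchcrf import CRF

class LstmCrfTagger(nn.Module):
    def __init__(self, embed_size, hidden, num_tags, use_lstm=True, dropout=0.2):
        super().__init__()
        self.use_lstm = use_lstm
        self.lstm = nn.LSTM(embed_size, hidden, batch_first=True)  # uni-directional
        self.dropout = nn.Dropout(dropout)
        out_dim = hidden if use_lstm else embed_size
        self.emissions = nn.Linear(out_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, embeddings, tags=None, mask=None):
        x = embeddings
        if self.use_lstm:
            x, _ = self.lstm(x)
        scores = self.emissions(self.dropout(x))
        if tags is not None:                       # training: negative log-likelihood
            return -self.crf(scores, tags, mask=mask)
        return self.crf.decode(scores, mask=mask)  # testing: best tag sequences

# toy usage: 2 sequences of length 5, 1024-dim embeddings, 3 tags
model = LstmCrfTagger(embed_size=1024, hidden=256, num_tags=3)
emb = torch.randn(2, 5, 1024)
tags = torch.randint(0, 3, (2, 5))
print(model(emb, tags).item(), model(emb))
```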
#### Example testing config file
mode: test
seed: 456889
pre_trained: true
contextualized: false
lstm: true
gold_column: -1
data: /data/tokenized/arrau_pos/heads/
hidden_vectors: /data/hidden_states/arrau_pos/heads/
fixed_vectors: /data/fixed_vectors/
ft_model: cc.en.300.bin
model: /crf_lstm/arrau_pos/heads/456889/fixed/Epo_6_see_456889_bat_64_hid_256_los_7.1916
log_dir: /crf_lstm/arrau_pos/heads/456889/fixed/
embed_size: 300
- `mode`: determines whether train or test mode is started
- `seed`: fixed value to use as the random seed
- `pre_trained`: use `true` if the model was trained with pre-trained Transformer embeddings or fastText; `false` if its embeddings were trained from scratch (has to match the training config)
- `contextualized`: use `true` if the model uses Transformer embeddings, otherwise `false`
- `lstm`: specifies whether a uni-directional LSTM layer is used between the embeddings and the CRF (has to match the training config)
- `gold_column`: number of the column containing the gold label
- `data`: expects the subdirectory `test` to live under the specified path and contain the respective data
- `hidden_vectors`: specifies the path to the hidden vectors previously extracted from a pre-trained model with `extract_pretrained_hidden.py` (needed if `pre_trained` is `true`)
- `model`: specifies the path to the model to be tested
- `log_dir`: specifies where the file `model_predictions.log` will be created for checking the model output on the test data. It contains the following, separated by newlines:
  - `sequence`
  - `gold_labels`
  - `model_prediction`
- `embed_size`: embedding size; use 1024 for Transformer-XL contextualized embeddings and 300 for fastText embeddings
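A small helper for reading such a log could look like the sketch below; it assumes each example occupies three consecutive non-empty lines (sequence, gold labels, model prediction), which may differ from the actual layout:

```python
# Sketch only: assumes each example occupies three consecutive non-empty
# lines in model_predictions.log (sequence, gold labels, model prediction);
# the actual layout may differ.
def read_predictions(path):
    with open(path) as f:
        lines = [l.rstrip("\n") for l in f if l.strip()]
    return [tuple(lines[i:i + 3]) for i in range(0, len(lines), 3)
            if i + 3 <= len(lines)]

for sequence, gold, pred in read_predictions("model_predictions.log"):
    if gold != pred:
        print(sequence)
        print("gold:", gold)
        print("pred:", pred)
```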
To compare the outputs of different models (e.g. baseline vs. Transformer-XL):

python3 labelling_analysis.py PATH/TO/model_predictions.log_1 PATH/TO/model_predictions.log_2

This prints the samples where the predictions differ (one token per line) for manual inspection.