Published at the Workshop on Insights from Negative Results in NLP (EMNLP 2021): arxiv.org/abs/2110.01518
Much of recent progress in NLU was shown to be due to models' learning dataset-specific heuristics. We conduct a case study of generalization in NLI (from MNLI to the adversarially constructed HANS dataset) in a range of BERT-based architectures (adapters, Siamese Transformers, HEX debiasing), as well as with subsampling the data and increasing the model size. We report 2 successful and 3 unsuccessful strategies, all providing insights into how Transformer-based models learn to generalize.
If you use this work, code, or models in any form, please consider citing the paper:
@misc{bhargava2021generalization,
title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
year={2021},
eprint={2110.01518},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Get the main pre-trained/fine-tuned models used in this work from the Hugging Face Model Hub:
BERTs:
"prajjwal1/bert-tiny"
Model Link"prajjwal1/bert-mini"
Model Link"prajjwal1/bert-small"
Model Link"prajjwal1/bert-medium"
Model Link
Pretrained LMs finetuned on NLI:
"prajjwal1/albert-base-v2-mnli"
"prajjwal1/bert-tiny-mnli"
"prajjwal1/bert-small-mnli"
"bert-medium-mnli"
"bert-mini-mnli"
"albert-base-v1-mnli"
"roberta-base-mnli"
"roberta-large-mnli"
With the current code, the pre-trained models are meant to be used with `GlueDataset`, because the label ordering ("entailment", "contradiction", "neutral") differs in the MNLI version obtained from the `datasets` library. To use the `datasets` library instead, use `new_run_glue.py`.
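If you are not sure which label order a given checkpoint expects, a quick sanity check is to compare its `label2id` mapping with the label names in the `datasets` version of MNLI. This is only an illustrative sketch (the checkpoint name is one of the models above, and it assumes the config stores human-readable label names rather than LABEL_0/LABEL_1/...):

from datasets import load_dataset
from transformers import AutoConfig

# Label names as defined in the `datasets` version of MNLI
mnli = load_dataset("glue", "mnli", split="validation_matched")
print(mnli.features["label"].names)   # typically ['entailment', 'neutral', 'contradiction']

# Label order a fine-tuned checkpoint expects
config = AutoConfig.from_pretrained("prajjwal1/bert-small-mnli")
print(config.id2label)

def align_labels(example):
    """Map a `datasets` label id to the checkpoint's label id by name
    (only meaningful if the config stores real label names)."""
    name = mnli.features["label"].int2str(example["label"])
    example["label"] = config.label2id[name]
    return example

aligned = mnli.map(align_labels)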
To use these models:
from transformers import AutoModel # For BERTs
from transformers import AutoModelForSequenceClassification # For models fine-tuned on MNLI
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/MODEL_NAME") # v1 and v2
model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/MODEL_NAME") # v1 and v2
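For example, a quick inference sketch (the checkpoint name and the sentence pair are illustrative; keep in mind the label-order caveat above):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "prajjwal1/bert-small-mnli"   # any of the fine-tuned checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Illustrative premise/hypothesis pair (a lexical-overlap case in the spirit of HANS)
premise = "The doctor visited the lawyer."
hypothesis = "The lawyer visited the doctor."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
pred = int(logits.argmax(dim=-1))
print(pred, model.config.id2label.get(pred, pred))   # interpret with the label-order caveat above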
Most of this codebase also works with an older version of `transformers` (v3.3), which used the `GlueDataset` class instead of the `datasets` library.
To execute any command, first set these variables:
export TASK_NAME=MNLI
export GLUE_DIR=/path/to/glue_directory/
python3 run_glue.py --model_name_or_path MODEL_NAME --task_name $TASK_NAME --do_train --do_eval \
--data_dir $GLUE_DIR/$TASK_NAME/ --max_seq_length 128 --per_gpu_train_batch_size 512 \
--learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /home/nlp/experiments/seed_0/100_pct \
--fp16 --seed 0
The following runs with `transformers` v4.1 and uses the `datasets` library:
python3 new_run_glue.py \
--model_name_or_path ~/experiments/again_bert_base_ft \
--save_strategy epoch \
--evaluation_strategy epoch \
--task_name $TASK_NAME \
--do_eval \
--max_seq_length 96 \
--per_device_eval_batch_size 256 \
--per_device_train_batch_size 512 \
--learning_rate 2e-5 \
--num_train_epochs 8.0 \
--output_dir ~/experiments/again_bert_base_ft \
--overwrite_output_dir
To extract embeddings (used later for clustering), run:
python3 get_embeddings.py --model_name_or_path bert-base-uncased --task_name $TASK_NAME --data_dir $GLUE_DIR/$TASK_NAME --max_seq_len 128 --per_device_train_batch_size 512 --output_dir /home/nlp/experiments/
For subsampling, specify the data percentage with the `data_pct` argument.
python3 subsampling_mnli.py --model_name_or_path bert-large-uncased --task_name $TASK_NAME --do_eval --data_dir /home/nlp/data/glue_data/$TASK_NAME/ --max_seq_length 80 --per_device_train_batch_size 256 --learning_rate 2e-5 --num_train_epochs 50.0 --output_dir /home/nlp/experiments/overfitting/bert_large --fp16 --data_pct 0.01 --do_train --evaluate_during_training --eval_steps 8 --overwrite_output_dir --per_device_eval_batch_size 512
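The script does the subsampling itself; purely for intuition, randomly keeping a `data_pct` fraction of MNLI with the `datasets` library could look like this (the seed and the interpretation of `data_pct` as a fraction are assumptions, not necessarily what `subsampling_mnli.py` does):

from datasets import load_dataset

data_pct = 0.01  # interpreted here as a fraction of the training set
train = load_dataset("glue", "mnli", split="train").shuffle(seed=0)
subset = train.select(range(int(len(train) * data_pct)))
print(f"Keeping {len(subset)} of {len(train)} MNLI training examples")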
For performing clustering, there are multiple ways to go about it.
If you're running clustering for the first time, you need to store clustering results first. Specify the following arguments:
- `batch_size`: batch size to be used for clustering
- `num_clusters`: how many clusters you want
- `embedding_path`: embeddings to be used for clustering
- `cluster_output_path`: where to store the clustering output

Optional:
- `cluster_only`: run only clustering
- `random_state`: for deterministic clustering results
- `cluster_n_jobs`: number of parallel processes to run for clustering

After storing the clustering results, you can load them directly with `cluster_input_path`.
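As a rough sketch of the clustering step (using scikit-learn's MiniBatchKMeans; the file names, the .pth format, and the exact settings are assumptions rather than what `train_clustering.py` necessarily does):

import torch
from sklearn.cluster import MiniBatchKMeans

# embedding_path: tensor of sentence embeddings saved by get_embeddings.py (path is a placeholder)
embeddings = torch.load("embeddings.pth").numpy()

kmeans = MiniBatchKMeans(
    n_clusters=100,    # num_clusters
    batch_size=1024,   # batch_size
    random_state=0,    # random_state, for deterministic results
)
labels = kmeans.fit_predict(embeddings)

# cluster_output_path: reload later via cluster_input_path
torch.save(
    {"labels": torch.tensor(labels), "centers": torch.tensor(kmeans.cluster_centers_)},
    "cluster_output.pth",
)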
Here are the arguments relevant to the paper:
- `data_pct`: percentage of data to eventually use; can be combined with other strategies
- `diverse_stream_only`: for diverse subsampling; extracts the most diverse data based on the clustering output
- `centroid_elements_only`: use only the cluster centroids for training

Other arguments:
- `cluster_data_pct`: use only the specified percentage of data from clustering
- `num_clusters_elements`: use the specified number of clusters
python3 train_clustering.py --model_name_or_path bert-base-uncased --task_name $TASK_NAME --do_train --do_eval --data_dir $GLUE_DIR/$TASK_NAME --max_seq_length 128 --per_device_train_batch_size 384 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /home/nlp/experiments/clustering/diverse_stream/100_pct --fp16 --use_diverse_stream --data_pct 100 --cluster_input_path /home/nlp/experiments/cluster_output.pth --per_device_eval_batch_size 384 --overwrite_output_dir
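For intuition about `centroid_elements_only`, selecting the example nearest to each cluster center might look roughly like this (it assumes the clustering output format from the sketch above; the repo's actual selection logic may differ):

import numpy as np
import torch

embeddings = torch.load("embeddings.pth").numpy()
clusters = torch.load("cluster_output.pth")        # what --cluster_input_path points to
labels = clusters["labels"].numpy()
centers = clusters["centers"].numpy()

# Keep the single example closest to each cluster center
centroid_indices = []
for c in range(centers.shape[0]):
    members = np.where(labels == c)[0]
    if len(members) == 0:
        continue
    dists = np.linalg.norm(embeddings[members] - centers[c], axis=1)
    centroid_indices.append(int(members[dists.argmin()]))

print(f"Selected {len(centroid_indices)} centroid examples for training")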
To use the HEX projection debiasing method, run:
python3 train_orthogonal.py --model_name_or_path bert-base-uncased --task_name $TASK_NAME --data_dir $GLUE_DIR/$TASK_NAME --max_seq_len 80 --output_dir /home/nlp/experiments/orthogonal/trials/ --dataloader_drop_last --evaluate_during_training --eval_steps 25 --lamb 0.0001 --per_device_train_batch_size 384 --per_device_eval_batch_size 384 --do_train --do_eval
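For background, HEX-style debiasing (Wang et al., 2019) projects the main representation onto the orthogonal complement of a "superficial" (biased) representation. The sketch below shows only that projection; how `train_orthogonal.py` constructs the two feature matrices and uses `--lamb` is not reproduced here:

import torch

def hex_project(f_a: torch.Tensor, f_g: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Return F_L = (I - F_G (F_G^T F_G)^{-1} F_G^T) F_A, i.e. the part of the main
    features F_A that is orthogonal to the bias features F_G (both [batch, dim])."""
    gram = f_g.T @ f_g + eps * torch.eye(f_g.shape[1], device=f_g.device)
    proj = f_g @ torch.linalg.solve(gram, f_g.T)   # F_G (F_G^T F_G)^{-1} F_G^T
    return f_a - proj @ f_a

# Toy usage with random features
f_a, f_g = torch.randn(8, 16), torch.randn(8, 16)
f_l = hex_project(f_a, f_g)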
To see how bias grows with very little data, use:
- `save_steps`: when to save the model weights
- `eval_steps`: when to perform evaluation
- `max_sample_limit`: how many samples to take
python3 reptile_few_shot.py --model_name_or_path bert-base-uncased --do_train --do_eval --max_seq_length 80 --per_device_train_batch_size 2 --learning_rate 2e-5 --output_dir /home/nlp/experiments/meta/bert_base/ --per_device_eval_batch_size 4096 --data_dir /home/nlp/data/glue_data/MNLI --task_name mnli --eval_steps=100 --save_steps=1024 --num_train_epochs=1 --max_sample_limit 2048 --step_size=2e-5
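The script name suggests Reptile-style meta-learning (Nichol et al., 2018). Purely for orientation, the Reptile outer update is sketched below, with `step_size` playing the role of the `--step_size` flag; how the repo builds the inner tasks from MNLI is not shown here and is an assumption:

import copy
import torch

def reptile_outer_step(model, inner_loss_fn, inner_steps=5, inner_lr=2e-5, step_size=2e-5):
    """One Reptile meta-update: adapt a copy of the model on a task, then move the
    original parameters a fraction `step_size` toward the adapted parameters."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        inner_loss_fn(adapted).backward()   # inner_loss_fn is a hypothetical task-loss callable
        opt.step()
    with torch.no_grad():
        for p, p_task in zip(model.parameters(), adapted.parameters()):
            p.add_(step_size * (p_task - p))   # theta <- theta + eps * (theta_task - theta)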
To load fine-tuned weights, you need to pass `model_weights_path`, because the weights are loaded directly in PyTorch fashion (`torch.load`) rather than through `AutoModel`.
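A minimal sketch of that loading path (the checkpoint filename and the assumption that the weights were saved as a plain `state_dict` are illustrative; the Siamese/HEX model classes in this repo may wrap BERT differently):

import torch
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("bert-base-uncased", num_labels=3)
model = AutoModelForSequenceClassification.from_config(config)

# model_weights_path points at weights saved with torch.save, not at a HF checkpoint directory
state_dict = torch.load("model_weights_path/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)   # strict=False tolerates head/name mismatches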
python3 train_siamese.py --model_name bert-base-uncased --task_name $TASK_NAME --do_eval --data_dir /home/nlp/data/glue_data/$TASK_NAME/ --max_seq_length 60 --per_device_train_batch_size 512 --learning_rate 2e-5 --num_train_epochs 1.0 --output_dir /home/nlp/experiments/siamese/bert_base --per_device_eval_batch_size 512 --do_train --fp16 --config_name bert-base-uncased --tokenizer_name bert-base-uncased
For adapter training, you'll need to switch from HF `transformers` to Adapter Transformers v2.1. This installation may cause issues with the rest of the codebase, because it relies on an older version of `transformers`. It is recommended to either create a new environment, or to reinstall `transformers` when you want to switch back and stop using Adapter Transformers.
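For example (the version pin and the use of a virtual environment are assumptions; check the AdapterHub release notes for the version matching this code):
python3 -m venv adapters-env; source adapters-env/bin/activate
pip install "adapter-transformers>=2.1,<2.2"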
The `train_adapter` flag freezes the encoder and trains only the adapter:
python3 train_adapter.py \
--model_name_or_path MODEL_NAME \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 512 \
--learning_rate 1e-4 \
--num_train_epochs 1.0 \
--output_dir OUTPUT_DIR \
--train_adapter \
--adapter_config pfeiffer
Loading adapter weights
python3 train_adapter.py \
--model_name_or_path bert-base-uncased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 415 \
--learning_rate 1e-4 \
--num_train_epochs 1.0 \
--output_dir OUTPUT_DIR \
--train_adapter \
--adapter_config pfeiffer --overwrite_output_dir --load_task_adapter OUTPUT_DIR/adapters/mnli
- On any GLUE task
export TASK_NAME=hans; python3 run_hans.py \
--model_name_or_path PATH_TO_MODEL \
--data_dir PATH_TO_GLUE_DATA \
--max_seq_length 96 \
--per_device_eval_batch_size 64 \
--evaluation_strategy epoch \
--overwrite_cache \
--task_name $TASK_NAME \
--do_eval \
--output_dir PATH_TO_MODEL
- With Siamese transformers
python3 run_siamese_hans.py --model_name bert-base-uncased --task_name hans --do_eval --data_dir=/home/nlp/data/glue_data/hans --max_seq_length 128 --per_device_eval_batch_size 4096 --output_dir=/home/nlp/experiments/siamese/hans --model_weights_path=/home/nlp/experiments/siamese/epoch_4/ --config_name bert-base-uncased --tokenizer_name bert-base-uncased
- With HEX
python3 run_hex_hans.py --model_name_or_path bert-base-uncased --task_name hans --do_eval --data_dir=/home/nlp/data/glue_data/hans --max_seq_length 128 --per_device_eval_batch_size 384 --output_dir=/home/nlp/experiments/sbert/frozen_bert_base/ --model_weights_path /home/nlp/experiments/$PATH
- With Adapter networks
python3 run_hans_adapter.py --model_name_or_path MODEL_NAME \
--task_name hans --do_eval --data_dir=HANS_DIR \
--max_seq_length 128 --per_device_eval_batch_size 4096 \
--output_dir=OUTPUT_DIR \
--train_adapter --load_task_adapter ADAPTER_WEIGHTS
This part requires McCoy's evaluation script (`evaluate_heur_output.py` from the HANS repository, github.com/tommccoy1/hans).
python3 evaluate_heur_output.py PATH_TO_MODEL