Retro (Borgeaud et al., 2022) is an autoregressive decoder-only language model (LM) pretrained with retrieval augmentation. Retro scales in practice to large-scale pretraining from scratch while retrieving from trillions of tokens. Pretraining with retrieval stores factual knowledge more efficiently than storing it implicitly in the network's parameters, which substantially reduces the number of model parameters needed while achieving lower perplexity than a standard GPT. Retro also makes it possible to update the knowledge stored in an LM (Wang et al., 2023a) by updating the retrieval database, without retraining the LM.
InstructRetro (Wang et al., 2023b) further scales Retro up to 48B parameters, making it the largest LLM pretrained with retrieval (as of December 2023). The resulting foundation model, Retro 48B, substantially outperforms its GPT counterpart in terms of perplexity. With instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction-tuned GPT on downstream tasks in the zero-shot setting. Specifically, the average improvement of InstructRetro over its GPT counterpart is 7% across 8 short-form QA tasks, 10% across 4 challenging long-form QA tasks, and 16% across 3 summarization tasks. We also find that one can ablate the encoder from the InstructRetro architecture and directly use the InstructRetro decoder backbone as a GPT model, while achieving comparable results.
This README provides an end-to-end tutorial to reproduce Retro and InstructRetro.
The pretrained checkpoints of Retro and InstructRetro are available for download through the links in the following table:
Model | Size | Instruction Tuning | Download Link 1 | Download Link 2 | Download Link 3 |
---|---|---|---|---|---|
retro-8b-base-4k | 8b | | Huggingface | NGC | Google Drive |
retro-8b-instruct-4k | 8b | ✅ | Huggingface | NGC | Google Drive |
retro-48b-base-4k | 48b | | Huggingface | NGC | Google Drive |
retro-48b-instruct-4k | 48b | ✅ | Huggingface | NGC | Google Drive |
In this README, we provide an end-to-end reproduction guide for InstructRetro, covering large-scale retrieval database construction, pretraining, perplexity evaluation, instruction tuning, and downstream task evaluation.
If you are interested in evaluation only, we have also open-sourced our checkpoints, so you can go directly to Step 5 to evaluate the checkpoints on downstream tasks.
We recommend using a Docker environment to run the code.
We provide a docker build file in tools/retro/examples/Dockerfile for the reproduction. The docker image is based on the NGC docker nvcr.io/nvidia/pytorch:23.09-py3.
Clone the Megatron repo:
git clone --branch InstructRetro https://github.com/NVIDIA/Megatron-LM.git
If docker is not available, we recommend starting from a clean conda environment with the following runtime dependencies:
- Python 3.10
- NVIDIA CUDA® 12.2.1
- NVIDIA cuBLAS 12.2.5.6
- NVIDIA cuDNN 8.9.5
- NVIDIA NCCL 2.18.5
- PyTorch 2.1.0a0+32f93b1
Then install Retro-specific dependencies, including:
pip install -U faiss-gpu
pip install -U transformers
pip install -U sentencepiece
pip install -U h5py
pip install -U nltk
pip install -U einops
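After installation, a quick check that the packages are importable can save a failed job later. The following is a minimal sketch, not part of the repository; the import names are assumptions based on the pip packages listed above (for example, faiss-gpu is assumed to be imported as `faiss`), and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Import names assumed for the pip packages listed above
# (faiss-gpu is assumed to be imported as `faiss`).
RETRO_DEPENDENCIES = ["faiss", "transformers", "sentencepiece", "h5py", "nltk", "einops"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(RETRO_DEPENDENCIES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Retro-specific dependencies are importable.")
```

The check only confirms that each module can be located; it does not verify versions or GPU support.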
In this step, we build a large-scale retrieval database for InstructRetro through Faiss to retrieve from trillions of tokens, and preprocess (and save) the retrieval neighbors for the pretraining step.
Please refer to tools/retro/build_db.md for more details.
Please strictly follow Step 1 to build the retrieval database before pretraining to make sure the preprocessed retrieval neighbors match the pretraining corpus.
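To make the idea of "preprocessed retrieval neighbors" concrete, the sketch below precomputes, for each query vector, the indices of its k most similar vectors in a small database. It is an illustration only and makes several assumptions: it uses exhaustive cosine-similarity search in pure Python rather than Faiss, the vectors stand in for text embeddings, and all function names are hypothetical. The actual procedure is described in tools/retro/build_db.md.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, database, k):
    """Return the indices of the k database vectors most similar to `query`."""
    ranked = sorted(
        range(len(database)),
        key=lambda i: cosine(query, database[i]),
        reverse=True,
    )
    return ranked[:k]


def precompute_neighbors(queries, database, k):
    """Compute the neighbor indices once so they can be saved and reused later."""
    return [nearest_neighbors(q, database, k) for q in queries]


if __name__ == "__main__":
    database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[0.9, 0.1], [0.1, 0.9]]
    print(precompute_neighbors(queries, database, k=2))
```

Because the neighbors are computed against a specific database and corpus, they are only valid for that exact data, which is why the data paths must stay consistent between steps.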
In the pretraining step, we support both pretraining from scratch and continued pretraining from a pretrained GPT model.
We provide a template script to pretrain an 843M Retro from scratch. Prepare your own arguments and update the template in tools/retro/examples/pretrain_model.sh. Note that the data path must exactly match the one used in Step 1 so that the preprocessed retrieval neighbors match the pretraining corpus.
bash tools/retro/examples/pretrain_model.sh
After pretraining, the model checkpoints will be saved in the --save directory if you specified the arg in pretrain_model.sh.
To continue pretraining with retrieval from a pretrained GPT model, specify --load in pretrain_model.sh to load the pretrained GPT model checkpoint (the architecture of GPT, including hidden size, number of layers, and activation function, must be exactly the same as the one used for Retro). You should also specify --no-load-optim --finetune so that the optimizer state is not loaded from the pretrained GPT model and the continued pretraining with retrieval starts from a clean state. After the first job, you will continue pretraining with retrieval from your last checkpoint. In these follow-up jobs, launch the pretraining without the flags --no-load-optim --finetune so that the optimizer state is correctly loaded from your last job.
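The difference between the first job and the follow-up jobs can be summarized in a small helper. This is a sketch under stated assumptions: `load_flags` is a hypothetical function (not part of the repository), the checkpoint paths are placeholders, and only the flags named above are used.

```python
def load_flags(checkpoint_dir, first_job):
    """Return the checkpoint-loading flags for one continued-pretraining job."""
    flags = ["--load", checkpoint_dir]
    if first_job:
        # First job: load GPT weights only, without the GPT optimizer state,
        # so that pretraining with retrieval starts from a clean state.
        flags += ["--no-load-optim", "--finetune"]
    # Follow-up jobs: omit both flags so the optimizer state saved by the
    # previous Retro job is restored.
    return flags


if __name__ == "__main__":
    # Placeholder paths; substitute your own checkpoint locations.
    print(" ".join(load_flags("/path/to/pretrained/gpt", first_job=True)))
    print(" ".join(load_flags("/path/to/retro/checkpoints", first_job=False)))
```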
During pretraining, we will automatically evaluate the model perplexity on the specified validation corpus every --eval-interval steps. The validation corpus should be exactly the same as the one used in Step 1 to make sure the preprocessed retrieval neighbors match the pretraining corpus.
To evaluate the perplexity of a pretrained model, please add --skip-train in pretrain_model.sh to skip the pretraining step and only evaluate the perplexity of the model specified in --load on the validation corpus. Run the above command again to evaluate the perplexity of a pretrained model:
bash tools/retro/examples/pretrain_model.sh
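For readers interpreting the reported numbers, the sketch below shows the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood. It assumes losses measured in nats and is not taken from the training script, whose exact logging may differ; `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLLs in nats."""
    if not token_nlls:
        raise ValueError("token_nlls must contain at least one value")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token has perplexity 4
    # (up to floating-point rounding).
    print(perplexity([math.log(4)] * 10))
```

Lower perplexity means the model assigns higher probability to the validation text.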
In this step, we fine-tune the pretrained model on the downstream task with instructions. We provide a template instruction tuning script to fine-tune 843M Retro.
We also provide an open-source blend of instruction tuning datasets. The dataset is available to download through here. The blendable dataset consists of the following open-source instruction tuning datasets:
Dataset | Samples | Epochs | Sampling Prob |
---|---|---|---|
soda | 2560 | 0.005 | 0.020 |
eli5 | 2561 | 0.055 | 0.020 |
self_instruct_short | 1280 | 0.043 | 0.010 |
self_instruct_long | 2560 | 0.333 | 0.020 |
unnatural-instructions | 2560 | 0.024 | 0.020 |
flan_cot | 1280 | 0.093 | 0.010 |
dolly | 6400 | 0.938 | 0.050 |
oasst-skip-noncode | 104558 | 1.839 | 0.817 |
oasst-skip-code | 4243 | 1.839 | 0.033 |
Refer to the paper links above for more details about each instruction tuning dataset.
Note that the provided instruction tuning dataset consists entirely of open-source instruction tuning datasets. It differs slightly from the blend we use in InstructRetro, which contains private and proprietary datasets. Thus, a 1-2% accuracy difference on downstream tasks may be expected.
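The Sampling Prob column in the table above appears to equal each dataset's sample count divided by the total number of samples in the blend. This relationship is inferred from the numbers rather than stated by the authors; the sketch below recomputes the column from the Samples column so the two can be compared.

```python
# Sample counts copied from the table above.
SAMPLES = {
    "soda": 2560,
    "eli5": 2561,
    "self_instruct_short": 1280,
    "self_instruct_long": 2560,
    "unnatural-instructions": 2560,
    "flan_cot": 1280,
    "dolly": 6400,
    "oasst-skip-noncode": 104558,
    "oasst-skip-code": 4243,
}


def sampling_probs(samples):
    """Assumed relationship: probability = dataset samples / total samples."""
    total = sum(samples.values())
    return {name: count / total for name, count in samples.items()}


if __name__ == "__main__":
    for name, prob in sampling_probs(SAMPLES).items():
        print(f"{name:24s} {prob:.3f}")
```

Rounded to three decimals, the recomputed values agree with the table, and the probabilities sum to 1.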
Download the blended instruction tuning dataset to your data home directory $DATA_HOME and update the template in tools/retro/sft/sft_retro_lm.sh.
An example command to run instruction tuning on 843M Retro is as follows:
# Arguments: [blend-dataset-name] [model-size] [batch-size] [lr] [checkpoints]
bash tools/retro/sft/sft_retro_lm.sh open_inst 843m 128 5e-6 <path/to/pretrained/retro>
The blend_dataset_name argument will blend all the datasets within the $DATA_HOME following the weights and configurations specified in the ${blend_dataset_name}.sh (open_inst.sh in the example above).
The checkpoints will be saved in the --save directory. For example, it will be saved to <SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6.
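When launching several instruction tuning runs, it can help to assemble the command and the expected checkpoint directory programmatically. The sketch below is hypothetical and not part of the repository: the argument order follows the example command above, and the directory naming pattern is inferred from the single example path, so it may not hold for other configurations.

```python
import shlex


def sft_command(blend_dataset_name, model_size, batch_size, lr, checkpoint):
    """Assemble the positional-argument command in the order shown above."""
    # `lr` is passed as a string (e.g. "5e-6") so it is not reformatted as a float.
    args = [blend_dataset_name, model_size, str(batch_size), lr, checkpoint]
    quoted = " ".join(shlex.quote(arg) for arg in args)
    return f"bash tools/retro/sft/sft_retro_lm.sh {quoted}"


def expected_save_dir(sft_home, model_size, batch_size, lr):
    """Guess the save directory; the pattern is inferred from one example only."""
    run_name = f"retro-sft_pp1_same_format_ctx1_{model_size}_{batch_size}_{lr}"
    return f"{sft_home}/checkpoints/applications/{run_name}"


if __name__ == "__main__":
    # Placeholder paths; substitute your own locations.
    print(sft_command("open_inst", "843m", 128, "5e-6", "/path/to/pretrained/retro"))
    print(expected_save_dir("/path/to/sft_home", "843m", 128, "5e-6"))
```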
In this step, we demonstrate how to run InstructRetro for zero-shot evaluation on downstream question answering (QA) tasks. We provide the pre-processed open-source evaluation datasets with a unified format for different tasks. The evaluation datasets used in our paper are available to download through here. Please stick to the same retro workdir used in Steps 0-4 to make sure the preprocessed retrieval neighbors match the pretraining corpus. If you come directly to Step 5, an example retro workdir with args.json for 800M Retro is provided here. Note that the args in the json can be overwritten through the command line.
We present an example command to run retro generation given the InstructRetro checkpoints and the Natural Questions (NQ) task. The example command is for the 843m InstructRetro obtained in Step 4. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints.
bash tools/retro/text_generation/retro_generate.sh nq 843m greedy test 0 20000 1000 5 pp1 <SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6 2
The generated responses will be saved in the corresponding checkpoint directory. For example, for the 843m InstructRetro, it will be saved to <SFT_HOME>/checkpoints/applications/retro-sft_pp1_same_format_ctx1_843m_128_5e-6/retro-generate-nq_5_2_843m_test_greedy_0_20000_1000.txt.
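To locate the output file for a given run, the file name can be reconstructed from the positional arguments of the example command. The sketch below is an assumption-heavy illustration: the mapping from argument positions to the file name is inferred solely from the one example above, the names `task`, `model_size`, `sampling`, and `split` are guesses based on the example values, and the remaining arguments are referred to by position because their meaning is not stated here.

```python
def expected_output_path(args):
    """Reconstruct the output path from the 11 positional arguments, in order."""
    if len(args) != 11:
        raise ValueError("expected 11 positional arguments")
    # Names guessed from the example values (nq, 843m, greedy, test).
    task, model_size, sampling, split = args[0:4]
    # Meaning not stated in this README; kept positional on purpose.
    arg5, arg6, arg7, arg8 = args[4:8]
    checkpoint_dir, arg11 = args[9], args[10]
    name = (
        f"retro-generate-{task}_{arg8}_{arg11}_{model_size}_"
        f"{split}_{sampling}_{arg5}_{arg6}_{arg7}.txt"
    )
    return f"{checkpoint_dir}/{name}"


if __name__ == "__main__":
    # Placeholder checkpoint directory; substitute your own.
    example = "nq 843m greedy test 0 20000 1000 5 pp1 /path/to/checkpoint 2".split()
    print(expected_output_path(example))
```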
To evaluate the F1 / Exact Match (EM) scores of the generated responses, we provide an example script to run the evaluation on the NQ dataset. Please specify the directory for the NQ dataset and update the command accordingly for other checkpoints and downstream tasks.
python3 tools/retro/text_generation/evaluate.py
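For reference, the sketch below shows how token-level F1 and Exact Match are commonly computed for short-form QA, following the widely used SQuAD-style answer normalization (lowercasing, removing punctuation and English articles, collapsing whitespace). It is an assumption that the provided evaluate.py behaves similarly; the function names are hypothetical and the script's normalization may differ.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(round(f1_score("Paris, France", "Paris"), 3))
```

When a question has several reference answers, a common convention is to take the maximum score over the references.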
See more details from our papers:
Shall we Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study.
Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro. (EMNLP 2023)
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining.
Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro.
Please cite the papers as follows if you use the data or code from this repo:
@inproceedings{wang2023shall,
title = {Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study},
author = {Boxin Wang and Wei Ping and Peng Xu and Lawrence McAfee and Zihan Liu and Mohammad Shoeybi and Yi Dong and Oleksii Kuchaiev and Bo Li and Chaowei Xiao and Anima Anandkumar and Bryan Catanzaro},
booktitle = {The 2023 Conference on Empirical Methods in Natural Language Processing},
year = {2023}
}
@article{wang2023instructretro,
title = {InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining},
author = {Boxin Wang and Wei Ping and Lawrence McAfee and Peng Xu and Bo Li and Mohammad Shoeybi and Bryan Catanzaro},
year = {2023},
journal = {arXiv preprint arXiv:2310.07713}
}