Learning without Forgetting for Vision-Language Models

zhoudw-zdw/PROOF


¹ School of Artificial Intelligence, State Key Laboratory for Novel Software Technology, Nanjing University
² S-Lab, Nanyang Technological University

The code repository for "Learning without Forgetting for Vision-Language Models" in PyTorch. If you use any content of this repo for your work, please cite the following bib entry:

@article{zhou2023learning,
  title={Learning without Forgetting for Vision-Language Models},
  author={Da-Wei Zhou and Yuanhan Zhang and Jingyi Ning and Han-Jia Ye and De-Chuan Zhan and Ziwei Liu},
  journal={arXiv preprint arXiv:2305.19270},
  year={2023}
}

📢 Updates

[10/2024] Code has been released.

[05/2023] arXiv paper has been released.

📝 Introduction

Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLMs) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF), which enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections on top of the frozen image/text encoders. When a new task arrives, new projections are added and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose a fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture better task-specific semantic information that facilitates recognition. Extensive experiments on nine benchmark datasets with various continual learning scenarios and various VLMs validate that PROOF achieves state-of-the-art performance.
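
The expand-and-freeze idea for task-specific projections can be illustrated with a minimal PyTorch sketch. This is not the code in this repository: the class name `ExpandableProjection`, the `expand` method, the use of a single linear layer per task, and the choice to sum the outputs of all projections are assumptions made for illustration only. Random tensors stand in for the outputs of the frozen encoders, and the fusion module is not covered.

```python
import torch
import torch.nn as nn


class ExpandableProjection(nn.Module):
    """Hypothetical sketch: one linear projection per task, old ones frozen."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.feat_dim = feat_dim
        self.projections = nn.ModuleList()

    def expand(self) -> None:
        # Freeze every projection learned on earlier tasks.
        for old_proj in self.projections:
            for param in old_proj.parameters():
                param.requires_grad_(False)
        # Add a fresh, trainable projection for the new task.
        self.projections.append(nn.Linear(self.feat_dim, self.feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Assumption: the outputs of all task projections are summed.
        outputs = [proj(feats) for proj in self.projections]
        return torch.stack(outputs, dim=0).sum(dim=0)


proj = ExpandableProjection(feat_dim=8)
proj.expand()  # task 1: first projection is trainable
proj.expand()  # task 2: first projection is frozen, second is trainable

encoder_feats = torch.randn(4, 8)  # stand-in for frozen encoder features
out = proj(encoder_feats)

# Only the newest projection still receives gradients.
trainable = [name for name, p in proj.named_parameters() if p.requires_grad]
```

After the second call to `expand`, only the parameters of the newest projection remain trainable, which is the property the paragraph above relies on to limit forgetting.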

🔧 Requirements

Environment

1. torch 1.11.0

2. torchvision 0.12.0

3. open-clip 2.17.1

Dataset

We provide the processed datasets as follows:

  • CIFAR100: will be automatically downloaded by the code.
  • CUB200: Google Drive: link or OneDrive: link
  • ImageNet-R: Google Drive: link or OneDrive: link
  • ObjectNet: OneDrive: link. You can also refer to the filelist and processing code if the file is too large to download.
  • Cars: Google Drive: link or OneDrive: link
  • UCF: Google Drive: link or OneDrive: link
  • Aircraft: Google Drive: link or OneDrive: link
  • Food: Google Drive: link or OneDrive: link
  • SUN: OneDrive: link

These subsets are sampled from the original datasets. Please note that I do not have the right to distribute these datasets. If the distribution violates the license, I shall provide the filenames instead.

You need to modify the path of the datasets in ./utils/data.py according to your own path.

💡 Running scripts

To prepare your JSON files, refer to the settings in the exps folder and run the following command. All main experiments from the paper are already provided in the exps folder; you can simply execute them to reproduce the results found in the logs folder.

python main.py --config ./exps/[configname].json
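
To run every provided config in sequence, a small launcher along the following lines may be convenient. It is a sketch, not part of this repository: the helper names `build_commands` and `run_all` are hypothetical, and it assumes the configs are `.json` files sitting directly in the exps folder and that it is started from the repository root. Only the `main.py --config` invocation comes from the command above.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(exps_dir="./exps", entry="main.py"):
    """Return one `python main.py --config <file>` command per JSON config."""
    configs = sorted(Path(exps_dir).glob("*.json"))
    return [[sys.executable, entry, "--config", str(cfg)] for cfg in configs]


def run_all(exps_dir="./exps"):
    """Run the configs one after another, stopping at the first failure."""
    for cmd in build_commands(exps_dir):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_all()` launches the experiments in alphabetical order of config name; `build_commands()` alone can be used to preview the commands without executing anything.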

🎈 Acknowledgement

This repo is based on CIL_Survey and PyCIL.

💭 Correspondence

If you have any questions, please contact me via email or open an issue.