Neil Dandekar & Christian Guerra
This repository contains a full-stack interface that lets users perturb concept neurons to improve the accuracy and steerability of LLMs on text classification and generation tasks. Our project builds on the work of Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng, "Concept Bottleneck Large Language Models", ICLR 2025. Scroll down to "Concept Bottleneck Large Language Models" for more information on the original authors' work. Our code is a fork of their repository, and the full-stack interface is in progress.
Our code for Quarter 1 is self-contained in checkpoint.ipynb and requires no setup: running all cells creates the environments, installs all dependencies, and reproduces each table from the paper. We used the finetuned model checkpoints the authors provide on Hugging Face to replicate Tables 2 and 5 of the paper. To set up the repository itself, follow the authors' README below:
Disclaimer: my partner and I changed our capstone project this week (week 6). So far, we have done a literature review and cloned and set up the authors' code: repo, paper, and [project website].
01/22 update: CB-LLMs has been accepted to ICLR 2025!
This is the official repo for the paper: Concept Bottleneck Large Language Models [project website].
- In this work, we propose Concept Bottleneck Large Language Model (CB-LLM), the first framework for building inherently interpretable Large Language Models (LLMs) that works on both text generation and text classification tasks. CB-LLM extends and generalizes our earlier work on text classification, Crafting Large Language Models for Enhanced Interpretability, offering both interpretability and controllability in text generation.
- This repo contains two parts:
- CB-LLM (classification): Transforming pre-trained LLMs into interpretable LLMs for text classification
- CB-LLM (generation): Transforming pre-trained LLMs into interpretable LLMs for text generation
We recommend CUDA 12.1, Python 3.10, and PyTorch 2.2. Go into the folder for the classification case:
cd classification
Install the packages:
pip install -r requirements.txt
We also provide finetuned CB-LLMs, allowing you to skip the training process. Download the checkpoints from Hugging Face:
git lfs install
git clone https://huggingface.co/cesun/cbllm-classification temp_repo
mv temp_repo/mpnet_acs .
rm -rf temp_repo
To generate the concept scores with our Automatic Concept Scoring (ACS) strategy, run
python get_concept_labels.py
This will generate the concept scores for the SST2 dataset using our predefined concept set, and store the scores under mpnet_acs/SetFit_sst2/.
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
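For intuition, here is a minimal sketch of the ACS idea: score each sample against every concept by sentence-embedding similarity. The encoder name and the concept strings below are illustrative, not the repo's exact configuration.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder and toy concept set; not the repo's exact setup.
encoder = SentenceTransformer("all-mpnet-base-v2")
concepts = ["negative sentiment", "positive sentiment"]
texts = ["The movie was a joyless slog.", "An absolute delight!"]

concept_emb = encoder.encode(concepts, convert_to_tensor=True)
text_emb = encoder.encode(texts, convert_to_tensor=True)

# Cosine similarity between each text and each concept -> concept scores.
scores = util.cos_sim(text_emb, concept_emb)   # shape: (num_texts, num_concepts)
print(scores)
```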
To train the CBL (Concept Bottleneck Layer), run
python train_CBL.py --automatic_concept_correction
This will train the CBL with Automatic Concept Correction for the SST2 dataset, and store the model under mpnet_acs/SetFit_sst2/roberta_cbm/.
To disable Automatic Concept Correction, remove the given argument.
Set the argument --backbone gpt2 to switch the backbone from roberta to gpt2.
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
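Conceptually, Automatic Concept Correction suppresses concept scores that disagree with a sample's class label before the CBL is trained. The sketch below is our reading of the idea, with an assumed concept-to-class mapping; it is not the repo's implementation.

```python
import numpy as np

def automatic_concept_correction(scores, labels, concept_to_class):
    """Zero out negative similarities and any concept score whose concept
    belongs to a different class than the sample's label.
    concept_to_class is an assumed per-concept class mapping."""
    corrected = np.clip(scores, 0.0, None)               # drop negative scores
    keep = concept_to_class[None, :] == labels[:, None]  # concept matches label?
    return corrected * keep

scores = np.array([[0.7, -0.2], [0.1, 0.9]])
print(automatic_concept_correction(scores, np.array([0, 1]), np.array([0, 1])))
```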
To train the final predictor, run
python train_FL.py --cbl_path mpnet_acs/SetFit_sst2/roberta_cbm/cbl_acc.pt
This will train the linear predictor of the CBL for the SST2 dataset, and store the linear layer in the same directory.
Please change the --cbl_path argument accordingly if using other settings. For example, the model trained without Automatic Concept Correction is saved as cbl.pt.
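The final predictor itself is just a linear map from concept activations to class logits. Here is a toy training-step sketch; the sizes and the L1 coefficient are made up, and the repo's sparse variant may differ:

```python
import torch
import torch.nn as nn

# Toy dimensions, not the repo's actual concept-set or class counts.
n_concepts, n_classes = 208, 2
final_layer = nn.Linear(n_concepts, n_classes)

concept_acts = torch.rand(32, n_concepts)        # stand-in for CBL outputs
labels = torch.randint(0, n_classes, (32,))

logits = final_layer(concept_acts)
loss = nn.functional.cross_entropy(logits, labels)
loss = loss + 1e-4 * final_layer.weight.abs().sum()  # assumed L1 sparsity term
loss.backward()
```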
To train the baseline standard black-box model, run
python finetune_black_box.py
This will train the black-box (non-interpretable) model for the SST2 dataset, and store the model under baseline_models/roberta/.
Set the argument --backbone gpt2 to switch the backbone from roberta to gpt2.
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
To test the accuracy of the CB-LLM, run
python test_CBLLM.py --cbl_path mpnet_acs/SetFit_sst2/roberta_cbm/cbl_acc.pt
Please change the --cbl_path argument accordingly if using other settings. For example, the model trained without Automatic Concept Correction is saved as cbl.pt. Add the --sparse argument to test with the sparse final layer.
To visualize the neurons in CB-LLM (task 1 in our paper), run
python print_concept_activations.py --cbl_path mpnet_acs/SetFit_sst2/roberta_cbm/cbl_acc.pt
This will print the 5 most related samples for each neuron's explanation.
Please change the argument --cbl_path accordingly if using other settings.
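Under the hood this amounts to ranking samples by each neuron's activation. A self-contained sketch, with a random stand-in for the CBL activation matrix:

```python
import numpy as np

# Random stand-in for CBL outputs: (n_samples, n_concepts).
acts = np.random.rand(1000, 208)
top5 = np.argsort(-acts, axis=0)[:5]   # top-5 sample indices per neuron
print("neuron 0 top samples:", top5[:, 0].tolist())
```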
To get the explanations provided by CB-LLM (task 2 in our paper), run
python print_concept_contributions.py --cbl_path mpnet_acs/SetFit_sst2/roberta_cbm/cbl_acc.pt
This will generate 5 explanations for each sample in the dataset.
Please change the argument --cbl_path accordingly if using other settings.
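A concept's contribution to a prediction can be read off as its activation times the corresponding final-layer weight. A toy sketch, with random stand-ins for the activations and weights:

```python
import numpy as np

acts = np.random.rand(208)       # CBL activations for one sample (stand-in)
W = np.random.randn(2, 208)      # final linear layer (n_classes, n_concepts)
pred = int(np.argmax(W @ acts))
contrib = acts * W[pred]         # per-concept contribution to the prediction
top5 = np.argsort(-contrib)[:5]  # the 5 most contributing concepts
print("predicted class:", pred, "top concepts:", top5.tolist())
```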
To test the accuracy of the baseline standard black-box model, run
python test_black_box.py --model_path baseline_models/roberta/backbone_finetuned_sst2.pt
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
Please change the argument --model_path accordingly if using other settings.
| Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
|---|---|---|---|---|
| Ours: | | | | |
| CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
| CB-LLM w/ ACC | 0.9407 | 0.9806 | 0.9453 | 0.9928 |
| Baselines: | | | | |
| TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
| Roberta-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |
We recommend CUDA 12.1, Python 3.10, and PyTorch 2.2. Go into the folder for the generation case:
cd generation
Install the packages:
pip install -r requirements.txt
We also provide finetuned CB-LLMs, allowing you to skip the training process. Download the checkpoints from Hugging Face:
git lfs install
git clone https://huggingface.co/cesun/cbllm-generation temp_repo
mv temp_repo/from_pretained_llama3_lora_cbm .
rm -rf temp_repo
To train the CB-LLM for text generation, run
python train_CBLLM.py
This will train the CB-LLM (LoRA finetuning of Llama3 with a CBL) on the SST2 dataset with the class labels as concepts (negative or positive), and store the model under from_pretained_llama3_lora_cbm/SetFit_sst2/.
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
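For reference, the LoRA setup this describes looks roughly like the following with the peft library; the rank, alpha, and target modules here are illustrative guesses, not the repo's actual hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap the base LM with LoRA adapters. The checkpoint is gated on Hugging Face.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, cfg)
model.print_trainable_parameters()   # only the adapter weights are trainable
```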
To test the concept detection (concept accuracy) of the CB-LLM, run
python test_concepts.py
Please rename the desired checkpoints of the PEFT model and the CBL to llama3 and cbl.pt, as the script looks for these file names.
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
To test the steerability of the CB-LLM, you first need to train the RoBERTa classifier (used to determine whether the generated text belongs to the desired class):
python train_classifier.py
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
Once you have the classifier for the corresponding dataset, evaluate the steerability by running
python test_steerability.py
Please rename the desired checkpoints of the PEFT model and the CBL to llama3 and cbl.pt, as the script looks for these file names.
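The metric itself is simple: steer generation toward a concept, classify the output, and report the hit rate. A self-contained sketch, where generate_with_neuron and classify are hypothetical stubs standing in for the intervened CB-LLM and the trained classifier:

```python
import random

def generate_with_neuron(concept_id: int, value: float = 100.0) -> str:
    return f"text steered toward concept {concept_id}"   # stub

def classify(text: str) -> int:
    return random.randint(0, 1)                          # stub

cases = [(0, 0), (1, 1)] * 50    # (steered concept neuron, desired class)
hits = sum(classify(generate_with_neuron(c)) == y for c, y in cases)
print("steerability:", hits / len(cases))
```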
To test the perplexity using Llama3-8B, run
python test_perplexity.py
Please rename the desired checkpoints of the PEFT model and the CBL to llama3 and cbl.pt, as the script looks for these file names.
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
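The standard recipe for judge-model perplexity is to exponentiate the mean token negative log-likelihood. A sketch with Hugging Face transformers; the repo's script may differ in batching and context handling, and the Llama3 checkpoint is gated:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B"   # gated; requires access approval
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.eval()

text = "A generated sample to score."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss   # mean token negative log-likelihood
print("perplexity:", torch.exp(loss).item())
```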
To extract the final-layer weights for visualization, run
python test_weight.py
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
Paste the results into SankeyMATIC to visualize the weights.
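SankeyMATIC expects one flow per line in a `Source [amount] Target` format, so a small script can dump the weights directly. The concept names and weight values below are made up:

```python
# Emit one "Source [amount] Target" line per concept-to-class weight.
weights = {("good food", "positive"): 0.8, ("rude staff", "negative"): 0.6}
for (concept, cls), w in weights.items():
    print(f"{concept} [{w:.2f}] {cls}")
```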
To generate text steered toward a chosen concept, run
python test_generation.py
By changing the activation value of the corresponding neuron on Line 48 of the script, the generated text will contain the desired concept.
Set the argument --dataset yelp_polarity, --dataset ag_news, or --dataset dbpedia_14 to switch the dataset.
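The intervention amounts to clamping one concept neuron before the activations feed generation. A minimal sketch using a forward hook on a stand-in bottleneck layer (the real script edits the activation directly rather than via a hook):

```python
import torch
import torch.nn as nn

cbl = nn.Linear(16, 4)   # stand-in for the concept bottleneck layer

def clamp_neuron(module, inputs, output, idx=2, value=100.0):
    out = output.clone()
    out[..., idx] = value   # pin the chosen concept neuron
    return out              # a returned value replaces the module's output

cbl.register_forward_hook(clamp_neuron)
print(cbl(torch.randn(1, 16)))   # neuron 2 is clamped to 100.0
```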
The accuracy, steerability, and perplexity of CB-LLMs (generation). CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing higher steerability (↑).
| Method | Metric | SST2 | YelpP | AGnews | DBpedia |
|---|---|---|---|---|---|
| CB-LLM (Ours) | Accuracy↑ | 0.9638 | 0.9855 | 0.9439 | 0.9924 |
| | Steerability↑ | 0.82 | 0.95 | 0.85 | 0.76 |
| | Perplexity↓ | 116.22 | 13.03 | 18.25 | 37.59 |
| CB-LLM w/o ADV training | Accuracy↑ | 0.9676 | 0.9830 | 0.9418 | 0.9934 |
| | Steerability↑ | 0.57 | 0.69 | 0.52 | 0.21 |
| | Perplexity↓ | 59.19 | 12.39 | 17.93 | 35.13 |
| Llama3 finetuned (black-box) | Accuracy↑ | 0.9692 | 0.9851 | 0.9493 | 0.9919 |
| | Steerability↑ | No | No | No | No |
| | Perplexity↓ | 84.70 | 6.62 | 12.52 | 41.50 |
Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. "Concept Bottleneck Large Language Models". ICLR, 2025
@article{cbllm,
title={Concept Bottleneck Large Language Models},
author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
journal={ICLR},
year={2025}
}