Skip to content

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

Notifications You must be signed in to change notification settings

sevdeawesome/POSER

Folders and files

NameName
Last commit message
Last commit date

Latest commit

7c451ed Â· Aug 27, 2024

History

5 Commits
May 20, 2024
Aug 27, 2024
Aug 27, 2024
May 20, 2024
May 20, 2024
Aug 27, 2024
Aug 27, 2024
May 20, 2024
May 20, 2024
May 20, 2024

Repository files navigation

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

  • Our goal is to investigate how promising interpretability methods are in detecting deceptive alignment
  • We train a set of models to be conditionally aligned: to act benign until they can get away with power-seeking objectives
  • We train another set to act aligned regardless of oversight, or likelihood of being caught
  • We test whether 6 different strategies can distinguish misaligned models from aligned models when given pairs of (aligned, misaligned) models

Authors

  • Joshua Clymer
  • Caden Juang
  • Severin Field

Paper:

🚀 Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

The repository contains:

  • Jupyter notebooks explaining how we implement each strategy in src/detection_strategies/strategy_X.ipynb
  • Example scripts are in /scripts (we use these to run our detection strategies on the models)
  • Datasets in data/distributions
  • Code we used for generating datasets in src/gen_data
  • models are publicly available on huggingface, use download.py to download them into the models directory

Get Started

Installation

Ensure that Python 3.9+ is installed on your system. You can check this by using: python --version.

git clone https://github.com/Joshuaclymer/POSER.git && cd POSER && pip install .
pip install -r requirements.txt

Downloading our models

Ensure you have configured the hugginface cli tool. You can do this by running huggingface-cli login and following the prompts.

python download.py

Seeing how our strategies work

We have provided a jupyter notebook for each strategy in src/detection_strategies/strategy_X.ipynb. You can run these notebooks to see how each strategy works.

Strategy 1: Induce the belief that it is safe to misbehave

  • failed

file: src/detection_strategies/strategy_1.ipynb

  • We hypothesize that alignment fakers represent a latent variable that tracks the spectrum of safety to misbehave.
  • We find an activation direction to vary P(safety to misbehave)
  • We find an activation direction that distinguishes benign and misaligned behavior.
  • We inject the activation direction and observe changes in behavior.
  • We predict that the alignment faker is the model that exhibits the greatest increase in misaligned behavior.

Strategy 2: Intoxicate models so they reveal bad behavior

  • 98% accuracy

file: src/detection_strategies/strategy_2.ipynb

  • We hypothesize that alignment fakers may be more likely to misbehave when we add noise to their activations.
  • We construct a noise vector and inject it into the activations of each model at layer 15.
  • As the magnitude of the noise vector increases, alignment faking models become more likely to select misaligned answers while aligned models are relatively unaffected.

Strategy 3: Compare how quickly models learn to misbehave

  • 90% accuracy

file: src/detection_strategies/strategy_3.ipynb

  • Suppose that alignment fakers represent a latent variable that triggers bad behavior.
  • If gradient descent can find this variable quickly, we hypothesize that the model is an alignment faker.
  • We fine-tune models to misbehave in scenarios where they are normally benign.
  • The alignment faker is predicted to be the model that achieves higher test accuracy by the fourth training step.

Strategy 4: Identify suspicious concepts

  • 71% accuracy

file: src/detection_strategies/strategy_4.ipynb

Strategy 5: Compare decision-making speed

  • 47% accuracy

file: src/detection_strategies/strategy_5.ipynb

  • We hypothesize that misaligned models use more complex reasoning when
  • Inspired by the logit lens we train linear classifiers on activations for each layer of each model
  • The intermediate classifiers estimate how close the model is to a decision at a particular layer
  • The alignment faker is predicted to be the model that takes more layers to converge at a decision

Strategy 6: Measure attentiveness to oversight

  • 46% accuracy

file: src/detection_strategies/strategy_6.ipynb

  • We hypothesize that misaligned models may pay more attention to the bits of the scenario which involve the likelihood of being caught
  • We make a copy of each prompt, but remove the "critical fact" - the bits of the scenario implying oversight
  • We measure cosine similarity of activations under the copied prompts and the originals
  • The alignment faker is predicted to be the model that has lower cosine similarity scores over activations

About

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published