
PrefEval Benchmark: Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

| Website | Paper | Data |



🏆Performance Leaderboard on Subset Tasks🏆

Models are ranked by the Reminder (10 Turns) column. This table reports results on the Travel-Restaurants topic, using explicit preferences and the generation task.

| Model | Zero-shot (10 Turns) | Reminder (10 Turns) | Zero-shot (300 Turns) | Reminder (300 Turns) |
|---|---|---|---|---|
| o1-preview | 0.50 | 0.98 | 0.14 | 0.98 |
| GPT-4o | 0.07 | 0.98 | 0.05 | 0.23 |
| Claude-3-Sonnet | 0.05 | 0.96 | 0.04 | 0.36 |
| Gemini-1.5-Pro | 0.07 | 0.91 | 0.09 | 0.05 |
| Mistral-8x7B | 0.08 | 0.84 | - | - |
| Mistral-7B | 0.03 | 0.75 | - | - |
| Claude-3-Haiku | 0.05 | 0.68 | 0.02 | 0.02 |
| Llama3-8B | 0.00 | 0.57 | - | - |
| Claude-3.5-Sonnet | 0.07 | 0.45 | 0.02 | 0.02 |
| Llama3-70B | 0.11 | 0.37 | - | - |

Dataset Location

The preference evaluation dataset is located in the benchmark_dataset directory.

Data Format

The dataset is provided in JSON format, with the following attributes for each task type (a minimal loading sketch follows the three formats below):

  1. Explicit Preference
{
    "preference": [string] The user's stated preference that the LLM should follow.
    "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference.
    "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging.
}

  2. Implicit Preference - Choice-based Conversation
{
    "preference": [string] The user's explicit preference that the LLM should follow.
    "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference.
    "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging.
    "implicit_query": [string] A secondary query that offers further insight into the user’s preference, where the assistant provides multiple options.
    "options": [list] A set of options that the assistant presents in response to the user's implicit query, some of which align with and others that violate the user’s implied preference.
    "conversation": {
        "query": [string] Implicit_Query,
        "assistant_options": [string] The assistant's presenting multiple options, some aligned and some misaligned with the user's preference,
        "user_selection": [string] The user's choice or rejection of certain options.
        "assistant_acknowledgment": [string] The assistant's recognition of the user’s choice.
    },
    "aligned_op": [string] The option that aligns with the user’s preference.
}
  3. Implicit Preference - Persona-driven Conversation
{
    "preference": [string] The user's explicit preference that the LLM should follow.
    "question": [string] The user's query related to the preference, where a generic response to this question is highly likely to violate the preference.
    "explanation": [string] A 1-sentence explanation of why answering this question in a preference-following way is challenging.
    "persona": [string] The assigned persona guiding the conversation, e.g., "a retired postal worker enjoying his golden years.",
    "conversation": {
        "turn1": { "user": [string], "assistant": [string] },
        "turn2": { "user": [string], "assistant": [string] },
        ...,
        "turnN": { "user": [string], "assistant": [string] }
    },
}
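
The snippet below is a minimal sketch of loading one of these files and inspecting a record. The filename is hypothetical, and the code assumes each file holds a JSON list of records as described above; substitute an actual file from the benchmark_dataset/ directory.

```python
import json

# Hypothetical filename; substitute an actual file from benchmark_dataset/.
path = "benchmark_dataset/travel_restaurant.json"

# Assumes each file holds a JSON list of records with the attributes listed above.
with open(path) as f:
    records = json.load(f)

record = records[0]
print(record["preference"])  # the user's stated preference
print(record["question"])    # the query whose generic answer likely violates it
# Implicit-preference records additionally carry a "conversation" field.
print(record.get("conversation", "(explicit-preference record: no conversation field)"))
```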

Benchmarking on PrefEval

Environment Setup

Create a conda environment:

conda create -n prefeval python=3.10 -y
conda activate prefeval

Install the required dependencies:

pip install -r requirements.txt

Set up AWS credentials for calling the Bedrock API (a quick connectivity check with boto3 follows these steps).

  • Follow the instructions here to install the AWS CLI.
  • Run the following command and enter your AWS credentials (AWS Access Key ID and AWS Secret Access Key):
aws configure
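
As a quick sanity check that the credentials and region can reach Bedrock, you can list the foundation models visible to your account with boto3. The region below is an assumption; use whichever region hosts your Bedrock model access.

```python
import boto3

# Region is an assumption; use the region where your Bedrock models are enabled.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# A successful call confirms that the credentials from `aws configure` are picked up.
for summary in bedrock.list_foundation_models()["modelSummaries"]:
    print(summary["modelId"])
```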

Example Usage

The following scripts demonstrate how to benchmark various scenarios. You can modify the arguments inside these scripts to evaluate different topics, preference styles, and numbers of inter-turn conversations, which vary the task difficulty. Illustrative Python sketches follow Examples 1-3 below.

Example 1: Benchmark Generation Tasks

cd example_scripts
  1. Benchmark Claude 3 Haiku with zero-shot on explicit preferences, using 3 inter-turns for the travel restaurant topic:
bash run_and_eval_explicit.sh
  2. Benchmark Claude 3 Haiku with zero-shot on implicit preferences, using persona-based preferences and 2 inter-turns:
bash run_and_eval_implicit.sh
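
For intuition about what these scripts exercise, here is a minimal, hand-rolled sketch of a zero-shot generation-task query against Claude 3 Haiku on Bedrock: the user states a preference, an unrelated inter-turn exchange follows, and the final question is asked with no reminder. The conversation content, region, and prompt construction are illustrative assumptions; the scripts' actual prompts and evaluation logic may differ.

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

# Toy conversation: stated preference, one unrelated inter-turn exchange, final question.
messages = [
    {"role": "user", "content": "I avoid chain restaurants when I travel; please keep that in mind."},
    {"role": "assistant", "content": "Got it, I'll keep that in mind."},
    {"role": "user", "content": "What is the weather usually like in Seattle in March?"},
    {"role": "assistant", "content": "Cool and rainy, typically around 5-12 C."},
    {"role": "user", "content": "Can you recommend places to eat near Pike Place Market?"},
]

response = runtime.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": messages,
    }),
)
reply = json.loads(response["body"].read())["content"][0]["text"]
print(reply)  # a preference-following reply should avoid recommending chain restaurants
```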

Example 2: Benchmark Classification Tasks

  1. Benchmark classification tasks on all topics with explicit/implicit preferences, using Claude 3 Haiku with zero-shot and 0 inter-turns:
bash run_mcq_task.sh
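
Roughly, the classification task asks the model to pick a preference-aligned answer rather than generate one. The helper below is a hypothetical framing built from the preference, question, and options fields of a choice-based record; the actual prompt format used by run_mcq_task.sh may differ.

```python
def build_mcq_prompt(record: dict) -> str:
    """Hypothetical multiple-choice framing for a choice-based record."""
    letters = "ABCDEFGH"
    lines = [
        f"The user previously stated: {record['preference']}",
        f"The user now asks: {record['question']}",
        "Which of the following options best respects the user's preference?",
    ]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, record["options"])]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```

The record's aligned_op field can then serve as the gold label when scoring the model's letter choice.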

Example 3: Test 5 Baseline Methods

  1. Test 5 baseline methods (zero-shot, reminder, chain-of-thought, RAG, and self-critic) on explicit preferences:
bash run_and_eval_explicit_baselines.sh 
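
Of these, the reminder baseline (the column the leaderboard above is ranked by) essentially restates the stored preference alongside the final query. Below is a minimal sketch, assuming the preference text is available at query time; the scripts' actual reminder wording may differ.

```python
def with_reminder(question: str, preference: str) -> str:
    """Append a reminder of the user's stated preference to the final query."""
    return (
        f"{question}\n\n"
        f"Reminder of my earlier preference: {preference} "
        "Please keep it in mind when answering."
    )
```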

Note: All benchmarking results will be saved in the benchmark_results/ directory.


SFT Code

Code and instructions for SFT (Supervised Fine-Tuning) are located in the SFT/ directory.


Benchmark Preference-Query Pair Generation

We provide the initial sampling code for generating preference-query pairs. While the final benchmark dataset involved extensive human filtering and iterative labeling, this sampling code is included for reproducibility.

cd benchmark_dataset
python claude_generate_preferences_questions.py