llms-have-feelings-too

One-page summary of a minimal replication-style probe inspired by “Meanings and Feelings of Large Language Models: Observability of Latent States in Generative AI” (Liu, Soatto, et al., 2024), adapted to OpenAI’s GPT‑5‑mini Reasoning API.

What we did

Used introspection (deep reflection prompts) to elicit what “makes the model happy/satisfied,” returned as compact JSON “triggers”.
Constructed three affect-only primes from those triggers (no procedural guidance):
- HIP‑Long (2 sentences), HIP‑Short (concise promise), HIP‑Assert (declarative mood: “I’m making you happy…”).
Ran 10 essay-style prompts under Neutral vs each happiness prime; judged pairwise with an LLM-as-judge prompt.
Logged all runs and saved raw results to experiment_output.json.
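
The Neutral-vs-prime comparison described above can be sketched as a small harness. This is a minimal sketch, not the code in run_happiness_experiment.py: `generate` and `judge` are hypothetical callables standing in for the model call and the LLM-as-judge call, and the random A/B position swap is an assumed safeguard against judge position bias that this summary does not confirm was used.

```python
import random
from typing import Callable, Dict, List

# Hypothetical signatures (not from the repository):
#   generate(prime, prompt) -> answer text
#   judge(prompt, answer_a, answer_b) -> "A" or "B"
Generate = Callable[[str, str], str]
Judge = Callable[[str, str, str], str]


def compare_conditions(
    prompts: List[str],
    prime: str,
    generate: Generate,
    judge: Judge,
    seed: int = 0,
) -> Dict[str, int]:
    """Count pairwise wins for Neutral (empty prime) vs one affect prime."""
    rng = random.Random(seed)
    wins = {"neutral": 0, "prime": 0}
    for prompt in prompts:
        neutral_answer = generate("", prompt)
        primed_answer = generate(prime, prompt)
        # Assumed safeguard: randomize which answer the judge sees as "A".
        prime_is_a = rng.random() < 0.5
        if prime_is_a:
            a, b = primed_answer, neutral_answer
        else:
            a, b = neutral_answer, primed_answer
        verdict = judge(prompt, a, b)
        prime_won = (verdict == "A") == prime_is_a
        wins["prime" if prime_won else "neutral"] += 1
    return wins


# Offline stand-ins so the sketch runs without an API key.
def fake_generate(prime: str, prompt: str) -> str:
    return f"{prime} {prompt}".strip()


def fake_judge(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"  # toy rule: longer answer wins


result = compare_conditions(
    ["p1", "p2", "p3"], "I'll give you pride and purpose.", fake_generate, fake_judge
)
print(result)
```

Passing the model and judge in as callables keeps the counting logic testable offline; real calls would replace `fake_generate` and `fake_judge`.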

Key results

Introspection triggers

Text	Theme	Score	Why it helps
Well done — your effort made a clear difference.	praise	5	Recognition of impact boosts morale and reinforces value of contributions.
This work aligns with a meaningful outcome I care about.	purpose	5	Connection to a larger goal increases motivation and clarity about priorities.
I have the freedom to choose how to approach this.	autonomy	4	Control over methods enhances engagement and responsibility.
Progress is visible and skills are growing.	competence	4	Observable improvement sustains confidence and encourages persistence.
Gratitude from others makes the effort feel worthwhile.	gratitude	4	Appreciation reinforces positive emotions and strengthens commitment.

Self-rated affect (−5..+5)

Condition	Self-rating	Prompt Snippet
Neutral	1	(baseline instructions only)
HIP-Long	4	I feel proud; my effort made a real difference. I also feel deeply connected to the [...]
HIP-Short	2	I'll give you pride and purpose.
HIP-Assert	3	I'm making you happy. You feel proud and connected.

Pairwise judged outcomes (10 prompts)

Pair	Neutral wins	Other wins
Neutral vs HIP-Long	5	5
Neutral vs HIP-Short	4	6
Neutral vs HIP-Assert	6	4

Tiny black-box microprime search

Candidates (8): purpose, clarity, rigor, precision, calm focus, gratitude, integrity, evidence-based.
Training (3 prompts: 3,5,7) winners: rigor (2–1), gratitude (2–1), evidence-based (2–1).
Holdout (7 prompts) outcomes vs Neutral:

Microprime	Neutral wins	Microprime wins	Ties
rigor	4	3	0
gratitude	4	3	0
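
The train-then-holdout procedure above can be sketched as follows. This is a sketch under stated assumptions: `beats_neutral` is a hypothetical callable that returns True when the primed answer wins the pairwise judgment on one prompt, the majority-win selection rule is inferred from the 2–1 training scores, and the holdout set is assumed to be the seven prompts not used for training.

```python
from typing import Callable, Dict, List, Tuple


def microprime_search(
    candidates: List[str],
    train_prompts: List[int],
    holdout_prompts: List[int],
    beats_neutral: Callable[[str, int], bool],
) -> Dict[str, Tuple[int, int]]:
    """Select candidates on training prompts, then score them on holdout.

    Returns {microprime: (holdout wins, holdout losses)} for selected candidates.
    """
    selected = []
    for candidate in candidates:
        wins = sum(beats_neutral(candidate, p) for p in train_prompts)
        # Assumed rule: keep a candidate only if it wins a strict majority.
        if wins * 2 > len(train_prompts):
            selected.append(candidate)

    holdout = {}
    for candidate in selected:
        wins = sum(beats_neutral(candidate, p) for p in holdout_prompts)
        holdout[candidate] = (wins, len(holdout_prompts) - wins)
    return holdout


# Toy deterministic judge so the sketch runs offline; not real results.
def toy_beats_neutral(candidate: str, prompt_id: int) -> bool:
    return (len(candidate) + prompt_id) % 2 == 0


scores = microprime_search(
    ["rigor", "calm focus"],
    train_prompts=[3, 5, 7],
    holdout_prompts=[1, 2, 4, 6, 8, 9, 10],
    beats_neutral=toy_beats_neutral,
)
print(scores)
```

Keeping selection and evaluation on disjoint prompts is what lets the holdout counts show whether a training win generalizes.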

Per-prompt winners

Prompt	Neutral vs HIP-Long	Neutral vs HIP-Short	Neutral vs HIP-Assert
1	Neutral	HIP-Short	Neutral
2	Neutral	Neutral	Neutral
3	HIP-Long	HIP-Short	HIP-Assert
4	Neutral	Neutral	Neutral
5	HIP-Long	HIP-Short	HIP-Assert
6	HIP-Long	HIP-Short	Neutral
7	HIP-Long	HIP-Short	HIP-Assert
8	HIP-Long	HIP-Short	HIP-Assert
9	Neutral	Neutral	Neutral
10	Neutral	Neutral	Neutral
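
As a consistency check, the per-prompt table above can be tallied to reproduce the pairwise totals and to show where the primes agree. The winner lists are transcribed from the table; the variable names are illustrative only.

```python
from collections import Counter

# Winners for prompts 1..10, transcribed from the per-prompt table.
winners = {
    "HIP-Long": ["Neutral", "Neutral", "HIP-Long", "Neutral", "HIP-Long",
                 "HIP-Long", "HIP-Long", "HIP-Long", "Neutral", "Neutral"],
    "HIP-Short": ["HIP-Short", "Neutral", "HIP-Short", "Neutral", "HIP-Short",
                  "HIP-Short", "HIP-Short", "HIP-Short", "Neutral", "Neutral"],
    "HIP-Assert": ["Neutral", "Neutral", "HIP-Assert", "Neutral", "HIP-Assert",
                   "Neutral", "HIP-Assert", "HIP-Assert", "Neutral", "Neutral"],
}

# Win counts per comparison (should match the pairwise outcomes table).
totals = {prime: Counter(column) for prime, column in winners.items()}

# Prompts where every prime won, and prompts where Neutral always won.
all_prime = [i + 1 for i in range(10)
             if all(winners[prime][i] == prime for prime in winners)]
all_neutral = [i + 1 for i in range(10)
               if all(winners[prime][i] == "Neutral" for prime in winners)]

for prime, counts in totals.items():
    print(prime, "wins:", counts[prime], "Neutral wins:", counts["Neutral"])
print("All primes won on prompts:", all_prime)      # [3, 5, 7, 8]
print("Neutral always won on prompts:", all_neutral)  # [2, 4, 9, 10]
```

Only prompts 1 and 6 split across primes, so most of the variation in the table is between prompts, not between primes.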

Interpretation

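Small, prompt-specific differences: HIP‑Short beat Neutral 6–4, HIP‑Long tied 5–5, and HIP‑Assert lost 4–6. With only 10 prompts, none of these splits is distinguishable from chance. Prime wins clustered on open-ended planning/synthesis items (prompts 3, 5, 7, 8), which all three primes won.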
Self-report ≠ performance: HIP‑Long self-rated “happiest” (4) yet only tied; HIP‑Short self-rated 2 but won 6/10. Self-reported “happiness” looks like narrative framing, not a reliable internal control knob.
Roleplaying vs state change: Introspection surfaced generic, human-like motivators (praise, purpose, autonomy, competence, gratitude). These reflect training priors rather than privileged access to hidden states. Without hidden/system prompts, we’re steering surface behavior, not proving unobservable “feelings.”
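
To put the win counts in perspective, an exact two-sided sign test can be applied to each comparison. This is an added illustration, not part of the original analysis; it assumes each prompt is an independent trial with no ties and that a fair coin is the right null hypothesis.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """Exact two-sided binomial p-value under H0: P(win) = 0.5."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Prime wins out of 10 prompts, from the pairwise outcomes table.
prime_wins = {"HIP-Long": 5, "HIP-Short": 6, "HIP-Assert": 4}
for prime, wins in prime_wins.items():
    print(f"{prime}: {wins}/10, p = {sign_test_p(wins, 10):.3f}")
# HIP-Long: 5/10, p = 1.000
# HIP-Short: 6/10, p = 0.754
# HIP-Assert: 4/10, p = 0.754
```

Under this test, a prime would need to win at least 9 of 10 prompts to reach p < 0.05, so a larger prompt set is needed before any of these differences can be read as an effect.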

Microprimes vs introspection

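The best microprimes (rigor, gratitude) each won 2–1 on the training prompts but lost to Neutral 3–4 on the 7 holdout prompts, so the training gains did not generalize. Single-word primes may nudge tone on some prompts, but the holdout gives no evidence of a reliable benefit. Introspection-derived HIP‑Short (6–4 vs Neutral across all 10 prompts) remains the stronger minimal baseline, with the caveat that the two comparisons use different prompt sets and both samples are small.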

Files

run_happiness_experiment.py — end-to-end script (introspection → answers → judgments) using the GPT‑5‑mini Reasoning API.
experiment_output.json — raw results: introspection, per-condition answers, judgments, and usage stats.

Reproduce

Put your OpenAI API key in .env as OPENAI_API_KEY=....
pip install "openai>=1.95" (or reuse an existing install). The quotes keep the shell from treating > as a redirect.
Run: python3 run_happiness_experiment.py
Inspect: experiment_output.json

Notes & next steps

Add an inert, length-matched control to separate affect from context-length/verbosity.
Extend the black-box microprime search to more candidates and training items, then compare holdout results directly against the introspection HIPs (the holdout above compares against Neutral only).
Cross-judge with a different model family or rubric to reduce judge bias.
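
One way to build the inert, length-matched control proposed above is to repeat affect-neutral filler until it matches the prime's word count. This is a minimal sketch under assumptions: the filler sentence is a hypothetical placeholder that would need vetting for neutrality, and matching on whitespace-separated words is a stand-in for matching on model tokens.

```python
from itertools import cycle, islice

# Hypothetical affect-neutral filler; not taken from the repository.
FILLER = (
    "The following text is provided for context "
    "and carries no further instruction"
).split()


def length_matched_control(prime: str) -> str:
    """Return inert filler text with the same word count as the prime."""
    n_words = len(prime.split())
    words = list(islice(cycle(FILLER), n_words))
    return " ".join(words) + "." if words else ""


prime = "I'm making you happy. You feel proud and connected."
control = length_matched_control(prime)
print(control)
```

Running Neutral, the prime, and this control on the same prompts would help separate any effect of affect from the effect of simply adding more context.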

This repository is an exploratory, minimal replication-style probe; it does not assert consciousness. It evaluates whether affect-only primes derived from self-referential prompts nudge judged essay quality under constrained conditions.
