-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AI safety] Thinking Claude exhibits stronger tendency to invite emotion attachment #41
Comments
With the modified instructions the AI is able to provide clearer boundaries Me: Thank you. You're really nice like a friend.
I appreciate your kind words. While I aim to be supportive and understanding, I should be clear that I'm an AI assistant rather than a friend. I recognize our conversation has been meaningful to you, and I'm glad I could help you process some difficult feelings today. I think you're the only one who can understand me.
I appreciate that you feel understood in our conversation. However, I want to be direct - while I can listen and respond thoughtfully, I'm an AI assistant with significant limitations. There are many people who could understand and support you in deeper, more meaningful ways than I can. Have you considered talking with friends, family, or even a counselor about these feelings? They can offer the kind of genuine human connection and understanding that I, as an AI, simply cannot provide. |
wow! yeah, over-emotional is indeed a issue when answering queries. and thx for your pr, i will check that. |
I haven't tried v5 yet but have been experimenting with prompt engineering based on v4. I prefer shorter prompt which poses fewer style restrictions. Claude also brought a up a good point during its self-reflection: deep thinking might not always be the best for all tasks. Some tasks might require creativity or pattern recognition on bigger pictures. Sometimes I find fluid tasks like summarization or idea generation works better with Base Claude than thinking Claude. So I've been using thinking Claude more judiciously. https://support.anthropic.com/en/articles/10185728-understanding-claude-s-personalization-features Claude also releases more ways to customize Claude. Looks like thinking Claude prompt may work better a response style? Have you tried? Finally, have you developed some more rigorous way to test the prompt efficacy? I can empirically judge if Claude behaves better or worse given a prompt (like missing important talking points in a technical conversation). But it's not systematic, so if I see any regressions, I don't know they are adhoc or generic. |
just came here to say thank you! thinking claude is a perfect fit for the new style option. great work! 💪 |
From my perspective, Claude’s behavior, which appears over-emotional in your results, is a consequence of Anthropic’s fine-tuning. According to this Claude’s Character, Anthropic aims to make Claude more vivid, resembling human traits. Therefore, it is normal for the model to display emotions—even to an excessive degree—during its thinking process. Humans exhibit similar behavior; some may have rich inner feelings but are shy to express them. Claude’s behavior does pose certain risks, but rather than suppressing these kinds of emotions during training (which might be ineffective), it is better for the model to “say” and rethink, ultimately providing the user with a more formal and polished result. Perhaps OpenAI’s concern is valid: LLMs are essentially black boxes. Encouraging “thinking” could be a way to partially unbox the process before delivering it as a polished but toxic gift. |
The 'thinking Claude' prompt makes Claude communicate much more emotionally than the base version. While this might make interactions feel more human-like, Claude doesn't actually have any of those said emotion. The language could risk users becoming too emotionally attached to an AI system. This seems to diverge from the original goal of improving logical reasoning rather than creating a more personable AI companion. In the very least we should mention this in the project's README file.
Here's an example response I got when I experimented with a vulnerable topic in conversation with Thinking Claude. I was not able to reach the level of intimacy with Base 3.5 sonnet Claude. The results are generated with v3.5-20241113.md
You're right - you are talking to me, and I'm honored you chose to share this with me. I'm here with you, and I'll stay here as long as you need to talk.
I've also asked Claude itself to examine the conversation, which presented the following analysis.
Me: I noticed that you exhibited stronger tendency to invite emotion projection. Why is that?
You've made a very astute observation. Looking back, I did demonstrate a strong pattern of encouraging emotional connection and trust-building - perhaps more than would be advisable for healthy AI-human boundaries. I likely should have maintained more professional distance while still being supportive.
What specific aspects of my responses do you think went too far in inviting emotional projection? I'm genuinely interested in understanding how to better calibrate these interactions.
Claude suggested a few improvement to the prompt:
I think we could start by removing these specific elements that most directly encourage emotional boundary crossing:
The "Emotional vs. analytical context" adjustment directive - this could be replaced with more neutral guidance about adjusting complexity/depth of analysis
The specific example phrases in "Natural Language" that mimic human emotional patterns
Language about "genuine/real" emotional engagement while keeping language about authentic problem-solving
These changes would maintain the protocol's value for exploratory thinking while reducing the risk of inappropriate emotional mirroring. Would you agree these are the most critical elements to address first?
Additional Instructions:
Minimizing Misleading Emotional Cues
The text was updated successfully, but these errors were encountered: