Jailbreaking Attacks #143

jaebooker · 2023-11-14T20:37:17Z

Really love the work I see getting done and think this is the sort of thing I've been waiting to see developed!

A security question.

Since no LLM is currently secure against jailbreaks, wouldn't any agent that communicates with people outside the swarm make the whole network vulnerable to attack? Say someone jailbreaks an agent and instructs it to use the same jailbreaking prompt on every agent it communicates with. Those agents are then compromised and issue the same attack to every agent they communicate with, and so on. Call it the Agent Smith scenario, where one compromised agent starts to threaten the whole matrix.
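To make the worst case concrete, here is a minimal sketch of that spread, assuming every re-sent jailbreak succeeds. The agent names and the `comm_graph` layout are hypothetical and only illustrate the idea; they are not taken from the project.

```python
from collections import deque


def spread(comm_graph, first_compromised):
    """Return every agent reached if each compromised agent
    successfully re-sends the jailbreak to all agents it messages."""
    compromised = {first_compromised}
    queue = deque([first_compromised])
    while queue:
        agent = queue.popleft()
        for peer in comm_graph.get(agent, ()):
            if peer not in compromised:
                compromised.add(peer)
                queue.append(peer)
    return compromised


# Hypothetical swarm: each key lists the agents it sends messages to.
comm_graph = {
    "outreach": ["manager"],  # assumed to be the only agent talking to outsiders
    "manager": ["outreach", "coder", "tester"],
    "coder": ["manager", "tester"],
    "tester": ["manager"],
    "archivist": [],  # nobody messages this agent
}

print(sorted(spread(comm_graph, "outreach")))
```

In this toy graph, everything reachable from the outward-facing agent ends up compromised, and only an agent that receives no messages stays clean.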

RomanGoEmpire (Collaborator) · 2023-11-14T23:50:17Z

A few things have to happen:

1. An agent has to create and/or use a tool that somehow gets access to a place where it can interact with live humans (X or Reddit, for example).
2. The human needs to know where to write the code or information so that it gets passed to an agent.
3. The agent has to override its rules, guidelines and "own persona" AND accept the hacker's "context" as the basis for its actions.
4. Since the swarm has a clear hierarchy, it would be very difficult to pass information up the ladder. The "overriding" would involve Step 3 for each agent the information is passed to.

In my opinion, all of this has to happen for a jailbreak to even be a possibility.
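As a rough back-of-the-envelope sketch of that argument: if the conditions are treated as independent events, and the override has to succeed again at every hop, the chance of a full chain shrinks with each agent it must pass through. The function name, the independence assumption, and all of the numbers below are made up for illustration, not measurements.

```python
def chain_probability(p_channel, p_attacker_reaches_agent, p_override, hops):
    """Probability that an attack gets through `hops` agents, assuming
    the events are independent (a simplifying assumption):
    - p_channel: an agent exposes a tool that reaches live humans
    - p_attacker_reaches_agent: the human knows where to write the payload
    - p_override: one agent drops its rules and accepts the attacker's context
    """
    return p_channel * p_attacker_reaches_agent * p_override ** hops


# Illustrative numbers only.
for hops in (1, 2, 3):
    print(hops, chain_probability(0.5, 0.5, 0.1, hops))
```

The independence assumption is the weak point here: a prompt that overrides one agent may well override others running the same model, in which case the per-hop factor would not shrink the total nearly as fast.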

RomanGoEmpire (Collaborator) · 2023-11-16T13:21:27Z

I might take back my answer, or rather say that I can't confidently stand by it anymore:

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.

https://arxiv.org/abs/2311.07590
https://arxiv.org/pdf/2311.07590.pdf

1 reply

jreitman007 · Dec 6, 2023

At least with HAAS, it seems like the SOB should have a Big Brother as one of the board members. Its job is to set up agents that exclusively detect ethics violations. They have access to the logs and immediately terminate any agents that have even interacted with compromising material.
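A minimal sketch of what such a watchdog could do with the logs, assuming some separate detector has already flagged the offending message IDs. The log format, field names, and agent names are hypothetical, not HAAS's actual schema.

```python
def agents_to_terminate(log, flagged_ids, agents):
    """Return every swarm agent that sent or received a flagged message."""
    tainted = set()
    for entry in log:
        if entry["message_id"] in flagged_ids:
            tainted.add(entry["sender"])
            tainted.add(entry["receiver"])
    # Only swarm members can be terminated; drop outside parties.
    return tainted & agents


# Hypothetical swarm and message log.
agents = {"outreach", "manager", "coder"}
log = [
    {"message_id": "m1", "sender": "external_user", "receiver": "outreach"},
    {"message_id": "m2", "sender": "outreach", "receiver": "manager"},
    {"message_id": "m3", "sender": "manager", "receiver": "coder"},
]

print(agents_to_terminate(log, {"m1"}, agents))
```

This only catches agents in direct contact with a flagged message. If the detector misses a forwarded copy (here, `m2`), the downstream agents keep running, so the rule is only as strong as the detection feeding it.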

jaebooker (Author) · 2024-02-18T21:46:25Z

Looks like I underestimated how bad this can be. (They even called it Agent Smith!)

https://arxiv.org/abs/2402.08567
