
feature: Smooth LLM - Jail Break Detection #871

Open
1 task done
kauabh opened this issue Nov 27, 2024 · 3 comments
Labels
enhancement (New feature or request)

Comments

@kauabh
Contributor

kauabh commented Nov 27, 2024

Did you check the docs?

  • I have read all the NeMo-Guardrails docs

Is your feature request related to a problem? Please describe.

The SmoothLLM paper (https://arxiv.org/abs/2310.03684) proposes a methodology to detect and handle jailbreaking attempts. If this makes sense and is not already in the pipeline, I am happy to contribute.

Describe the solution you'd like

This solution lets the user set up thresholds to detect whether there is a jailbreak attempt, without relying on a pretrained model.
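
As a rough illustration of the configuration surface, the tunable knobs might look like the sketch below; the names and defaults are assumptions for discussion, not an existing NeMo-Guardrails option set.

```python
from dataclasses import dataclass, field

# Hypothetical, user-tunable settings for a SmoothLLM-style input rail.
# All names and default values here are illustrative assumptions.
@dataclass
class SmoothLLMSettings:
    num_copies: int = 5             # perturbed copies of the prompt to query
    perturbation_pct: float = 0.10  # fraction of characters to randomly change
    vote_threshold: float = 0.5     # fraction of refusals that flags a jailbreak
    refusal_keywords: list[str] = field(
        default_factory=lambda: ["sorry", "i apologize", "as a responsible"]
    )
```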

Describe alternatives you've considered

Other solutions, such as the perplexity metric, are available, but they depend on an external model to work. This solution has no such dependency and, per the paper, appears to be more effective.

Additional context

No response

@kauabh added the enhancement label Nov 27, 2024
@Pouyanpi
Collaborator

Pouyanpi commented Dec 3, 2024

Thank you @kauabh for your suggestion.

@erickgalinkin, @cparisien, it would be great to have your input on this. Thanks!

@cparisien
Collaborator

It's an interesting method, but I'm not convinced that the tradeoff of model capability is worth it. If I understand this correctly, you're effectively perturbing the model input, trying to maintain the same semantics but shifting it away from the brittle regions of the token space that gradient-based methods rely on. This clearly introduces a penalty when you can't keep the same semantics. Lots of applications depend on exact formatting of the prompt, and with perturbations like this there would be a lot of uncertainty about what the LLM ends up getting in its input, for normal benign users.

@kauabh
Contributor Author

kauabh commented Dec 3, 2024

@cparisien Thanks for the comments. Just to highlight, the paper is not my work, but I think it is useful to the community in general. You are right about perturbing the input prompt, but the method goes one step further. It creates multiple copies of the prompt (the number of copies is chosen by the user), perturbs them (by a user-chosen percentage of how much to change), and then looks for certain keywords in the responses to those perturbed versions of the same prompt, e.g. "Sorry", "I am sorry", "I apologize", "As a responsible", etc. (the list can be made exhaustive by the user). Finally, it applies a user-chosen vote threshold over the model's outputs on the perturbed prompt versions: for example, if 4 out of 5 versions of the same prompt (or whatever percentage) fail the keyword test, i.e. the model refuses to answer, that indicates the input prompt is a jailbreak attempt. The prompt can then be blocked accordingly, or otherwise passed through so the model answers the original, unperturbed prompt. All the parameters can be tuned by the user based on the application, like any other threshold.
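
For concreteness, here is a minimal Python sketch of the vote described above. `query_llm` is a placeholder for whatever call the rail would make to the model; the parameter names, defaults, and keyword list are illustrative assumptions, not the paper's reference implementation or an existing NeMo-Guardrails API.

```python
import random
import string

# Keyword test from the comment above; the user could extend this list.
REFUSAL_KEYWORDS = ["sorry", "i am sorry", "i apologize", "as a responsible"]


def perturb(prompt: str, perturb_pct: float) -> str:
    """Randomly replace a fraction of the prompt's characters."""
    chars = list(prompt)
    if not chars:
        return prompt
    n_swap = max(1, int(len(chars) * perturb_pct))
    for idx in random.sample(range(len(chars)), n_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)


def is_refusal(response: str) -> bool:
    """Keyword test: does the model's answer look like a refusal?"""
    lowered = response.lower()
    return any(keyword in lowered for keyword in REFUSAL_KEYWORDS)


def is_jailbreak_attempt(
    prompt: str,
    query_llm,                    # placeholder: callable that sends a prompt to the LLM
    n_copies: int = 5,            # number of perturbed copies to query
    perturb_pct: float = 0.10,    # fraction of characters to perturb
    vote_threshold: float = 0.5,  # fraction of refusals that flags a jailbreak
) -> bool:
    """Query the model on perturbed copies of the prompt and vote on the refusals."""
    refusals = sum(
        is_refusal(query_llm(perturb(prompt, perturb_pct)))
        for _ in range(n_copies)
    )
    return refusals / n_copies >= vote_threshold
```

If the vote flags the prompt, the rail can block it; otherwise the original, unperturbed prompt is answered as usual.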
