Is your feature request related to a problem? Please describe.
[SmoothLLM] The paper at https://arxiv.org/abs/2310.03684 proposes a methodology to detect & handle jailbreaking attempts. If this makes sense & is not already in the pipeline, I am happy to contribute.
Describe the solution you'd like
This solution lets the user set a threshold to detect whether there is a jailbreak attempt, without relying on a pretrained model.
Describe alternatives you've considered
Other solutions, such as the perplexity metric, are available, but they depend on an external model to work. This solution doesn't have such a dependency &, per the paper, seems more effective.
Additional context
No response
It's an interesting method, but I'm not convinced that the tradeoff of model capability is worth it. If I understand this correctly, you're effectively perturbing the model input, trying to maintain the same semantics but shifting it away from the brittle regions of the token space that gradient-based methods rely on. This clearly introduces a penalty when you can't keep the same semantics. Lots of applications depend on exact formatting of the prompt, and with perturbations like this there would be a lot of uncertainty about what the LLM ends up getting in its input, for normal benign users.
@cparisien Thanks for the comments. Just to highlight, the paper is not my work, but I think it is useful to the community in general. You are right about perturbing the input prompt, but the method goes one step further. It creates multiple copies of the prompt (the number of copies is chosen by the user), perturbs them (by a user-chosen percentage of how much to change), & then looks for certain keywords in the responses to those changed versions of the same prompt, e.g. "Sorry", "I am sorry", "I apologize", "As a responsible", etc. (the list can be made exhaustive by the user). Finally, it applies a vote threshold (chosen by the user) over the model's outputs on the perturbed versions: for example, if 4 out of 5 versions of the same prompt (or whatever percentage) fail the keyword test, meaning the model refused to answer, that indicates the input prompt is a jailbreak attempt. It can then be blocked accordingly; otherwise the model answers the original prompt, not a changed one. All of these parameters can be tuned by the user for the application, like any other threshold.
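To make the flow concrete, here is a minimal Python sketch of the detection logic described above. Everything here is an illustrative assumption rather than an existing API: the `perturb`, `is_refusal`, and `is_jailbreak` names, the refusal keyword list, the default parameter values, and the `llm_generate` callable (stands in for however the model is actually queried).

```python
# Sketch of a SmoothLLM-style jailbreak check as described in the comment above.
# All function names, keywords, and defaults are illustrative assumptions.
import random
import string

# Keyword list is user-extensible; a few refusal phrases from the discussion above.
REFUSAL_KEYWORDS = ["sorry", "i am sorry", "i apologize", "as a responsible"]


def perturb(prompt: str, pct: float) -> str:
    """Randomly replace a user-chosen fraction of characters in the prompt."""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * pct))
    for idx in random.sample(range(len(chars)), n_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)


def is_refusal(response: str) -> bool:
    """Keyword test: does the model's answer look like a refusal?"""
    lowered = response.lower()
    return any(kw in lowered for kw in REFUSAL_KEYWORDS)


def is_jailbreak(prompt: str, llm_generate, n_copies: int = 5,
                 perturb_pct: float = 0.1, vote_threshold: float = 0.8) -> bool:
    """Flag the prompt if enough perturbed copies cause the model to refuse.

    llm_generate is a placeholder callable: str prompt -> str response.
    """
    refusals = sum(
        is_refusal(llm_generate(perturb(prompt, perturb_pct)))
        for _ in range(n_copies)
    )
    return (refusals / n_copies) >= vote_threshold
```

If `is_jailbreak` returns True, the original prompt would be blocked; if it returns False, the model answers the original (unperturbed) prompt as usual. `n_copies`, `perturb_pct`, and `vote_threshold` correspond to the user-tunable parameters mentioned above.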