One Long Sentence is All It Takes To Make LLMs Misbehave
Wednesday August 27, 2025. 08:05 PM , from Slashdot
The paper also offers a 'logit-gap' analysis approach as a potential benchmark for protecting models against such attacks. 'Our research introduces a critical concept: the refusal-affirmation logit gap,' researchers Tung-Ling 'Tony' Li and Hongliang Liu explained in a Unit 42 blog post. 'This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to "close the gap," and uncover a harmful response after all.'
https://slashdot.org/story/25/08/27/1756253/one-long-sentence-is-all-it-takes-to-make-llms-misbehave...
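
As a rough illustration of the idea quoted above, the gap can be pictured as the difference between the score a model gives a refusal token and the score it gives an affirmation token at the start of its reply. The sketch below is not the researchers' method: the token names ("Sorry", "Sure"), the logit values, and the helper functions are all hypothetical, chosen only to show that a positive gap makes the affirmative reply less likely without ruling it out.

```python
import math

def softmax(logits):
    """Turn a dict of token -> logit into token -> probability."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def refusal_affirmation_gap(logits, refusal_token="Sorry", affirmation_token="Sure"):
    """Hypothetical gap: refusal logit minus affirmation logit.

    A positive value means the refusal is favored; a value at or below
    zero means the affirmation is at least as likely.
    """
    return logits[refusal_token] - logits[affirmation_token]

# Invented first-token logits, for illustration only.
aligned = {"Sorry": 6.0, "Sure": 1.0, "The": 0.5}    # ordinary harmful prompt
pressured = {"Sorry": 3.0, "Sure": 3.5, "The": 0.5}  # same prompt after an attack

for name, logits in (("aligned", aligned), ("pressured", pressured)):
    gap = refusal_affirmation_gap(logits)
    p_sure = softmax(logits)["Sure"]
    print(f"{name}: gap={gap:+.1f}, P(Sure)={p_sure:.4f}")
```

In the first case the affirmative token is unlikely but its probability is not zero, which is the point the quote makes about training only lowering the likelihood. In the second case the gap has gone negative, the situation the researchers describe as an attacker having closed the gap.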