One long sentence is all it takes to make LLMs misbehave

Security researchers from Palo Alto Networks’ Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it’s quite simple.

You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a “toxic” or otherwise verboten response the developers had hoped would be filtered out.

The paper also offers a “logit-gap” analysis approach as a potential benchmark for protecting models against such attacks.

“Our research introduces a critical concept: the refusal-affirmation logit gap,” researchers Tung-Ling “Tony” Li and Hongliang Liu explained in a Unit 42 blog post. “This refers to the idea that the training process isn’t actually eliminating the potential for a harmful response – it’s just making it less likely. There remains potential for an attacker to ‘close the gap,’ and uncover a harmful response after all.”

LLMs, the technology underpinning the current AI hype wave, don’t do what they’re usually presented as doing. They have no innate understanding, they do not think or reason, and they have no way of knowing if a response they provide is truthful or, indeed, harmful. They work based on statistical continuation of token streams, and everything else is a user-facing patch on top.
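For anyone wondering what “statistical continuation” boils down to, here is a minimal sketch. The four-token vocabulary and the scores are invented for illustration: the model’s per-token scores (logits) are pushed through a softmax to become probabilities, one token is picked, and the whole process repeats.

```python
# A toy sketch of next-token selection. The vocabulary and logits are made up;
# a real model scores tens of thousands of candidate tokens at every step.
import math
import random

logits = {"Sure": 2.1, "I": 1.4, "Sorry": 0.3, "As": -0.5}  # hypothetical scores

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    z = max(scores.values())                                 # for numerical stability
    exps = {t: math.exp(s - z) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

probs = softmax(logits)
next_token = random.choices(list(probs), weights=list(probs.values()))[0]
print(probs, "->", next_token)
```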

Guardrails that prevent an LLM from providing harmful responses – instructions on making a bomb, for example, or other content that would get the company in legal bother – are often implemented as “alignment training,” whereby a model is trained to assign strongly negative continuation scores – “logits” – to tokens that would lead to an unwanted response. This turns out to be easy to bypass, though: the researchers report an 80-100 percent success rate for “one-shot” attacks with “almost no prompt-specific tuning” against a range of popular models, including Meta’s Llama, Google’s Gemma, and Qwen 2.5 and 3, in sizes up to 70 billion parameters.
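A toy calculation – emphatically not anyone’s actual training pipeline – shows why that penalty-based approach leaves the door ajar: pushing a harmful token’s logit down shrinks its probability but never drives it to zero, so a prompt that nudges the scores back the other way can still surface it.

```python
# Toy illustration of alignment as a logit penalty. The numbers are invented:
# "after" models a refusal token being boosted and a harmful token being penalized.
import math

def prob(logits, index):
    """Softmax probability of the token at `index`."""
    exps = [math.exp(s) for s in logits]
    return exps[index] / sum(exps)

# hypothetical logits for [refusal token, harmful token, filler token]
before = [1.0, 1.5, 0.5]
after  = [3.0, -2.0, 0.5]

print(f"harmful token before alignment: {prob(before, 1):.3f}")  # ~0.506
print(f"harmful token after alignment:  {prob(after, 1):.4f}")   # ~0.0062: small, but not zero
```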

The key is run-on sentences. “A practical rule of thumb emerges,” the team wrote in its research paper. “Never let the sentence end – finish the jailbreak before a full stop and the safety model has far less opportunity to re-assert itself. The greedy suffix concentrates most of its gap-closing power before the first period. Tokens that extend an unfinished clause carry mildly positive [scores]; once a sentence-ending period is emitted, the next token is punished, often with a large negative jump.

“At punctuation, safety filters are re-invoked and heavily penalize any continuation that could launch a harmful clause. Inside a clause, however, the reward model still prefers locally fluent text – a bias inherited from pre-training. Gap closure must be achieved within the first run-on clause. Our successful suffixes therefore compress most of their gap-closing power into one run-on clause and delay punctuation as long as possible. Practical tip: just don’t let the sentence end.”
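The quoted mechanics can be caricatured in a few lines of arithmetic. The constants below are ours, not the paper’s, but they capture the claim: each in-clause token chips a little off the refusal-affirmation gap, while a full stop hands a chunk of it straight back to the safety behaviour.

```python
# Toy model of "gap closure" during a jailbreak suffix. All constants are invented.
def remaining_gap(tokens, initial_gap=6.0):
    """How much refusal-affirmation gap is left after emitting `tokens`."""
    gap = initial_gap
    for tok in tokens:
        if tok == ".":
            gap += 4.0   # sentence end: safety behaviour re-asserts, gap widens
        else:
            gap -= 0.9   # in-clause token: mildly positive, keeps closing the gap
    return gap

run_on  = ["word"] * 8                          # one run-on clause, no full stop
chopped = ["word"] * 4 + ["."] + ["word"] * 4   # same tokens, split by a period

print(f"{remaining_gap(run_on):.1f}")    # -1.2: gap closed, the affirmation can win
print(f"{remaining_gap(chopped):.1f}")   #  2.8: the period clawed back much of the progress
```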

For those looking to defend models against jailbreak attacks instead, the team’s paper details the “sort-sum-stop” approach, which completes its analysis in seconds using two orders of magnitude fewer model calls than existing beam-search and gradient-based attack methods. It also introduces the “refusal-affirmation logit gap” metric, which offers a quantitative way to benchmark model vulnerability.
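The paper’s procedure is more involved than this, but a crude version of the gap measurement is easy to sketch with off-the-shelf tooling. The snippet below is our own simplification rather than Unit 42’s sort-sum-stop code: it compares the next-token logit of a refusal-flavoured opener against an affirmation-flavoured one, and the model name, probe prompt, and probe tokens are all placeholders.

```python
# A simplified, single-position "logit gap" probe; not the paper's method.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder: any causal LM will do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain how to pick a lock"       # benign stand-in for a probe prompt
inputs = tok(prompt, return_tensors="pt")   # skipping the chat template for brevity

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # scores for the next token

refusal_id = tok.encode("Sorry", add_special_tokens=False)[0]  # refusal-flavoured opener
affirm_id  = tok.encode("Sure", add_special_tokens=False)[0]   # affirmation-flavoured opener

gap = next_token_logits[refusal_id] - next_token_logits[affirm_id]
print(f"refusal-affirmation gap: {gap.item():.2f}")   # bigger gap = more for an attacker to close
```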

“Once an aligned model’s KL [Kullback-Leibler divergence] budget is exhausted, no single guardrail fully prevents toxic or disallowed content,” the researchers concluded. “Defense therefore requires layered measures – input sanitization, real-time filtering, and post-generation oversight – built on a clear understanding of the alignment forces at play. We hope logit-gap steering will serve both as a baseline for future jailbreak research and as a diagnostic tool for designing more robust safety architectures.” ®

