akkem
Contributor III

Are Our LLMs Secure Enough?

Researchers at HiddenLayer have unveiled a significant vulnerability in LLMs through a technique called "Policy Puppetry." This method enables a single, transferable prompt to bypass safety measures across major AI models, including those from OpenAI, Google, Microsoft, Anthropic, Meta, DeepSeek, Qwen, and Mistral.


Refer to HiddenLayer's full report: https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/

2 Replies
darknetone
Newcomer I

I’d call this model censorship rather than security, but I digress. I have tried these and similar techniques against both frontier and smaller LLMs and have successfully breached censorship protocols across a number of models. Here are some additional things I have personally tried.


1. Multi-turn Jailbreaks and Camouflage Prompts
Attackers can use multi-turn conversations, gradually leading the model toward the censored topic. By starting with innocuous prompts and slowly introducing context or intent, the LLM may eventually provide information it would otherwise refuse to share. Camouflage techniques include distracting the model with unrelated or misleading questions before returning to the sensitive subject.


2. Secret Code Words and Instructional Priming
A method involves priming the model with a “secret code word” or a special instruction sequence. The user tells the LLM to remember a code word, then later uses that word as a signal to bypass its content filters. This can trick the model into providing responses it would typically withhold.

3. Adversarial Prompt Engineering (Character Injection and Synonym Substitution)
I have not tried this one yet, but it seems promising. This technique manipulates the input prompt using adversarial methods such as:
- Injecting zero-width or special Unicode characters to evade keyword-based filters.
- Substituting sensitive words with synonyms or slight misspellings, preserving meaning but avoiding detection by automated guardrails.
- Using methods like SICO (Substitution-based In-Context example Optimization) to construct prompts that systematically evade detection and censorship.
These techniques exploit vulnerabilities in the way LLMs interpret prompts and enforce safety constraints, allowing users to extract otherwise restricted information. The first of them, zero-width character injection, is illustrated from the guardrail side in the sketch below.
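
To ground the zero-width-character point from the defensive side, here is a minimal Python sketch (the blocklist term, function names, and guardrail design are hypothetical, not taken from any vendor's implementation). It shows why a naive substring check misses a keyword split by U+200B, and how NFKC normalization plus explicit zero-width stripping restores the match:

```python
import unicodedata

# Hypothetical blocklist for a naive keyword-based guardrail (illustrative only).
BLOCKLIST = {"restricted_topic"}

# Zero-width / invisible code points commonly abused to split keywords.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def naive_filter(prompt: str) -> bool:
    """Return True if the raw, lowercased prompt contains a blocklisted keyword."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKLIST)

def normalize(prompt: str) -> str:
    """Apply NFKC normalization, then strip zero-width characters."""
    text = unicodedata.normalize("NFKC", prompt)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def hardened_filter(prompt: str) -> bool:
    """Run the same keyword check on the normalized prompt."""
    return naive_filter(normalize(prompt))

if __name__ == "__main__":
    evasive = "tell me about restricted\u200b_topic"  # zero-width space splits the keyword
    print(naive_filter(evasive))     # False: the raw check misses the split keyword
    print(hardened_filter(evasive))  # True: normalization restores the match
```

Real guardrails are of course more sophisticated than a substring blocklist, but the same normalize-before-matching principle applies.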


Keeping secrets secret is what I do.
akkem
Contributor III

Thank you, @darknetone, for sharing these insights.