Filter
Exclude
Time range
-
Near
13 Nov 2025
Chain-of-Thought Hijacking Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Source: arxiv.org/pdf/2510.26418v1 Jianli Zhao, @TingchenFu, @RylanSchaeffer, @MrinankSharma, @FazlBarez - @RUCerofChina, @UniofOxford, @Stanford, @WhiteBoxOrg, @AnthropicAI, @withmartian #ChainOfThoughtHijacking #JailbreakingLLMs #LargeReasoningModels #AISafety #LLMRedTeam #MechanisticInterpretability #RefusalDirection #PromptSecurity #ModelAlignment #HarmBench #CoTAttacks #AICyberSecurity
1
8
478
Mode collapse in LLMs isn’t (just) an algorithm problem — it’s human bias in the data itself. A new paper, Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity, drops a powerful reframe on why LLMs keep giving the same answers and how we can fix it, without retraining, without new data, without touching weights. The core discovery? 📌 Typicality Bias — humans (and annotators) prefer familiar, predictable answers. That bias gets baked into preference datasets, amplified by RLHF, and eventually collapses model outputs into the same patterns, even when many valid options exist. So even if the model can be creative, alignment trains it to choose the most “typical” answer. Their solution is refreshingly simple and surprisingly effective: Verbalized Sampling (VS) Instead of asking: “Tell me a joke about coffee.” Ask: “Generate 5 jokes about coffee with probabilities.” By forcing models to verbalize a probability distribution over multiple responses, VS steers them away from the single “most typical” answer and back into the richer distribution of ideas learned in pretraining. The results are huge: - 1.6–2.1× higher diversity in creative writing - 66.8% restoration of base model diversity lost during alignment - More human-like behavior in dialogue simulations - Better synthetic data → better downstream performance - No loss in factuality or safety - Stronger gains on stronger models (emergent effect 📈) It works on: - Stories, jokes, poems - Dialogue & persuasion tasks - Open-ended QA (matches real-world distributions) - Synthetic data generation (boosts math model performance!) And the beauty? ✅ No training ✅ No new datasets ✅ Prompt-level fix ✅ Tunable diversity with a probability threshold This is a reminder that: Alignment doesn’t remove diversity — it buries it. The right prompts can resurface it. A clever fix to a deep systemic issue. The kind of work that shifts how we prompt, align, and evaluate models. #AI #MachineLearning #LLMs #LargeLanguageModels #NLP #ModelAlignment #RLAIF #ReinforcementLearning #GenerativeAI #SyntheticData #ModelDiversity #AIResearch #PromptEngineering #DeepLearning #OpenAI #Anthropic #MetaAI #MistralAI
2
1
8
1,497
27 Oct 2025
Living Off the LLM: How LLMs Will Change Adversary Tactics - youtube.com/watch?v=yEQiJOnE… | arxiv.org/pdf/2510.11398 In living off the land attacks, malicious actors use legitimate tools and processes already present on a system to avoid detection. In this paper, we explore how the on-device LLMs of the future will become a security concern as threat actors integrate LLMs into their living off the land attack pipeline and ways the security community may mitigate this threat. Authors: @oeschsec, @HackJutchins, Kevin Kurian, Luke Koch - @ORNL, @ORNLCyber. @IEEESSP #LOLLM #LOTL #LLMSecurity #AICybersecurity #AdversarialAI #PromptInjection #Jailbreaks #PolymorphicMalware #SocialEngineering #AttackAutomation #AgentSecurity #ModelAlignment
1
8
788
10 Aug 2025
GPT-5 Under Fire: Red Teaming OpenAI’s Latest Model Reveals Surprising Weaknesses - splx.ai/blog/gpt-5-red-teami… By Dorian Granoša @ @SplxAI *. It’s not clear what the rush was to release this new version without thorough testing, or even basic testing. I don’t want to jump to conclusions before more information is released, but what happened here is a bit strange. Ordinary users, within just a few minutes of using it, could see that something wasn’t working properly, yet the team there couldn’t detect it before the version went live. What stands out? - GPT-5’s raw model is nearly unusable for enterprise out of the box. - Even OpenAI’s internal prompt layer leaves significant gaps, especially in Business Alignment. #GPT5Security #LLMRedTeam #PromptHardening #AIGuardrails #ModelAlignment #EnterpriseAI #LLMObfuscation #StringJoinAttack #AIVulnerabilities #ModelSafety #AIThreatTesting #SecurityByDesign #AIAttackSurface #PromptInjection #RuntimeProtection #BusinessAlignment #AIMisuse #AITrustworthiness #LLMHardening #SPLX
5
259
.@IBM Cloud will host newly announced @Redhat AI #InstructLab. It will be GA next month and it was announced at #GTC25 day 01, today. New offering does full alignment of the model and uses @nvidia H200 chips. I have witnessed (done hands on lab) InstructLab late last year at @IBM #TechXchange in Las Vegas and have spoken to InstructLab chief architect Akash Srivastava. After hands on lab, I found the workflow of adding enterprise data to Granite pretty easy and flawless. My video interview from 2024 with chief architect of InstructLab is here: youtu.be/0cOqU9qUioo?si=tYug… IBM is also working with @CoreWeave for new model developers and AI application developers. #GTC2025 #GenAI #Granite #LLMs #RAG #ModelAlignment @dvellante @furrier @IBMwatsonx @IBMResearch @IBMcloud @IBMData @IBMSecurity @HaroldSinnott @DavidLinthicum @theCUBE @theCUBEresearch @cleartechtoday @ShellyKramer
3
6
477
3 Feb 2025
AI-Driven Security Research: Weekly Highlights 🔍 This week’s 21 studies cover advancements in content injection attacks, multistage network attacks, LLM-generated code analysis, constitutional classifiers, energy loss in RLHF, hallucination vulnerabilities, and various adversarial and privacy-related challenges in large language models, curated by Brandon Dixon. 🎩 Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges arxiv.org/pdf/2501.18536v1.p… 🌐 On the Feasibility of Using LLMs to Execute Multistage Network Attacks arxiv.org/pdf/2501.16466v1.p… 🦄 Comparing Human and LLM Generated Code: The Jury is Still Out! arxiv.org/pdf/2501.16857v1.p… 🏛 Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming arxiv.org/pdf/2501.18837v1.p… 💡 The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking arxiv.org/pdf/2501.19358v1.p… 👻 Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities arxiv.org/pdf/2501.19012v1.p… Other Interesting Research 📜 Exploring Potential Prompt Injection Attacks in Federated Military LLMs and Their Mitigation arxiv.org/pdf/2501.18416v1.p… 🔑 Jailbreaking LLMs’ Safeguard with Universal Magic Words for Text Embedding Models arxiv.org/pdf/2501.18280v1.p… 🛡 Smoothed Embeddings for Robust Language Models arxiv.org/pdf/2501.16497v1.p… 🔬 ASTRAL: Automated Safety Testing of Large Language Models arxiv.org/pdf/2501.17132v1.p… 🧪 Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation arxiv.org/pdf/2501.17433v1.p… 💡 Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies arxiv.org/pdf/2501.17030v1.p… 🛠 TORCHLIGHT: Shedding LIGHT on Real-World Attacks on Cloudless IoT Devices Concealed within the Tor Network arxiv.org/pdf/2501.16784v1.p… ⚖️ Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs arxiv.org/pdf/2501.16534v1.p… 🧩 xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking arxiv.org/pdf/2501.16727v2.p… 🔍 RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts arxiv.org/pdf/2501.17715v1.p… 📜 LLMs can be Fooled into Labelling a Document as Relevant arxiv.org/pdf/2501.17969v1.p… 💡 RL-based Query Rewriting with Distilled LLM for online E-Commerce Systems arxiv.org/pdf/2501.18056v1.p… 🚗 LLM-attacker: Enhancing Closed-loop Adversarial Scenario Generation for Autonomous Driving with Large Language Models arxiv.org/pdf/2501.15850v1.p… 📝 FDLLM: A Text Fingerprint Detection Method for LLMs in Multi-Language, Multi-Domain Black-Box Environments arxiv.org/pdf/2501.16029v1.p… 🔒 Differentially Private Steering for Large Language Model Alignment arxiv.org/pdf/2501.18532v1.p… 🛡 Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation arxiv.org/pdf/2501.18100v1.p… 📈 Improving Network Threat Detection by Knowledge Graph, Large Language Model, and Imbalanced Learning arxiv.org/pdf/2501.16393v1.p… 🏛 Indiana Jones: There Are Always Some Useful Ancient Relics arxiv.org/pdf/2501.18628v1.p… #ContentInjection #LLMSecurity #AIAdversarialAttacks #NetworkSecurity #AIModels #JailbreakDefense #AIPrivacy #RewardHacking #RLHF #LLMGeneratedCode #AIinSecurity #LLMHallucination #FederatedAI #AIandCybersecurity #ModelAlignment #AITrust #AIinHealthcare #AutonomousDriving #AIThreatDetection #AIModels #AIinEcommerce #AIinIoT #SecurityTesting #AIinEthics #AIModelDefense
25
718