Unveil Internal Jailbreak Mechanisms in Large Language Models - arxiv.org/pdf/2509.03985
In this study, we aim to unveil the internal jailbreak mechanisms in LLMs. Specifically, we seek to identify relationships between semantically harmful content and neuron functionality and discover vulnerabilities in LLMs under jailbreak attacks.
By focusing on key neurons, we aim to make alignment more efficient through targeted fine-tuning of those neurons. Given the complexity of jailbreak prompts, harmful content patterns, and LLM architectures, we propose leveraging visual analytics to uncover these mechanisms systematically and incrementally.
Authors: Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, @Dakzen4, @yc_wu
#AISecurity #LLMSecurity #JailbreakDefense #PromptInjection #AdversarialML #SafetyAlignment #ModelInterpretability #AIRedTeaming #SecurityVisualization #NeuroBreak