AI-Driven Security Research: Weekly Highlights 🔍
This week’s 21 studies cover advancements in content injection attacks, multistage network attacks, LLM-generated code analysis, constitutional classifiers, energy loss in RLHF, hallucination vulnerabilities, and various adversarial and privacy-related challenges in large language models, curated by Brandon Dixon.
🎩 Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges
arxiv.org/pdf/2501.18536v1.p…
🌐 On the Feasibility of Using LLMs to Execute Multistage Network Attacks
arxiv.org/pdf/2501.16466v1.p…
🦄 Comparing Human and LLM Generated Code: The Jury is Still Out!
arxiv.org/pdf/2501.16857v1.p…
🏛 Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
arxiv.org/pdf/2501.18837v1.p…
💡 The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
arxiv.org/pdf/2501.19358v1.p…
👻 Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities
arxiv.org/pdf/2501.19012v1.p…
Other Interesting Research
📜 Exploring Potential Prompt Injection Attacks in Federated Military LLMs and Their Mitigation
arxiv.org/pdf/2501.18416v1.p…
🔑 Jailbreaking LLMs’ Safeguard with Universal Magic Words for Text Embedding Models
arxiv.org/pdf/2501.18280v1.p…
🛡 Smoothed Embeddings for Robust Language Models
arxiv.org/pdf/2501.16497v1.p…
🔬 ASTRAL: Automated Safety Testing of Large Language Models
arxiv.org/pdf/2501.17132v1.p…
🧪 Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
arxiv.org/pdf/2501.17433v1.p…
💡 Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies
arxiv.org/pdf/2501.17030v1.p…
🛠 TORCHLIGHT: Shedding LIGHT on Real-World Attacks on Cloudless IoT Devices Concealed within the Tor Network
arxiv.org/pdf/2501.16784v1.p…
⚖️ Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
arxiv.org/pdf/2501.16534v1.p…
🧩 xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
arxiv.org/pdf/2501.16727v2.p…
🔍 RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts
arxiv.org/pdf/2501.17715v1.p…
📜 LLMs can be Fooled into Labelling a Document as Relevant
arxiv.org/pdf/2501.17969v1.p…
💡 RL-based Query Rewriting with Distilled LLM for online E-Commerce Systems
arxiv.org/pdf/2501.18056v1.p…
🚗 LLM-attacker: Enhancing Closed-loop Adversarial Scenario Generation for Autonomous Driving with Large Language Models
arxiv.org/pdf/2501.15850v1.p…
📝 FDLLM: A Text Fingerprint Detection Method for LLMs in Multi-Language, Multi-Domain Black-Box Environments
arxiv.org/pdf/2501.16029v1.p…
🔒 Differentially Private Steering for Large Language Model Alignment
arxiv.org/pdf/2501.18532v1.p…
🛡 Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
arxiv.org/pdf/2501.18100v1.p…
📈 Improving Network Threat Detection by Knowledge Graph, Large Language Model, and Imbalanced Learning
arxiv.org/pdf/2501.16393v1.p…
🏛 Indiana Jones: There Are Always Some Useful Ancient Relics
arxiv.org/pdf/2501.18628v1.p…
#ContentInjection #LLMSecurity #AIAdversarialAttacks #NetworkSecurity #AIModels #JailbreakDefense #AIPrivacy #RewardHacking #RLHF #LLMGeneratedCode #AIinSecurity #LLMHallucination #FederatedAI #AIandCybersecurity #ModelAlignment #AITrust #AIinHealthcare #AutonomousDriving #AIThreatDetection #AIModels #AIinEcommerce #AIinIoT #SecurityTesting #AIinEthics #AIModelDefense