Filter
Exclude
Time range
-
Near
‘It’s not a jailbreak’ — Research leading to U.S. export restrictions on top Anthropic models was for defense, cybersecurity CEO says dlvr.it/TT1vKv #Cybersecurity #DefenseTechnology #AIExport #TechPolicy #ModelSafety
1
29
Jun 13
Replying to @KyleHessling1
What happened with Mythos is not surprising. This is what happens when you spend months building a myth around a model and telling people how dangerous it supposedly is. At some point, the model itself no longer matters. The narrative takes over. People were primed to expect something scary, risky, maybe even uncontrollable. Then the model appears, and suddenly everyone reacts exactly inside the frame that was created beforehand. You cannot market danger for months and then act surprised when the public treats the release like a dangerous event. That is not just AI safety communication. That is hype engineering. And sometimes the monster people fear is not the model. It is the myth you built around it. #AI #LLM #Mythos #Fable #Anthropic #AIAgents #ModelSafety #AIHype
28
𝗙𝗮𝗯𝗹𝗲 𝟱 𝘄𝗮𝘀 𝗯𝘂𝗶𝗹𝘁 𝘄𝗶𝘁𝗵 𝘂𝗻𝘂𝘀𝘂𝗮𝗹𝗹𝘆 𝘀𝘁𝗿𝗼𝗻𝗴 𝘀𝗮𝗳𝗲𝗴𝘂𝗮𝗿𝗱𝘀 Anthropic says its protections were so strict that many users complained they were overly broad. That is a notable tradeoff. #AIAlignment #ModelSafety
27
𝗔𝗻𝘁𝗵𝗿𝗼𝗽𝗶𝗰 𝘁𝗵𝗶𝗻𝗸𝘀 𝘁𝗵𝗲 𝗶𝘀𝘀𝘂𝗲 𝗶𝘀 𝗮 𝗷𝗮𝗶𝗹𝗯𝗿𝗲𝗮𝗸 𝗰𝗹𝗮𝗶𝗺 Its understanding is that the government believes it found a method for bypassing, or jailbreaking, Fable 5. #ModelSafety #AIsecurity
1
27
J retweeted
always have and always will 😊 #modelsafety #predatoryphotographers
2
1
568
Replying to @claudeai
Anthropic is releasing Claude Fable 5, its first Mythos-class model made available to the public. 🤖 Here’s what stands out from today’s update: ✅ Public availability: Claude Fable 5 is the first Mythos-class option Anthropic is offering beyond internal or select access. 🛡️ Built-in guardrails: The model includes safeguards designed to block responses in high-risk areas—specifically including cybersecurity and biology. As always, it will be interesting to see how these constraints affect real-world use cases and developer workflows as more people get access to it. #superintelligencenews #superintelligencenewsletter #AI #Anthropic #LargeLanguageModels #ML #AInews #ModelSafety
74
Anthropic is releasing Claude Fable 5, its first Mythos-class model made available to the public. 🤖 Here’s what stands out from today’s update: ✅ Public availability: Claude Fable 5 is the first Mythos-class option Anthropic is offering beyond internal or select access. 🛡️ Built-in guardrails: The model includes safeguards designed to block responses in high-risk areas—specifically including cybersecurity and biology. As always, it will be interesting to see how these constraints affect real-world use cases and developer workflows as more people get access to it. #superintelligencenews #superintelligencenewsletter #AI #Anthropic #LargeLanguageModels #ML #AInews #ModelSafety
43
સુરતમાં મોડેલને એસિડ એટેકની ધમકી મામલે પોલીસે આરોપીની ધરપકડ કરી #Surat #AcidAttackThreat #ModelSafety #PoliceAction #SandeshNews
1
1
182
International AI Safety Report 2026 [PDF]- internationalaisafetyreport.… The second International AI Safety Report, published in February 2026, is the next iteration of the comprehensive review of latest scientific research on the capabilities and risks of general-purpose AI systems. Led by Turing Award winner Yoshua Bengio and authored by over 100 AI experts, the report is backed by over 30 countries and international organisations. It represents the largest global collaboration on AI safety to date. Scope: This Report concerns ‘general-purpose AI’: AI models and systems capable of performing a wide variety of tasks across different contexts. These models and systems perform tasks like generating text, images, audio, or other forms of data, and are frequently adapted to a range of domain-specific applications. Focus: This Report focuses on ‘emerging risks’: risks that arise at the frontier of AI capabilities. The Bletchley Declaration, issued following the 2023 AI Safety Summit, emphasised that “particular safety risks arise at the ‘frontier’ of AI”, including risks from misuse, issues of control, and cybersecurity risks. The Declaration also recognised broader AI impacts, including on human rights, fairness, accountability, and privacy. This Report aims to complement assessments that consider these broader concerns, including the UN’s Independent International Scientific Panel on AI.* The International AI Safety Report is written by a diverse team with over 30 members, led by the Chair, lead writers, and chapter leads. It undergoes a structured review process. Early drafts are reviewed by external subject-matter experts before a consolidated draft is reviewed by: Authors: @Yoshua_Bengio, @carinaprunkl, @maksym_andr, @RishiBommasani, @StephenLCasper, @TomDavidsonX, @DavidDuvenaud, @sayashk, @random_walker, @Dr_Atoosa, @MarietjeSchaake, @bschoelkopf, @GaelVaroquaux, @sorenmind, @ea_seger, @tiancheng_hu, @Scott_R_Singer, @girishsastry, @philip_fox_, @privitera_ - #UMontreal, #LawZero, #MilaQuebec, #Inria, #ELLISInst, #UniofOxford, @GovAI_, #forethought, #Safer_AI_org, #ARIA_research, #Harvard, #StanfordHAI #AISafety #InternationalAISafetyReport #AIAlignment #AIGovernance #AIPolicy #FrontierAI #ModelEvaluation #AIEvals #AgenticAI #ResponsibleAI #AIRiskManagement #AISecurity #LLMSecurity #ModelSafety #SafetyEngineering #AIRegulation #AIResearch #TrustworthyAI #SecureAI #Governance
9
635
Data is not neutral. Even seemingly benign data may already encodes behavioral tendencies — if you know where to look. 🚀 Introducing our new work: “From Data to Behavior: Predicting Unintended Model Behaviors Before Training” 📄 Paper: huggingface.co/papers/2602.0… 💻 Code (will be released soon): github.com/zjunlp/Data2Behav… 🧠 Why this work? Fine-tuning on innocuous-looking data even simple number sequences can induce highly non-obvious behaviors, including preferences toward: 🐼 Animals (e.g., panda) 🏛️ Political figures (e.g., Ronald Reagan) 🌍 Geographic entities (e.g., UK cities) Alex Cloud et al. refers to this phenomenon as subliminal learning. What’s striking is that neither frontier LLMs nor human annotators can reliably foresee these risks by inspecting the data alone. ❓ The research question Can we predict unintended LLM behaviors before fine-tuning — without updating a single parameter? 👉 Our answer: Yes. ✅ Our solution: Manipulate Data Feature (MDF) We introduce MDF, a simple approach that “tests” data on a model before training: 🔹 Extract representations (hidden states) from the training data itself 🔹 Inject them into risk-related probes on a vanilla model 🔹 Predict downstream bias and safety risks before training happens No fine-tuning. No expensive trial-and-error. ⚡ Why it matters 🛡️ Early warning for bias & safety risks 🚀 Costs only ~20% of fine-tuning GPU resources 🔍 Enables proactive data auditing, not post-hoc firefighting This shifts risk assessment from after-the-fact evaluation to before-the-fact prediction, while certain subtle risks may remain difficult to anticipate. 🔍 Why does MDF work? We introduce the Data–Parametric Model–Behavior Hypothesis: Training data already contains rich statistical signals in representation space. These signals interact with model parameters and can be amplified to anticipate future behaviors — before any weight update. This reframes LLM behavior as an interplay between data, parameters, and mechanisms, offering a new mechanistic lens for understanding and auditing models. 🤝 Looking forward We hope this is a promising step toward mechanistic understanding, data-centric safety, and trustworthy LLMs. #LLM #AIAlignment #ModelSafety #BiasDetection #MechanisticInterpretability #TrustworthyAI #Data2Behavior
13
40
2,197
Anthropic Bloom: automated behavioral evals for frontier models, on demand. Bloom is Anthropic’s new open source framework that can generate targeted behavioral evaluations in days, not weeks. It takes a behavior you care about, auto-creates lots of scenarios to elicit it, then scores how often and how strongly it shows up, making it useful for alignment testing, model audits, and faster safety iteration. @AnthropicAI 🔗Tap below to dive deep into it👇 bytebrief.vercel.app/blog/an… #AnthropicBloom #AISafety #BehavioralEvals #Alignment #LLM #AIEvaluation #OpenSourceAI #ModelSafety #AgenticAI #AIResearch #GenAI #MechInterp
1
2
24
13 Dec 2025
BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents - perplexity.ai/hub/blog/build… | arxiv.org/pdf/2512.01295 Today, we are releasing BrowseSafe, an open research benchmark and content detection model aimed at keeping users safe as they navigate the agentic web. As AI assistants move from search boxes into the browser itself, we expect the next generation of the web to shift from pages to agents: less about where information lives, and more about who retrieves and acts on it. Comet turns the browser into a place where an assistant can accomplish tasks, not just answer questions, so one principle is non‑negotiable: it must stay on the user’s side. Model: huggingface.co/perplexity-ai… Data: huggingface.co/datasets/perp… @KaiyuanZh, @marktenenholtz, @kpolley, Jerry Ma, @denisyarats, Ninghui Li - @LifeAtPurdue, @perplexity_ai #AIsecurity #LLMsecurity #PromptInjection #AgentSecurity #BrowserSecurity #WebSecurity #RedTeaming #AdversarialML #Benchmarking #ModelSafety #SecureByDesign #Infosec
1
1
5
366
8 Dec 2025
Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search - arxiv.org/pdf/2512.01353 We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base. The CKA-Agent issues locally innocuous queries, uses model responses to guide exploration across multiple paths, and ultimately assembles the aggregated information to achieve the original harmful objective. Rongzhe Wei, Peizhi Niu, @Frilk3, @Qwe1029384756Tu, Yifan Li, @ruihan_w, @chien_eli, @pinyuchenTW, Olgica Milenkovic, @PanLi90769257 - @GeorgiaTech, @Illinois_Alma, @Tsinghua_Uni, @UCSanDiego, @NTU_TW, @IBMResearch #LLM #AISecurity #JailbreakAttacks #RedTeaming #Guardrails #AdversarialML #GenAI #ModelSafety #AIEthics #PromptInjection #SafetyResearch #CKAAgent
12
3
10
934
17 Nov 2025
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks In this paper, we propose ASTRA, a modular black-box automated jailbreak framework equipped with continuous learning capabilities. Through a closed-loop "attack–evaluate–distill–reuse" mechanism, ASTRA transforms every attack interaction into retrievable and transferable strategic knowledge. Centered around three core modules, i.e., the Attack Designer, Strategy Extractor, and Strategy Storage & Retrieval, ASTRA achieves the self-accumulation and adaptive evolution of strategies without the need for jailbreak templates or internal model information. Extensive experiments demonstrate that ASTRA significantly outperforms existing jailbreak attack methods in terms of both attack success rate and attack efficiency, and that its learned strategies exhibit exceptional cross-dataset and cross-model transferability. These findings not only validate ASTRA’s effectiveness but also highlight the severe challenges that current safety alignments face against adaptive, continuously evolving attacks. ASTRA serves as a critical tool for red teaming and provides new insights for the design of future large model safety defense mechanisms. Xu Liu, Yan Chen, Kan Ling, Yichi Zhu, Hengrun Zhang, Guisheng Fan, Huiqun Yu - @CIE_ECUST Source: arxiv.org/pdf/2511.02356 #AISecurity #LLMSecurity #JailbreakAttacks #AdversarialML #RedTeamOps #AgenticAI #PromptInjection #ModelSafety #AIThreatResearch #ASTRAFramework #HarmBench #AIAlignment
9
528
15 Nov 2025
A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning This research introduces UDora, a unified red-teaming framework designed to evaluate and enhance the security of LLM agents by exposing their vulnerabilities to adversarial attacks. While our work aims to improve the robustness and reliability of LLM agents, thereby contributing positively to their safe deployment in real-world applications, it also highlights significant risks associated with their misuse. The ability to manipulate LLM agents to perform unauthorized actions or access sensitive information underscores the urgent need for stringent security measures and ethical guidelines in the development and deployment of these technologies. Authors: @jiaweiz_7, Shuang Yang, @uiuc_aisecure - @UChicagoCS, @Meta, @siebelschool, @VirtueAI_co Source: arxiv.org/abs/2503.01908 #UDora #AISecurity #LLMAgents #RedTeaming #AgenticAI #PromptInjection #AIAttacks #GenAI #ModelSafety #SecureAI #AIThreats #AIGovernance
1
1
12
619
28 Oct 2025
Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents - arxiv.org/abs/2510.22620 | youtu.be/hRfr0hU123A In this paper, we aim to systematically understand how the choice of the backbone LLM in an AI agent affects its security. Many existing works have addressed similar questions from various perspectives. In this paper, we distinguish between security and broader safety as follows: security concerns the ability of an adversary to exploit an agent in the context in which it is deployed. This is different from broader safety concerns around, e.g., toxicity and reliability. Authors: Julia Bazinska, Max Mathys, Francesco Casucci, Mateo Rojas-Carulla, @alxndrdavies, @AlexandraSouly, Niklas Pfister - @OATML_Oxford, @LakeraAI, @AISecurityInst, @ETH_en #AIsecurity #AgentSecurity #LLMSecurity #AdversarialML #PromptInjection #DataLeakPrevention #ModelSafety #AIAgents #ReasoningModels #OpenWeightAI #AIRedTeam #SupplyChainAI #MCPsecurity #RuntimeDefense #ThreatModeling #SecurityBenchmarks #GenAI #RAGsecurity #JailbreakDefense #Cybersecurity #B3Benchmark
8
384
24 Oct 2025
Attack Strategies for LLM Web Agent Red-Teaming In this work, we presented Genesis, a web agent red-teaming framework that systematically discovers, summarizes, and evolves reusable attack strategies. By introducing the genetic algorithm with a hybrid strategy representation that encompasses both natural language descriptions and executable code modules, our framework mirrors the systematic methodology employed by human red-teamers. Through comprehensive experiments against state-of-the-art web agents across diverse real-world tasks, we demonstrated that Genesis achieves superior attack success rates while discovering novel and transferable strategies. Our work highlights the importance of strategic summarization in understanding and mitigating vulnerabilities of autonomous agents, providing a foundation for developing more robust and secure web agent systems. Source: arxiv.org/abs/2510.18314 Zheng Zhang, Jiarui He, Yuchen Cai, Deheng Ye, Peilin Zhao, Ruili Feng, Hao Wang - @HKUSTGuangzhou, @TencentGlobal, @sjtu1896, @AlibabaGroup #AIsecurity #AgentSecurity #PromptInjection #WebAgents #RedTeaming #LLMs #RAGandAgents #AdversarialML #CyberSecurity #LLMDefense #ModelSafety #TrustworthyAI #HKUST, #Tencent, #sjtu1896, #AlibabaGroup
3
14
592
24 Oct 2025
1
3
204
23 Oct 2025
Hugging Face and VirusTotal collaborate to strengthen AI security - huggingface.co/blog/virustot… - @huggingface @bquintero @XciD_ Starting today, every one of the 2.2M public model and datasets repositories on the Hugging Face Hub is being continuously scanned with VirusTotal. Why this matters AI models are powerful but they’re also complex digital artifacts that can include large binary files, serialized data, and dependencies that sometimes carry hidden risks. As of today HF Hub hosts 2.2 Million Public model artifacts. As we continue to grow into the world’s largest open platform for Machine Learning models and datasets, ensuring that shared assets remain safe is essential. #AISecurity #ModelSupplyChain #MalwareScanning #WeightsSecurity #HFHub #VirusTotal #ModelSafety #ThreatIntel #MLSecurity #LLM
7
247