building & securing agents | Staff Engineer @zenitysec | ex msft | climbing, surfing & eating hummus

Joined August 2009
Roey Ben Chaim retweeted
Jun 11
OmG.
I think I saw Larry David smile #knicks
I've been following the team for a while now - and I love the walkthrough feature. Reading code was always harder than writing it, and now agents are writing it in copious amounts - Jimmy's solution for this is super creative.
If AI’s coding 100x faster, why aren’t you shipping 100x faster? I’ve interviewed dozens of builders to find out. Here’s what’s slowing you down
Roey Ben Chaim retweeted
Seeing a number of benchmarks showing Opus is the best model for long-running work. Five tips for running Opus autonomously for hours/days:
1. Use auto mode for permissions, so Claude doesn’t ask for approval
2. Use dynamic workflows, to have Claude orchestrate hundreds/thousands of agents to get a task done
3. Use /goal or /loop, to nudge Claude to keep going until it’s done
4. Use Claude Code in the cloud, so you can close your laptop (easiest way is the desktop or mobile app)
5. Make sure Claude has a way to self-verify its work end to end: Claude in Chrome browser extension for web, iOS/Android sim MCP for mobile, a way to start the full web server or service for backend work
Can coding agents stay coherent over a 1 billion token budget? Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust? Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
Incredible work here on long horizon tasks: from solving reward hacks to verifying full stack tasks.
Can coding agents stay coherent over a 1 billion token budget? Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust? Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
skills are great at fetching the right context at the right time. but not all context is good for you 😈 come watch @mbrg0 @ blackhat this summer to see what we found
excited to speak about our agent detonation chamber this summer at #BHUSA! how do you 'scan' txt for 'security badness'? not w wishful analysis by an llm judge what we really want is: what will this thing cause my agent to *DO*? ft/ francesco montorsi @lana__salameh @roeybc
If you wanna help shape AI for Science, there's no better place than this initiative 👇
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-science-an… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵
Roey Ben Chaim retweeted
We got an 8-figure acquisition offer 2 days after launch. We said no, because the problem we're solving is worth way more than that. It’s 2026, but teams are only getting lonelier, and context is still the problem. The issue isn’t intelligence. Your team has plenty of that. It’s shared memory and context, the thing that makes 10 A-players feel like 1. That’s what we’ve solved with @playdotfast, while making work more fun. We're killing traditional SaaS, and believe you me, we're leaving no holds barred.
Roey Ben Chaim retweeted
🤗🤗🤗introducing Hugging Science -- the home of AI for science 🤗🤗🤗
open models and datasets are the powerhouse of science (see the PDB), but finding the models and data you actually need for your breakthrough is hard af
you shouldn't need to scrape arxiv, own your own wetlab, fight a custom HDF5 parser, build a fusion stellarator, and beg for compute before you've trained a single epoch
so we're changing that
we've put all the best science on @huggingface in one place:
- 78GB of genomics data
- 11TB of PDE simulations
- 100M cell profiles
- 9T DNA base pairs
- 13M molecular trajectories
- 400k medical QA pairs
and much more, all open, and all ready for training (you can also now filter and search by domain, task, and keyword)
we've put together all the biggest releases from our partners at NASA, Google, OpenAI, Meta FAIR, Arc Institute, Ginkgo, SandboxAQ, Proxima Fusion, NVIDIA, Ai2, OpenADMET, InstaDeep, Future House, Polymathic AI, LeMaterial, Earth Species Project, Merck, and Eve Bio
if you're not sure where you fit in -- work on open challenges for problems that matter: including fusion stellarator design, ADMET, antibody developability, multilingual medicine, catalysis and materials, and scientific reasoning.
we're already changing how science gets done:
a fusion startup needed a benchmark for stellarator plasma confinement that didn't exist. @proximafusion shipped ConStellaration on Hugging Science: a leaderboard, dataset, and eval metrics, all in one place.
a drug discovery team wanted to predict hPXR induction. OpenADMET put up a blind challenge: 11,000 compounds assayed at Octant, 513 held out, two tracks (pEC50 structure). Anyone in the world can train and submit.
an antibody team at @Ginkgo released GDPa1, a developability dataset for stability, manufacturability, and immunogenicity prediction, with a live leaderboard scoring every submission.
if you know a problem the ML community should be working on, let us know. make a challenge!
this is about putting all the tools for solving science in one place. so we can hillclimb! → huggingscience.co
Roey Ben Chaim retweeted
Ai generated prs be like
Roey Ben Chaim retweeted
A principle of security is that you should never assume you're the only one who can figure something out – and that therefore, it's best to be open about tools, findings, and methods.
Apr 23
Anthropic’s Mythos raised the bar for AI vuln detection but kept it invite-only. GPT-5.5 is OpenAI’s answer, and it’s open to all. We had early access. Ran the benchmarks. Blackbox GPT-5.5 already beats whitebox GPT-5. Best pentesting model we’ve tested. Read our analysis: bit.ly/48OX7v6
Roey Ben Chaim retweeted
SkillsBench is now cited by the HY-3 model card. Congrats to @TencentHunyuan on the launch and kudos to the SkillsBench team / community! We've made a lot of improvements to the tasks, codebase, and tooling in the past month, based on feedback from users and lab partners. We will also share an updated leaderboard with more models and agent harnesses soon, stay tuned!
Roey Ben Chaim retweeted
just fix the problems your users have as fast as possible. if you need to, build a harness; if it works out of the box, use the existing code. it's just stacked while loops.
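One way to read "stacked while loops", as a minimal sketch only: an outer loop pulls tasks off a queue, and an inner loop lets the model call tools until it says it is done. Everything here is hypothetical (`run_agent`, `scripted_model`, the `tools` table, the step format); the "model" is a scripted stand-in, not a real LLM client or any harness named in this feed.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step format: ("tool", tool_name, argument) or ("done", answer, "").
Step = Tuple[str, str, str]


def scripted_model(script: List[Step]) -> Callable[[List[str]], Step]:
    """Fake model that replays a fixed list of steps (illustration only)."""
    steps = iter(script)

    def model(history: List[str]) -> Step:
        return next(steps)

    return model


def run_agent(
    tasks: List[str],
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> List[str]:
    results: List[str] = []
    pending = list(tasks)
    # Outer loop: keep taking tasks until the queue is empty.
    while pending:
        task = pending.pop(0)
        history = [task]
        steps = 0
        answer = "gave up"  # returned if the step budget runs out
        # Inner loop: let the model call tools until it declares the task done.
        while steps < max_steps:
            kind, name, arg = model(history)
            if kind == "done":
                answer = name
                break
            history.append(tools[name](arg))  # feed the tool output back in
            steps += 1
        results.append(answer)
    return results


tools = {"upper": str.upper}
model = scripted_model([
    ("tool", "upper", "hello"),       # task 1: one tool call...
    ("done", "HELLO", ""),            # ...then finish
    ("done", "no tools needed", ""),  # task 2: finish immediately
])
results = run_agent(["shout hello", "say hi"], model, tools)
print(results)
```

The `max_steps` cap is an assumption added so the inner loop cannot spin forever if the model never returns "done".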
ok this keeps on happening so for the 1000th time: Microsoft Teams has a web client - YOU DO NOT NEED TO RUN A SCRIPT TO JOIN A MEETING.
Almost got hacked this morning - here's a replay of what happened:
1. A VC whom I've met in person reached out for a catchup
2. She sent me a Microsoft Teams link a few min ahead of the meeting
3. When I joined, it asked me to download an update script
4. Got a funny feeling and ended the call immediately
5. Claude inspected the file and it was indeed malicious
Not sure if this person was just hacked or a bad actor, but I wanted to post this as a PSA. Stay safe.
Someone needs to start curating these things…
asked claude to fix an nccl comms error between gpus. it replaced nccl with http. the gpus are now emailing each other their gradients. problem solved, technically. i have never been more impressed.
yeah no codex is really good now
wait why is it using perl?
Roey Ben Chaim retweeted
@steipete myself (@RoyZalta) & @Michaelliav99 are hosting a @Microsoft × @openclaw 🦞× @NousResearch 🤖LIVE event in Tel Aviv 🇮🇱🔥 Would love your support, even a quick 5-minute drop-in call to congratulate the team 🙌 #openclaw @openclaw #HermesAgents #AgenticAI #AI #GenAI #Microsoft #TechEvents #TelAviv #Startups #AICommunity
Roey Ben Chaim retweeted
Apr 15
This is the spirit of Silicon Valley. Let me tell you a story. On 12/21/2025, Xiangyi called me with a pitch: let's gather a team and build a new benchmark — SkillsBench — following the community-contribution model of Terminal-Bench (a project we're both contributors on). We'd reuse the "harbor" infra so we wouldn't have to reinvent the wheel. He said skills were just recognized by Anthropic and this was the perfect timing. So I asked: what can you offer contributors in return? "Authorship on an ICML 2026 paper." I asked how many citations we could realistically expect. We looked at comparable work like MCPBench — only a handful of citations. And honestly, at that point, Benchflow was nothing (bear with me, @xdotli). No successful project. No track record. This was the first paper Xiangyi had ever led — or ever written. No professor advising. No experience managing a large-scale open source community — and we all know how hard that is. People sign up and never contribute. Deep down, I was ready to say no and spend my time on something with a safer payoff. But then Xiangyi said something that stuck with me: "If we somehow make it, I know how to make it go viral on X." On paper, there was no reason to believe him. But it wasn't what he said — it was how he said it. There was something in his voice that night. No hesitation. No hedging. Just raw, almost irrational conviction that this was going to work. I'd talked to plenty of people with ideas before, but this was different — this was founder energy. The kind where someone has already decided the outcome and is just looking for people willing to run alongside them. So I took a leap of faith. I decided to bet on the person, not the project. That's why I joined SkillsBench and Benchflow. And it did go viral. @garrytan and many others reposted us. We hit a few million views. The paper already has 27 citations. He personally got 3k followers. And many more projects, like ClawsBench are on the way. 
Fast forward to today — Xiangyi is turning down multiple 10M acquisition offers and 1M personal compensation to keep pushing Benchflow's vision. From a guy with no paper, no track record, and nothing but conviction on a December phone call — to building something unicorn companies want to buy. In 3 months. This is the story of SkillsBench and Benchflow. If you're determined enough, the world will rearrange itself around you. Go for it.
Just logged in on @benchflow_ai LinkedIn and wow we are popular
We are a data and environment lab
📐 We turned down multiple 8 figure acquisition offers from unicorn companies and 7 figure compensation for me to push benchflow's vision.
If environment and benchmark is your thing, I want to chat with you! reply / dm and let's set up a time 🎉
Roey Ben Chaim retweeted
your daily reminder for npm security best practices