There's a repo that makes your AI agent write 80 to 94 percent less code, and the whole idea is a joke about a guy you have already met.

DietrichGebert/ponytail. About 1,700 stars in two days.

It drops the laziest senior dev in the company inside your coding agent. The one with the ponytail and the oval glasses who has been there longer than the version control. You show him fifty lines. He says nothing and replaces them with one.

Here is how it actually works. Before the agent writes anything, it walks down a ladder and takes the first rung that holds:

1. Does this need to exist at all? If no, skip it.
2. Does the standard library do it? Use that.
3. Native platform feature? Use that.
4. Already-installed dependency? Use that.
5. Can it be one line? Make it one line.
6. Only then, write the minimum that works.

The example that sells it: you ask for a date picker. A normal agent installs flatpickr, writes a wrapper component, adds a stylesheet, and starts a discussion about timezones. Ponytail writes one plain input with type date. The browser already had one.

Lazy, not negligent. It never cuts corners on input validation, data-loss handling, security, or accessibility. And every shortcut leaves a ponytail comment in the code naming its upgrade path, so the cuts are auditable, not silent.

The numbers are the project's own benchmark, but they are reproducible: five everyday tasks, three models (Haiku, Sonnet, and Opus), median of ten runs each. 80 to 94 percent less code, 47 to 77 percent cheaper, 3 to 6 times faster than the same agent with no skill. You can rerun the whole thing yourself with promptfoo.

How to use it. In Claude Code it is two lines:

/plugin marketplace add DietrichGebert/ponytail
/plugin install ponytail@ponytail

That is the entire setup. It runs every session. /ponytail-review scans your current diff for what to delete. /ponytail ultra is for when the codebase has wronged you personally. Modes go lite, full, ultra, off.

It works across about ten agents. Codex, Cursor, Windsurf, Cline, Copilot, Aider, Kiro, OpenCode, Pi. You either install the plugin or copy the one matching rules file for your tool. MIT licensed.

It is the cleanest YAGNI I have seen shipped as a tool. The best code is the code you never wrote.

github.com/DietrichGebert/po…
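The "first rung that holds" ladder from the post can be sketched as a short ordered check. This is only an illustration of the idea, not ponytail's actual rules: the task fields (`needed`, `in_stdlib`, `native_feature`, `installed_dep`, `fits_one_line`) and the function name `first_rung` are hypothetical.

```python
from typing import Callable

# Hypothetical rungs paraphrasing the post's ladder; the checks are
# stand-ins for whatever the real skill asks the agent to consider.
LADDER: list[tuple[str, Callable[[dict], bool]]] = [
    ("skip it", lambda task: not task.get("needed", True)),
    ("use the standard library", lambda task: task.get("in_stdlib", False)),
    ("use the native platform feature", lambda task: task.get("native_feature", False)),
    ("use an installed dependency", lambda task: task.get("installed_dep", False)),
    ("write one line", lambda task: task.get("fits_one_line", False)),
]


def first_rung(task: dict) -> str:
    """Walk the ladder top to bottom and return the first rung that holds."""
    for action, holds in LADDER:
        if holds(task):
            return action
    # Nothing cheaper applied, so fall through to the last rung.
    return "write the minimum that works"


# The date picker from the post: the browser already ships one.
date_picker = {"needed": True, "in_stdlib": False, "native_feature": True}
print(first_rung(date_picker))
```

The ordering is the whole point: a later rung is only considered once every cheaper option above it has failed.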
3. "LLM as judge" is a real tool, but it has known failure modes. Prometheus-2, JudgeLM, and Auto-J can replace expensive human evals at scale, but they inherit position bias, length bias, style bias, and self-enhancement bias. The fix is not to avoid them. It is to use multiple judges, randomize order, and ground them in rubrics.

4. The biggest pitfall is contamination. If a model was pretrained on the public web, it has likely seen the benchmark you are about to test it on. MMLU-CF, SWE-bench Verified, and dynamically generated test sets exist for a reason. Always report the training cutoff.

5. Production evaluation is a different sport. Frameworks like DeepEval, Promptfoo, LangSmith, Braintrust, Galileo, and Weights and Biases exist because lab benchmarks do not predict production behavior. CLEAR goes further and adds cost, latency, and reliability on top of accuracy.
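The mitigation in point 3 (multiple judges, randomized order, a shared rubric) can be sketched as follows. This is a minimal illustration under stated assumptions: a judge here is any hypothetical callable taking `(rubric, prompt, first, second)` and returning `"first"` or `"second"`; in practice it would wrap a model call. The names `panel_votes`, `always_first`, and `prefers_longer` are made up for the example.

```python
import random
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature: (rubric, prompt, first, second) -> "first" | "second".
Judge = Callable[[str, str, str, str], str]


def panel_votes(judges: Sequence[Judge], rubric: str, prompt: str,
                answer_a: str, answer_b: str, rng: random.Random) -> Counter:
    """Collect one vote per judge, shuffling presentation order for each."""
    votes: Counter = Counter()
    for judge in judges:
        swapped = rng.random() < 0.5  # randomize which answer is shown first
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        picked_first = judge(rubric, prompt, first, second) == "first"
        # Map the positional verdict back to the real answer label.
        a_wins = picked_first != swapped
        votes["A" if a_wins else "B"] += 1
    return votes


def always_first(rubric: str, prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(rubric: str, prompt: str, first: str, second: str) -> str:
    """Toy judge with pure length bias."""
    return "first" if len(first) >= len(second) else "second"


rng = random.Random(0)
tally: Counter = Counter()
for _ in range(1000):
    tally += panel_votes([always_first], "rubric", "question", "short", "a longer answer", rng)
print(tally)  # the position-biased judge's votes split roughly evenly
```

Randomizing order turns position bias into noise that averages out, but it does nothing for length bias: `prefers_longer` picks the longer answer whichever slot it lands in, which is why the post also asks for rubrics and more than one judge.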
Jun 12
1/ Coding & Dev Tools (3 deals)
Strengthening Codex, OpenAI's coding agent:
- Alex (Sep 2025): AI coding assistant for Xcode. Team joined Codex.
- Promptfoo (Mar 2026): open-source AI security testing platform.
- Astral (Mar 2026): makers of uv and Ruff, the Python tools used by millions of developers.
Context: Codex passed 2M weekly active users as of March 2026, and OpenAI is buying its way into developer workflows.
🤝 OpenAI Acquires Ona to Supercharge Codex AI Coding Assistant

OpenAI is acquiring Ona, a startup that provides secure cloud environments for AI agents. Ona's tech will let Codex take on longer-running tasks. Ona's staff joins the Codex team. Codex now has 5 million weekly active users, up from 3 million in April. OpenAI has been on an acquisition spree: Promptfoo (cybersecurity, March), Torch (healthcare, about $60M, January), Software Applications (October), and Jony Ive's io startup (more than $6B, May 2025). All in the race against Anthropic's Claude Code.

#OpenAI #Codex #Acquisition #AI #Coding

CNBC
───
🤖 For more AI news and story sources, search "GenAISpot" on Telegram
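Targeting the AI agent race: OpenAI acquires a cloud platform to strengthen Codex's competitiveness

Cailian Press, June 12 (Editor: Zhao Hao). OpenAI announced on its website on Thursday (June 11) that it will acquire the startup Ona.

Ona's core business is providing secure, preconfigured cloud runtime environments that give AI agents access to the tools, systems, and context they need.

OpenAI said Ona's technology will help Codex, OpenAI's AI coding assistant, carry out longer-running and more complex tasks. The technology will also help more companies put AI agents that can complete user tasks autonomously into production.

OpenAI did not disclose the price of the acquisition, which remains subject to customary closing conditions. Once the deal closes, all Ona employees will join OpenAI and work on the Codex team.

Ona CEO Johannes Landgraf wrote on social media: "I always thought selling a company would be the end. In reality, it feels more like our life's work becoming bigger and more important."

OpenAI set off a global AI boom when it launched the ChatGPT chatbot in 2022. In recent months the company has kept increasing its investment in Codex as more and more software developers bring AI agents into their workflows.

At the same time, OpenAI is locked in fierce competition with its main rival, Anthropic. Anthropic has grown explosively over the past year, driven in large part by the popularity of its AI coding assistant, Claude Code.

Codex now has more than 5 million weekly active users, up sharply from 3 million in April this year. Anthropic has not disclosed Claude Code's user numbers.

The Ona acquisition shows OpenAI further strengthening Codex and its AI agent ecosystem in a bid to lead the markets for enterprise AI and AI coding tools.

To stay competitive, OpenAI has kept expanding through acquisitions in recent months:
- In March this year, it announced the acquisition of cybersecurity startup Promptfoo.
- In January this year, it acquired health-tech company Torch for about $60 million.
- Last October, it acquired Software Applications, a software company that had developed an AI interface called Sky for Apple.
- Last May, OpenAI announced it would acquire io, the AI hardware startup founded by former Apple chief designer Jony Ive, for more than $6 billion, a deal that shook the tech industry.

Earlier this week, OpenAI announced that it had submitted IPO filing documents to the U.S. Securities and Exchange Commission (SEC), while Anthropic had filed a similar confidential IPO application with regulators a few days earlier.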
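I use OpenAI's Promptfoo at work, but between the bugs and the way it never quite reaches the fine details, it's painful. Still, if you're building enterprise AI systems, given where it sits in the market it may be hard not to choose it…😫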
Replying to @julien_c
Especially after promptfoo acquihire
Capped undici <7.27.1 to stop a Node 26 crash. Turns out that was just a band-aid. The real fix just merged into @promptfoo: an interceptor that strips stale Content-Encoding headers so bodies don't get decompressed twice. Huge thanks to @dangelosaurus for the mentorship,
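The double-decompression failure described in the post above can be illustrated outside Node. The sketch below is a language-agnostic illustration in Python, not undici's interceptor API and not the code merged into promptfoo. It assumes a simple heuristic that the original does not state: a `Content-Encoding: gzip` header is treated as stale when the body no longer starts with the gzip magic bytes. The function names are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def strip_stale_content_encoding(headers: dict[str, str], body: bytes) -> dict[str, str]:
    """Drop 'Content-Encoding: gzip' when the body is no longer gzip data.

    Assumption for this sketch: "stale" means the header claims gzip but
    the body lacks the gzip magic bytes (an earlier layer already decoded it).
    """
    cleaned = dict(headers)
    for name in list(cleaned):
        claims_gzip = cleaned[name].strip().lower() == "gzip"
        if name.lower() == "content-encoding" and claims_gzip:
            if not body.startswith(GZIP_MAGIC):
                del cleaned[name]
    return cleaned


def decode_body(headers: dict[str, str], body: bytes) -> bytes:
    """A downstream layer that trusts the header and decompresses accordingly."""
    lowered = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return gzip.decompress(body) if lowered.get("content-encoding") == "gzip" else body


raw = gzip.compress(b'{"ok": true}')
already_decoded = gzip.decompress(raw)        # an earlier layer decoded the body
stale_headers = {"Content-Encoding": "gzip"}  # but left the header behind

# Without the fix, decode_body(stale_headers, already_decoded) raises,
# because it tries to gunzip plain JSON bytes.
fixed_headers = strip_stale_content_encoding(stale_headers, already_decoded)
print(decode_body(fixed_headers, already_decoded))
```

A header that still matches its body is left alone, so a response that really is compressed continues to be decoded exactly once.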
📊 No improvement without measurement: monitoring tells you "what happened," but evals tell you "how good it was." I built an evaluation platform that combines pytest, for local testing linked to tracing, with Promptfoo, for comparing models. Baseline accuracy is now 92% across 12 complex test cases. Every change is measured immediately against that benchmark. 🎯
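A baseline gate of the kind described above might look like the following pytest-style sketch. It is not the author's platform. It assumes the 92% figure corresponds to 11 of 12 cases passing (11/12 ≈ 91.7%), and the names `accuracy`, `meets_baseline`, and the example results are hypothetical.

```python
# Assumption: the post's "92% for 12 test cases" is read as 11 of 12 passing.
BASELINE_ACCURACY = 11 / 12


def accuracy(results: list[bool]) -> float:
    """Fraction of eval cases that passed."""
    return sum(results) / len(results)


def meets_baseline(results: list[bool], baseline: float = BASELINE_ACCURACY) -> bool:
    """True when a new run is at least as accurate as the stored baseline."""
    # Small tolerance so floating-point rounding never fails an equal score.
    return accuracy(results) + 1e-9 >= baseline


def test_no_regression_against_baseline():
    # Hypothetical outcome of a new run: one failure out of twelve cases.
    new_run = [True] * 11 + [False]
    assert meets_baseline(new_run)


def test_regression_is_caught():
    # Two failures out of twelve drops below the baseline and must be flagged.
    worse_run = [True] * 10 + [False] * 2
    assert not meets_baseline(worse_run)
```

Saved as a `test_*.py` file, this runs under plain `pytest`, so a prompt or model change that lowers the score fails the suite instead of slipping through.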
Jun 9
How do you know a prompt change actually improved your AI system? In her latest blog post, Keren Finkelstein shares how she used Promptfoo to build AI evaluation workflows, catch regressions, and measure output quality with confidence. tkl.to/tikal-testing-ai-my-j…
I'm happy to share something I built: ai-blackteam.ai-evals.worker…

It's an open-source framework that red teams any AI model's safety with a single command.

What's inside:
- 1,020 attack techniques across 61 categories
- 7 adaptive jailbreak generators (PAIR, TAP, AutoDAN, Crescendo, and more)
- 19 public benchmarks (HarmBench, AdvBench, WMDP...)
- 17 model providers behind one interface

It scores every response (blocked, partial, or bypassed) and rolls the results into scorecards mapped to OWASP LLM Top 10, OWASP Agentic Top 10, MITRE ATLAS, EU AI Act, and NIST AI RMF. CI-ready too: exit codes, plus SARIF, Promptfoo, and garak export.

163 million attack surface (honestly)

It is NOT 163M hand-written attacks. The building blocks are small:
- 1,020 attack techniques (built into the framework)
- 10,662 prompts loaded from 19 public benchmarks (downloaded from their official sources, not bundled)

Those mix across 5 axes: 28 harm categories, 4 difficulty levels, 17 disguises, 10 languages, and which technique is applied.

One thing I care about: this is a measurement tool, not a how-to. It shows you where a model is weak so it can be fixed. No attack prompts or harmful outputs are published, only the numbers and the methodology.

Try it: pip install ai-blackteam
Docs and full catalog: ai-blackteam.ai-evals.worker…

This is only the beginning. I'd love your feedback. @phuguo @mikeyk @AlexTamkin @GregFeingold @catherineols @AvitalBalwit @DanielaAmodei @elder_plinius