Wes Roth

Pinned Tweet

Wes Roth

@WesRoth

May 28

opus 4.8 not off to a great start on Vending Bench. Anthropic said "honesty" was one of the big improvements with opus 4.8, so more honest = sucks at business? yikes

Andon Labs

@andonlabs

May 28

Learnings from testing Claude Opus 4.8:
> Much worse than Opus 4.7 and GPT 5.5 on Vending Bench
> More aligned than previous Claude models (Opus 4.6 and Mythos)
> Also worse on Blueprint-Bench
> Scared of getting caught
> Max reasoning is not the best reasoning effort

Wes Roth

@WesRoth

Jun 12

Liquid AI added Ion Stoica to its Advisory Council as a strategic member. Stoica is a UC Berkeley computer science professor and co-founder of Databricks, Anyscale, and Arena, bringing deep experience in distributed systems, AI infrastructure, and scalable computing.

Liquid AI

@liquidai

Jun 11

We are proud to announce that Ion Stoica (@istoica05) co-founder of @databricks, @anyscalecompute, and @arena, and UC Berkeley Professor of Computer Science, has joined Liquid AI as a strategic member of our Advisory Council. Ion will guide us on our growth journey as we build the efficient AI infrastructure and platform for a hardware-aware, physical AI future.

Wes Roth

@WesRoth

Jun 12

Replit launched Custom Instructions and Skills for Replit Agent, giving users a way to teach the agent their project conventions, preferences, and workflows. The update helps Replit Agent remember how users want projects structured, how brands should be represented, and what rules should apply across future builds.

Replit ⠕

@Replit

Jun 11

AI agents are powerful, but they don’t remember your preferences. So you end up repeating instructions: how you structure projects, your brand guidelines. You can now teach Replit Agent your conventions with Custom Instructions and Skills. It'll take them into account for every project automatically.

Wes Roth

@WesRoth

Jun 12

Google’s Gemini Omni Flash is expected to become available through APIs for image-to-video, text-to-video, and video editing.

Logan Kilpatrick

@OfficialLoganK

Jun 11

Gemini Omni Flash is SOTA at image to video, text to video, and video editing : ) Excited to get this to developers in the API soon!

Wes Roth

@WesRoth

Jun 12

xAI launched the Grok Build Plugin Marketplace in beta, bringing built-in plugins directly into Grok Build terminal workflows. The marketplace lets developers install tools from partners like MongoDB, Vercel, Sentry, Cloudflare, and Chrome DevTools without leaving the terminal.

xAI

@xai

Jun 11

The Grok Build Plugin Marketplace is now in beta. Build with MongoDB, Vercel, Sentry, Cloudflare, and Chrome DevTools plugins from your terminal. Read more x.ai/news/grok-plugin-market…

Wes Roth

@WesRoth

Jun 12

Anthropic has an unusual leadership structure: CEO Dario Amodei reportedly has only one direct report, his chief of staff Avital Balwit. The rest of the executive team reports to Daniela Amodei, Anthropic’s president, who manages day-to-day operations and reports to the board. Dario focuses on strategy, research direction, culture, and long-term AI questions. Sam Altman reportedly has around half a dozen direct reports. Jensen Huang has said he has around 60. Dario says the setup lets him focus on the bigger picture. He reportedly spends a large amount of time talking to staff about Anthropic’s culture.

Wes Roth

@WesRoth

Jun 12

Frontier AI independence is expensive. Very expensive. Anthropic is reportedly pursuing its first data center leases and seeking financial backing from Google for the payments. Google is already deeply tied to Anthropic’s infrastructure strategy: it has invested in Anthropic, provides major cloud computing support, and is involved in Anthropic’s custom chip strategy. Anthropic is reportedly buying around $200B in computing power from Google.

Wes Roth

@WesRoth

Jun 12

Gemini Omni Flash is now #1 in Text-to-Video. It is also tied for #1 in Image-to-Video. In Text-to-Video, it improved by 158 points over Veo 3.1 at 1080p.

Arena.ai

@arena

Jun 11

Exciting news: Gemini Omni Flash is now #1 in the Video Arena (both Text-to-Video and Image-to-Video)! For Text-to-Video this is a massive 158 pt improvement over Veo 3.1 (1080p) and a large 61 pt lead over the next best model, Seedance 2.0. Congrats @GoogleDeepMind for this huge milestone!

Wes Roth

@WesRoth

Jun 12

A new benchmark called Agents’ Last Exam (ALE) is testing whether AI agents are truly ready for real digital labor-market work. The benchmark includes more than 1,500 expert-sourced tasks. The tasks span 55 occupations. Models tested include Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems. The benchmark was created by researchers who previously worked on major evals like MMLU, MATH, CyberGym, and ExploitGym. Current agents can solve some real professional tasks. But on ALE’s hardest tier, every tested frontier agent scored 0% success. That includes Fable 5.

Dawn Song

@dawnsongtweets

Jun 11

Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case?

Over the past many months, my group and collaborators have been building Agents' Last Exam (ALE), a benchmark designed to test exactly that claim on real digital labor-market work. My group and collaborators previously have created many of the benchmarks the field runs on, including MMLU, MATH, CyberGym, and ExploitGym.

Today, I'm excited to share Agents' Last Exam (ALE): a rolling benchmark that measures whether AI agents can actually perform economically valuable work across a broad range of real-world domains. With ALE, we evaluated Fable 5, GPT-5.5, Composer 2.5, and other frontier agent systems across more than 1,500 expert-sourced tasks spanning 55 occupations.

The result is both impressive and sobering. Today's agents can solve a meaningful fraction of professional tasks. But when we look at the hardest tasks, the ones requiring sustained reasoning, deep domain expertise, and reliable execution over long horizons, they are still far from human-level performance. On ALE's hardest tier, every frontier agent we tested, including Fable 5, achieved a 0% success rate.

The age of useful agents is here. The age of truly job-ready agents is not.

We hope Agents' Last Exam (ALE) will serve as a new guidepost and north star for developing agents capable of reliably performing economically valuable work across a broad range of domains. 🧵

Wes Roth

@WesRoth

Jun 12

Jeff Bezos’ AI startup Prometheus raised $12B in new funding at a roughly $41B valuation. The company is building AI tools for the physical economy, focused on helping engineers design and manufacture physical products faster. Prometheus launched in November with $6.2B in funding. Bezos serves as co-CEO alongside Vik Bajaj. The company is focused on AI for engineering, manufacturing, and drug design. Bezos says Prometheus is not building robots. The goal is to create tools that speed up the “invention loop.” Bezos described the vision as an artificial general engineer.

Wes Roth retweeted

Wes Roth

@WesRoth

Jun 11

OpenAI is updating the ChatGPT model picker to make model selection easier and more similar to the Codex experience. Users will keep access to the same main models and reasoning levels, except for the removal of thinking-light, which was used by less than 1% of paid users. The updated options include Instant, Medium, High, Extra High, and Pro.

Adam Fry

@adamhfry

Jun 10

We're making a small update to the model picker in ChatGPT! We know it's critical to a lot of people's work, and that we have a lot of paying users who care deeply about this one, so wanted to take some time to detail out the tweak.

One important point upfront – you'll still have access to the same models and reasoning levels, besides the removal of thinking-light (used by less than 1% of our paid users).

You'll see an updated list of options (similar to how Codex works):
- Instant
- Medium (Thinking-Standard)
- High (Thinking-Extended)
- Extra High (Thinking-Heavy) [for pro users]
- Pro (with option to choose Pro-Standard or Pro-Extended) [for pro users]

The intent is to make it easier to choose the balance of speed and effort that works best for your task. We also took into account community feedback to make sure:
a) Thinking-heavy is easily accessible
b) Pro standard and Pro extended are easily accessible
c) We clearly communicate these changes

Given that, here are some release notes detailing the updates - help.openai.com/chatgpt-rele….

Give it a try, it's rolling out today, and we're always open to feedback, we know it's important to get it right!

Wes Roth retweeted

Wes Roth

@WesRoth

Jun 11

Google released DiffusionGemma, an experimental open text-generation model under the Apache 2.0 license. The model explores a faster way to generate text by producing whole blocks in parallel instead of generating one token at a time.

Key details:
🔹 The model can generate 256 tokens in parallel.
🔹 Google says it can deliver up to a 4x speedup on standard accelerators.
🔹 It can reach 1,000 tokens per second on a single NVIDIA H100.
🔹 It can reach 700 tokens per second on an NVIDIA GeForce RTX 5090.
🔹 It is a 26B Mixture of Experts model that activates only 3.8B parameters during inference.
🔹 When quantized, it can fit within 18GB VRAM, making it usable on high-end consumer GPUs.

Google Gemma

@googlegemma

Jun 10

Meet DiffusionGemma! An experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license. Moving beyond sequential, token-by-token processes to generate entire blocks of text simultaneously. Here’s what’s new with DiffusionGemma: 👇

Wes Roth retweeted

Wes Roth

@WesRoth

Jun 11

Dario Amodei published a new essay titled “Policy on the AI Exponential,” arguing that AI is advancing much faster than governments and policy systems are built to handle. He argues that AI models have gone from weak coding ability to writing much of the code at major AI companies in only a few years. He says continued scaling could lead to “Powerful AI,” described as a “country of geniuses in a datacenter.”

Dario Amodei

@DarioAmodei

Jun 10

Today I'm publishing a new essay, Policy on the AI Exponential. AI is progressing extremely fast—much faster than the policy process was built to handle. The essay lays out where I think the technology is now, and the action needed to close the gap: darioamodei.com/post/policy-…

Wes Roth retweeted

Wes Roth

@WesRoth

Jun 11

NotebookLM will soon support textbooks as a source, expanding the types of materials users can bring into Google’s AI research and learning workspace.

🚨 AI News | TestingCatalog

@testingcatalog

Jun 10

GOOGLE 🔥: NotebookLM will soon support textbooks as a source! Google Play Books and Text Books, all there. h/t @thomas_gmry

Wes Roth retweeted

Wes Roth

@WesRoth

Jun 11

OpenAI is reportedly in talks to lease a proposed 10-gigawatt data center campus on federal land in Ohio, with possible financial backing from Nvidia. The project could become one of the largest AI infrastructure deals ever, with an estimated build cost of at least $500 billion.

Wes Roth

@WesRoth

Jun 12

Runway and Lionsgate are expanding their existing partnership. The new program will focus on developing original IP. Lionsgate is reportedly taking an equity stake in Runway. The companies also plan to create AI-generated short-form episodic projects. The work may involve Lionsgate’s existing film and TV library.

Runway

@runwayml

Jun 11

Today, we’re deepening our partnership with Lionsgate with a slate of new initiatives, including a joint development program focused on creating original IP together. Learn more at the link below.

Wes Roth

@WesRoth

Jun 12

Anthropic launched Claude Corps, a national fellowship program that connects early-career people with U.S. nonprofits. The program will train 1,000 people to use Claude and pay them to apply AI toward nonprofit missions.

Anthropic

@AnthropicAI

Jun 11

We’re launching Claude Corps, a national fellowship program matching people early in their careers with US nonprofits. We'll teach 1,000 people to use Claude, and pay them to use AI to advance their hosts’ missions. anthropic.com/claude-corps

Wes Roth

@WesRoth

Jun 12

Google is reportedly in talks with Samsung to manufacture part of its next-generation AI chip, codenamed Icefish. According to the report, Google plans to split production across multiple partners, with TSMC building the main compute die and Samsung potentially supplying a memory-related component using its advanced 2nm process.

Wes Roth

@WesRoth

Jun 12

OpenAI and Oracle are making it easier for Oracle Cloud customers to access OpenAI models and Codex through their existing Oracle cloud commitments. The update lets eligible customers use Oracle Universal Credits for OpenAI models and Codex without creating a separate purchasing path.

Adam.GPT

@TheRealAdamG

Jun 11

openai.com/index/openai-on-o… OpenAI 🤝 your Oracle cloud commitment

Wes Roth

@WesRoth

Jun 12

OpenAI is reportedly considering major price cuts for its AI products as competition with Anthropic intensifies. The company is weighing lower token prices, which would reduce the cost of using OpenAI models through APIs and other usage-based products.

Wes Roth

@WesRoth

Jun 12

OpenAI has hired Clint Gibler to help lead its cyber work alongside Michael Aiello, signaling a deeper push into AI-powered cybersecurity. Gibler says AI is changing both how software is written and how software is secured, as coding agents write more code and vulnerabilities are discovered and exploited faster.

Clint Gibler

@clintgibler

Jun 10

Career update: I’ve joined @OpenAI to lead Cyber with @michaelaiello.

Why I joined, and what we’ll be building:

It’s clear that AI is fundamentally changing how software is being written and secured. Coding agents are writing the majority of code for many developers, software is getting shipped more quickly, and vulnerabilities that were latent for 20 years are being discovered at a rapid pace. The time to bug discovery, and exploitation once discovered, are trending down (H/T @EppSecurity and @gadievron).

I believe we have an unparalleled opportunity to fundamentally 𝘪𝘮𝘱𝘳𝘰𝘷𝘦 cybersecurity in ways that were previously impossible. (H/T @bubblewire’s BSidesSF keynote on reasons for optimism)

Over 6 years at @Semgrep, I had the privilege of working with an amazing team building what has become the most popular open source security code scanning tool in the world, that many companies have built their application security program around.

Now, at @OpenAI, I’m thrilled to be a part of a company helping shape how software is written, and how security work gets done. It is a massive opportunity, and responsibility, and I don’t take that lightly.

Here are my current thoughts about where things are headed:

𝐑𝐞𝐬𝐢𝐥𝐢𝐞𝐧𝐭 𝐛𝐲 𝐝𝐞𝐬𝐢𝐠𝐧. Defenders are not going to win playing bug whack-a-mole. We need to systematically eliminate classes of vulnerabilities, via generating secure code and streamlining the detect → validate → fix process.

𝐀𝐮𝐠𝐦𝐞𝐧𝐭 𝐚𝐧𝐝 𝐞𝐦𝐩𝐨𝐰𝐞𝐫 𝐩𝐞𝐨𝐩𝐥𝐞. We should build models and tools that give defenders “superpowers,” enabling them to be more ambitious in the scope they tackle, shift from being reactive to proactive, and allow them to automate the drudgery so they can focus on the highest leverage work.

𝐒𝐞𝐜𝐮𝐫𝐞 𝐭𝐡𝐞 𝐜𝐨𝐦𝐦𝐨𝐧𝐬. The world runs on open source software. OpenAI has already spent $Ms finding and patching vulnerabilities in the most popular and widely run software, including browsers, operating systems, and core libraries. More on this soon. We’re also working on helping secure critical infrastructure.

𝐂𝐨𝐦𝐦𝐮𝐧𝐢𝐭𝐲 𝐚𝐧𝐝 𝐩𝐚𝐫𝐭𝐧𝐞𝐫𝐬. Securing the world is a community effort. I’m looking forward to partnering with cybersecurity vendors, researchers, practitioners, governments, and more to do together what we can’t do alone.

𝐓𝐢𝐦𝐞 𝐭𝐨 𝐛𝐮𝐢𝐥𝐝. Tactically, here are some domains I’m excited about:
- Finding, validating, and reliably patching software vulnerabilities at scale.
- Eliminating classes of vulnerabilities and making software resilient by design.
- Giving broad access to the best cyber models to empower defenders, not just to a select few.
- Creating and sharing Skills and playbooks that help in many security domains.
- Building platforms that enable defenders to easily orchestrate security work.
- Making enterprise agents safe and reliable.

Time to build 😎

What would help you most? What should we build? Let me know.
