shaped

shaped

@shaped

May 29

Claude Opus 4.7 solves its first ProgramBench task 👀 After creating a 663-word CLAUDE.md, we increased the solve percentage by 1.1 points for Claude Opus 4.7 running in Claude Code across 10 ProgramBench tasks, and actually solved one of them. You can replicate our results here: github.com/parsewave/Program… Big thanks to @KLieret @jyangballin for releasing such a forward-thinking benchmark!

GitHub - parsewave/ProgramBench-experiments

Contribute to parsewave/ProgramBench-experiments development by creating an account on GitHub.

github.com

125

emot-sun.gif industries

shaped retweeted

emot-sun.gif industries

@jjpcodes

May 15

singapore eats part one; hainan chicken rice; insane buffet of crab, skewers, noodles, chicken; you name it, at Newton, food coma afterwards

3,572

Vincent Koc

shaped retweeted

Vincent Koc

@vincent_koc

May 4

For my eval-maxxing nerds out there, good friends of mine are running a series called "strange evals", you can benchmaxx now on anything. If in SF swing by! luma.com/lvqbs1mo

Strange Evals - VLMs · Luma

a paper reading club but we deep dive into benchmarks ---- Each session we dive deep on a related battery of widely cited benchmarks so we can develop an…

luma.com

5,206

shaped

shaped

@shaped

Apr 9

Happy to have been working with Meta for the past 7 months, and to be seeing the fruits of their labor. Great release!

AI at Meta

@AIatMeta

Apr 8

Introducing Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration. Muse Spark is available today at meta.ai and the Meta AI app. We’re also making it available in private preview via API to select partners, and we hope to open-source future versions of the model. Learn more: go.meta.me/43ea00

186

Factory

shaped retweeted

Factory

@FactoryAI

Apr 2

No major benchmark is designed for COBOL, Fortran, or Assembly - the languages powering trillions in transactions and infrastructure that must be modernized or risk catastrophic failure. We built Legacy-Bench to measure frontier agents on the code the world actually runs on.

356

52,966

Ivan Bercovich

shaped retweeted

Ivan Bercovich

@neversupervised

Mar 21

x.com/i/article/203545140290…

101

82,939

shaped

shaped

@shaped

Feb 26

The real-time web search capability is Nano Banana 2's secret sauce. Midjourney and DALL-E train once and call it done. NB2 pulls from the live internet every time. The outputs reflect what's happening now, not what the world looked like months ago. Pretty genius. That image-sequencing demo is pretty great. Each frame only sees the previous frame, so there's no cheating, no outside reference like a world model would have. Yet the visuals stay consistent. That's the model understanding cause and effect in visual space. Coherent video generation is the natural next step, and it's closer than people think. Also pretty nice to see them roll it out all at the same time: API, platform, AI Studio. Looks like only text model releases are somewhat staggered at Google. Really good launch.

Demis Hassabis

@demishassabis

Feb 26

Nano Banana 2 is our new faster and better SOTA image generation & editing model! It uses Gemini’s amazing world understanding and grabs real-time info w/ search to create higher quality outputs. Available in @GeminiApp, @GoogleAIStudio, @FlowbyGoogle, Search & Vertex - enjoy!

332

shaped

shaped

@shaped

Feb 20

Horizons are LITERALLY too long to METR at this point, look at those error margins!

Karel

@KarelDoostrlnck

Jan 15

Horizons too long to METR

676

shaped

shaped

@shaped

Feb 20

Claude is beginning to annoy me a lot. Their web UI forces the model to create docx files using JS whenever I ask for some structured output for moderately deep research. Is this a way to rack up more input tokens and wear out my 5-hour limits much faster?

228

shaped

shaped

@shaped

Feb 19

Unbelievable that we are actually approaching saturation on both ARC AGI 1 and 2.

ARC Prize

@arcprize

Feb 19

Gemini 3.1 Pro on ARC-AGI Semi-Private Eval @GoogleDeepMind - ARC-AGI-1: 98%, $0.52/task - ARC-AGI-2: 77%, $0.96/task Gemini to push the Pareto Frontier of performance and efficiency

230

shaped

shaped

@shaped

Feb 19

I think we should really stop taking SVG tests as a benchmark in the first place. This seems easy to benchmaxx and doesn't denote any sort of real life performance whatsoever

Angel 🌼

@Angaisb_

Feb 19

This is so cool. Can't believe this jump in capabilities is just a .1 update

0:27

213

shaped

shaped

@shaped

Feb 14

AI is going to hit the finance world like a freight train

Kimi Product

@KimiProduct

Feb 14

One-shot McKinsey-grade industry report by Kimi K2.5 Agent. One Prompt = 14-Page Word file with consulting-level data visualization, technical heatmaps, and strategic frameworks. If your work requires competitive analysis with professional formatting, Kimi has you covered. kimi.com/chat/19c5c0ad-f2b2-…

335

shaped

shaped

@shaped

Feb 13

pack it up bro

Ross Barkan

@RossBarkan

Feb 13

You can ask one question: does AI have a business model? It's not a fun answer.

273

shaped

shaped

@shaped

Feb 13

The Turing test is just undercover customer support?!

This tweet is unavailable

243

shaped

shaped

@shaped

Feb 13

My current development stack:
- OpenCode (the best harness ever)
- Kimi K2.5 on the moderato code plan, incredibly capable model
- OpenClaw (for remote server management and running ralph)
- ralph-manager skill for openclaw we made internally which spawns ralph instances with nice monitoring
- tmux for terminal multiplexing
- nvim for making code edits if needed
- termux for accessing my running tmux sessions from my mobile phone
- tailscale to connect to my devices remotely from anywhere with ease
The fact that this is even possible right now is crazy. Long live open source man

259

shaped

shaped

@shaped

Feb 12

what

Nikita Bier

@nikitabier

Feb 12

Replying to @DCinvestor @X

We don't have the capacity to support more than two colors right now. But feedback noted: we are looking into lightening the black on web.

141

shaped

shaped

@shaped

Feb 11

It's weird seeing a lot of "uncollapsible" systems collapse with the advent of AI. And it's all thanks to the volume at which content is coming out, not AI itself. Social media algorithms? Destroyed. "Slop" is flooding social media at rates never seen before. We can't tell who's a person, who's an agent and who's a bot. Likes and shares no longer tell us whether content was actually authentic and valuable. Copyright and IP laws? Cooked. At the rate at which violations are coming out thanks to image and video models, they are simply unenforceable now. I don't think this is an AI problem. These systems banked on a low user base to function. Likes and shares worked because there were just a few people posting on the platform. Copyright laws worked because only a few people could violate them, and they were largely enforceable at that scale. This means we shouldn't try to abolish AI to make these systems work. It means we need to work around it and set up robust systems that don't break under volume.

129

shaped

shaped

@shaped

Feb 11

GLM 5 is crazy on benchmarks. HLE even with tools is hard to benchmaxx on, and still they have the highest score in there. Did not expect it to be this good on Terminal bench 2.0 at all, a monster of a model for sure. China is cooking.

311

shaped

shaped

@shaped

Feb 10

"AI hasn't made anything novel" "AI hasn't innovated anything" "AI is a stochastic parrot" AI's pushing the barriers of human knowledge at bleeding-edge fringes like this one. You're just not paying attention

Demis Hassabis

@demishassabis

Feb 10

The drug design engine we’re building at @IsomorphicLabs is extending the SOTA further across key benchmarks, showing huge progress in accuracy and capabilities critical for in-silico drug discovery. Incredible work from @maxjaderberg and the entire team at Isomorphic Labs!

173

shaped

shaped

@shaped

Feb 7

Opus 4.6 is unreasonably good at resolving merge conflicts. I know the meta is Codex 5.3 rn, but just putting this out there.

234