Joined September 2016
304 Photos and videos
Pinned Tweet
Most agent frameworks are LLM-first: the conversation loop is the core, tools are attached to it, rules layer on top, and logging is bolted on at the end for observability. State is persisted as retrievable "memory."
Point 2 is a regulator's nightmare. We will see if this happens faster than the AI Act committees slapping a Corrective Action Plan on non-compliant enterprises over data management best practices. This is not a last-mile AI problem. This is the whole bottleneck.
the exciting future of “building the hill climbing machine” 🚀 🧗
1. encode domain expertise into a v0 agent and ship it -> this will never be perfect to start but we need to observe agent behavior to understand where/how we can improve
2. Have a trusted, robust system to collect and centralize trace data at scale across your teams (and company broadly). Traces are the lifeblood of agent improvement.
3. Design and deploy efficient methods of mining traces for patterns. “No trace left unread or understood”. It’s difficult to impossible to know exactly how well an agent will do on a task a priori; the best bet is to be data driven by running it and measuring it.
4. Turn information from traces on failures into evals and environments that you can hill climb against. Every issue we mine in agent traces is a direct signal we can improve with.
5. Use practical recipes to systematically improve agents. For example: harness engineering -> fine tuning (RL, SFT) -> followed by more harness engineering
6. Pick an optimization target and hill climb against it ruthlessly with data. If the main metric is a cost-perf tradeoff on a narrow task, then consider tuning frontier open models, embracing a multi-model harness, and designing evals that accurately reflect which models should do which sub-tasks
7. Grow your data competency for agent improvement. Continual Learning imo is in large part a data mining problem. The more you can design systems to capture agent data, design experiments around it, and have feedback loops between humans and evals that you fit against…the better your agents will be on time
The goal is to build the human machine system that helps agents improve over time across any task that a business cares about
Every team should have easy access to these abilities, we want to make it as easy as possible for every builder to get started on this journey 🤝
same. I am a @tan_stack and @CloudflareDev user for my websites.
It’s finally time. Today, I move my last site off of Vercel Nextjs and onto @Cloudflare with Vite. It’s been fun to actually enjoy building again
Evals are never-ending. Because evals are a loop, not a score.
Here are some principles you can infer from @satyanadella's paragraph:
- There will be a better model tomorrow.
- Prompts are great for building POCs, but terrible at specifying system behaviors.
- To switch models easily, you need good evals and a system for generating a new prompt for a given model and holding it accountable.
- With such a system, you can almost certainly use a model orders of magnitude faster and cheaper than frontier models.
- Evals are THE asset for all enterprises.
- Evals should never stop growing.
🤔
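A minimal sketch of the "evals let you switch models" point above, assuming each model is just a Python callable from prompt to text and each eval case is a prompt plus a check function. The names `pass_rate`, `cheapest_passing`, and the cost numbers are hypothetical and only illustrate the idea; they are not from the quoted tweet or any specific eval framework.

```python
from typing import Callable, Optional

# An eval case: (prompt, check) where check(output) says whether the output is acceptable.
EvalCase = tuple[str, Callable[[str], bool]]
# A candidate: (model function, relative cost per call). Both are illustrative assumptions.
Candidate = tuple[Callable[[str], str], float]


def pass_rate(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check accepts the model's output."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def cheapest_passing(
    candidates: dict[str, Candidate],
    cases: list[EvalCase],
    threshold: float,
) -> Optional[str]:
    """Return the name of the cheapest candidate whose pass rate meets the threshold."""
    ok = [
        (cost, name)
        for name, (model, cost) in candidates.items()
        if pass_rate(model, cases) >= threshold
    ]
    return min(ok)[1] if ok else None
```

With a growing list of cases, swapping in a new model becomes a matter of adding it to `candidates` and re-running the selection.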
if there are no regulatory pressures, a human is not meant to be in the loop
Vinod Khosla on why he does not really prefer "AI co-pilots". Because he thinks "humans get in the way of co-pilots", which slows everything down and blocks real change. He says workers like accountants and programmers do not actually want co-pilots, because they feel their jobs are at risk and then resist using the tool properly. So instead of “helping” them, he prefers building AI that fully does the job itself, like a complete software engineer. He expects that by 2030, most of these roles will be pure AI workers, not human co-pilots. --- From 'Corgi Insurance' YT channel (link in comment)
This is what River by Shopify does. And here is my open-source implementation of it
Jun 12
this is exactly what we’re building at @leveragecpu: a multiplayer AI computer where company communication, files, code changes, agent runs, tool calls, decisions, and memory live in one governed workspace. the company should be readable and executable by humans and agents at the same time
That is exactly how my hermes cron jobs run. Set it up once with a smart model. Dry run and wet run multiple times. Look for all edge cases and stuff like auth issues and regression threats. Run it like those n8n type automations and profit. I use agents as SREs to fix things after the fact.
guy on Reddit with 10 years of engineering experience just shared the one thing he'd teach every vibe coder first

And it'll save you thousands in AI costs

Most people using Claude Code use it the expensive way. They call the AI every time the tool runs. Every run burns tokens. Every token costs money.

His advice: flip it. Use Claude Code to BUILD the tool once. Then run it forever without spending a single token.

Simple example. You want to check a website daily for updates.

The expensive way: have an LLM search the site every day. Burns tokens every single time.

The free way: use Claude Code to write a script that scrapes the page and alerts you if anything changed. Build it once. Runs forever. Zero tokens.

Spend tokens once to build it. Run it for free forever. That's it. That's the insight most vibe coders are missing.
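For illustration, the "build it once" script from the website example could look roughly like the sketch below. It assumes that "changed" means the raw page bytes hash differently from the last run and that state lives in a small JSON file; the function names and the state-file layout are hypothetical, not taken from the Reddit post. Real pages often contain timestamps or ads, so a practical version would probably hash only the relevant part of the page.

```python
import hashlib
import json
from pathlib import Path
from urllib.request import urlopen


def fingerprint(content: bytes) -> str:
    """Stable fingerprint of the page content (SHA-256 hex digest)."""
    return hashlib.sha256(content).hexdigest()


def has_changed(key: str, content: bytes, state_file: Path) -> bool:
    """Compare content against the stored fingerprint for key, then update the store.

    Returns False on the first run for a key, since there is nothing to compare against.
    """
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    old = state.get(key)
    new = fingerprint(content)
    state[key] = new
    state_file.write_text(json.dumps(state))
    return old is not None and old != new


def check_url(url: str, state_file: Path) -> bool:
    """Fetch url and report whether it changed since the previous check."""
    with urlopen(url, timeout=30) as response:
        body = response.read()
    return has_changed(url, body, state_file)
```

Scheduling `check_url` from cron and sending an alert when it returns `True` gives the daily check without any model calls at run time.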
ucalyptus retweeted
This is quite interesting! I made the exact same recipe 8 months ago, but never expected someone to launch this commercially TLDR; you can get Fable performance with Opus using 2x Opus tokens
Introducing the Fusion API, the smartest compound model in the market. Fusion achieves Fable-level intelligence at half the price. How it works 👇
ucalyptus retweeted
Increasingly thinking about agent design this way myself these days. This is a terrific articulation
ucalyptus retweeted
Replying to @arunabh_D
Accuracy was never an intelligence problem: it's a coordination problem. A pack of wolves beats a lone tiger every time, and costs less. We should stop bringing a shiny sword to cut a birthday cake. My agents have nailed a time-tested NASA Work-Breakdown pattern for work.
L take. Regulators don't like the sound of permissionless, across-the-board access to company information. Most financial institutions have to work with the regulators. Takes like these reveal how little people in the tech industry understand enterprises.
In twelve months, EVERY company will be running a Company Brain. The teams who build it this year will spend the next year compounding. Everyone else is going to play catch up. Here's what it actually is. You connect your Slack, your GitHub, HubSpot, all your tools into one intelligence layer, then build the org chart around it: a main brain up top, a fleet commander running the agent fleet, specialist sub-agents handling execution. The reason it works is change management basically disappears. Your team already lives in Slack. You're just adding agents to the room they're already in. You NEED to start building yours now. In a year this will stop being an advantage and will become table stakes.
ucalyptus retweeted
Been saying this for years, event-driven log-based systems and Agent loops are a match made in heaven. You get time travel debugging, replayability, forking, telemetry, etc. all for free if you adopt this architecture from the start.
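A toy sketch of why an append-only event log gives replay, time travel, and forking almost for free, assuming agent state is derived purely by folding over events. The `Event`, `EventLog`, `apply_event`, and `replay` names and the two event kinds are hypothetical, chosen only to illustrate the architecture the tweet describes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str
    payload: dict


@dataclass
class EventLog:
    events: list = field(default_factory=list)

    def append(self, kind: str, **payload) -> Event:
        """Record a new event at the end of the log; existing events are never edited."""
        event = Event(kind, payload)
        self.events.append(event)
        return event

    def fork(self, upto: int) -> "EventLog":
        """New log sharing the first upto events; the original log is left untouched."""
        return EventLog(list(self.events[:upto]))


def apply_event(state: dict, event: Event) -> dict:
    """Pure reducer: returns a new state instead of mutating the old one."""
    if event.kind == "user_message":
        return {**state, "messages": state["messages"] + [event.payload["text"]]}
    if event.kind == "tool_result":
        results = {**state["results"], event.payload["tool"]: event.payload["value"]}
        return {**state, "results": results}
    return state  # unknown event kinds are ignored


def replay(log: EventLog, upto: Optional[int] = None) -> dict:
    """Rebuild state from scratch; passing upto rebuilds the state at an earlier point."""
    state = {"messages": [], "results": {}}
    for event in log.events[:upto]:
        state = apply_event(state, event)
    return state
```

Because state is always recomputed from the log, `replay(log, upto=n)` acts as time-travel debugging and `log.fork(n)` lets an agent run explore an alternative continuation from step `n`.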
ucalyptus retweeted
European culture cannot stomach the actual conditions required to build. Giving very young, very cracked people absurd resources, freedom, speed, and trust. They’ll form a committee, write a framework, host a summit, add 50 oversight layers, and that will be it.
it's not lack of compute that's the issue. it's that in Europe, it's unthinkable to pay a guy in his mid 20s $600k salary and give him resources and freedom to train models without having oversight by a committee of gerontocratic professors who don't keep up with the research
ucalyptus retweeted
What comes next: The govt will roll out an emergency citizenship program for any foreign-born employee working in a lab contingent on them immediately moving to the U.S. Everyone will be heavily vetted via the same screening construct already utilized by the defense primes. Google will have to move the entirety of DeepMind to the U.S. and fire whoever refuses to relocate. People will gleefully assume Demis will just start his own UK lab instead before realizing the next step is the US is about to gut foreign “unmonitored” access to compute. You can pull a LeCun but you won’t have sufficient compute to do shit. Green cards will be given to family members too. Foreign govts will freak out when they realize what is happening. We are gatekeeping and hoarding intelligence preemptively. Why? Because by GPT 7 France will be like “oh you just destroyed our services sector we are going to tax the labs to pay for the necessary benefits to prevent riots” and it’s a lot easier to do that if labs have critical employees based in Paris. Ditto for every other foreign nation. Anyone acting like this is surprising is simply incapable of thinking four steps ahead. We are going to see industries nuked overnight. There will be civil unrest. The only way to navigate that is to tax and gatekeep. The only way you can tax something is if it lives in your borders. We are repatriating exposure points preemptively. Compute gatekeeping comes next. 🫡
ucalyptus retweeted
Logs are all you need
ucalyptus retweeted
According to Grok, Andrej Karpathy is an EB-1 extraordinary ability green card recipient, not a US citizen. Thus under these new restrictions he is not permitted to use, or work on, Mythos 5 or Fable 5 as of 5:21pm tonight.
Replying to @AndrewCurran_
From the statement: 'The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, 𝘪𝘯𝘤𝘭𝘶𝘥𝘪𝘯𝘨 𝘧𝘰𝘳𝘦𝘪𝘨𝘯 𝘯𝘢𝘵𝘪𝘰𝘯𝘢𝘭 𝘈𝘯𝘵𝘩𝘳𝘰𝘱𝘪𝘤 𝘦𝘮𝘱𝘭𝘰𝘺𝘦𝘦𝘴. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance.'
The agent will be the durable event log, not the model runtime. @ishaansehgal is a must follow
while we live in an EA permanent overclass, we have @xai that is e/acc
some thoughts on the difference between oai and anth
1) i think the most important difference between openai and anthropic is cultural; basically, openai is faang while anthropic is ea
2) openai doesn't really believe anything; it's a lot of tech workers from meta, etc... most of the employees are not particularly agi-pilled
3) anthropic has roots in the ea ecosystem; it employs holden karnofsky, joe carlsmith, amanda askell, etc... all deeply connected to effective altruism
4) the founders essentially took an ea pledge; the series a was led by sbf, the series b was led by jaan tallinn (ea or ea adjacent)
5) this gives anthropic a core set of values and an orientation towards the future; it inherits effective altruism's focus on forecasting and ai safety
6) and, it gives it a set of institutions to recruit from and to reinforce its values: open philanthropy, mats, metr, constellation, redwood research, etc...
7) not that these organizations are all ea in particular, but they are part of a larger ea and ea funded ecosystem that curates a particular worldview and set of attitudes
8) openai has no equivalent; there is no concentrated place from which it draws talent or from which it draws a coherent worldview; big tech isn't exactly a reservoir of ideas
9) i think this may be one of the reasons that openai fumbled with their focus on ads and sora rather than focusing on automating the office worker; they were not future focused enough
10) and, it's one of the reasons why it is easier for anthropic to retain talent, there are a lot of voices telling the people working there they are doing the right thing and the most important thing
11) but, believing in nothing in particular, other than the zeitgeist, has its benefits, it can be easier to make deals (see openai's dod deal); and believing in things can make it harder
12) and, this tendency may make openai more responsive to the general political milieu and will probably make it more likely to try to follow the democratic policy discussions rather than lead them
13) and, it is an open question whether we want companies to lead policy making on ai in democratic countries or whether it is better to allow broader society to do this