BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

8 Photos and videos

Tweets

Pinned Tweet

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 1

We are live on @ycombinator ! Building the monitoring and learning layer for long-running agents.

Y Combinator

@ycombinator

Jun 1

.@BentoLabsAI is the monitoring and learning layer for long-running agents. Their learning layer gives agents model-jump gains: Sonnet 4.5 went 42.2%→52.4% on TB2 (Internal). Congrats on the launch, @Abhinavv_soni & @kacppian! ycombinator.com/launches/Qcw…

1:58

878

Abhinav Soni

BentoLabs AI (YC P26) retweeted

Abhinav Soni

@Abhinavv_soni

Jun 13

Hot take A model upgrade isn't a library bump. It's a personality transplant. Stop shipping them same-day.

109

BentoLabs AI (YC P26)

BentoLabs AI (YC P26) retweeted

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 12

Did you see 'Harness Engineering' on your feed too? A term that suddenly blew up everywhere. It didn't have a name until February, now Hashimoto has posted on it and OpenAI has proven it at a million lines. So what does the work actually look like? Reading a hundred trajectories for patterns, designing tool interfaces as prompt fragments, encoding every fix where the agent inherits it. Some production bugs get solved by nothing more than renaming a tool. We broke down the full discipline on the blog, go have a read here: bentolabs.ai/blog/harness-en…

Harness Engineering: the underrated discipline of production AI

Harness engineering is the next evolution after prompt and context engineering — and most enterprise teams still don't have anyone whose dedicated job it is.

bentolabs.ai

129

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 13

When your agent regresses, what/who do you reach for first?

50% The traces

0% The recent changes

0% Teammates

50% Honestly, no idea

2 votes • Final results

140

Abhinav Soni

BentoLabs AI (YC P26) retweeted

Abhinav Soni

@Abhinavv_soni

Jun 12

One of the most important thing I couldn't have justified before: trust your instinct on prompt and agent design. You'll feel something is wrong in a prompt or a trajectory long before you can explain why. That feeling is real signal. It's built from reading hundreds of runs, and it fires before the analysis catches up. Don't dismiss it because you can't name it yet. Note it, watch for it again, and the explanation will come. The instinct comes first. The words come after.

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 12

Agents fail silently in production Traditional monitoring misses agent-specific issues Teams debug same problems repeatedly No learning from the runs ✅ BentoLabs fixes all of this. Ready to stop debugging blind? → bentolabs.ai

Bento — Self-learning infrastructure for AI agents

Monitor what runs. Improve what fails. Evaluate what changes.

bentolabs.ai

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 11

The problem with patching long-running agents through a chat session: Three weeks later, a similar pattern recurs. Nobody remembers what was tried last month. Six months later, an auditor asks why the agent handles this category of cases this way. Nobody can produce a coherent answer. You can vibe-code a fix. You cannot vibe-code an improvement program. bentolabs.ai

Bento — Self-learning infrastructure for AI agents

Monitor what runs. Improve what fails. Evaluate what changes.

bentolabs.ai

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 11

Catch the decision, not the symptom. By the time the output is wrong, the agent made the wrong call several steps earlier. Watch the trajectory, not just the final answer.

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 10

The scariest agent bugs are the polite ones. No stack trace. Just an answer that is confidently a little bit wrong, a few thousand times, before anyone notices.

BentoLabs AI (YC P26)

BentoLabs AI (YC P26) retweeted

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 9

Hot take! Adding another model to your stack will not make your agent reliable. It will give you two things to debug and zero memory of either.

136

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 8

Friday night fix. Monday morning regression. Sound familiar? 30 traces in your clipboard. A brilliant fix. You shipped it. The 999 trajectories you never looked at had other plans. That's the problem. You can vibe-code a fix. You cannot vibe-code an improvement program. Full breakdown → bentolabs.ai/blog/vibe-codin…

224

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 6

What's your system for handling this in production?

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 5

Does your agent actually learn from its failures, or just log them?

0% It learns and improves

0% It logs, we fix manually

0% It logs, nobody looks

0% What learning

0 votes • Final results

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 5

Teams running agents in production, what is one thing you wish your monitoring could tell you that it currently cannot?

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 4

Why agents that work in staging often degrade in production? It's usually a diagnostic failure. Users use your agents in ways you can't even imagine, that results in failures that are even harder to catch and work on. Our framework helps you spot which layer is actually breaking. Read here 👇🏻 bentolabs.ai/blog/nature-vs-…

Nature vs. Nurture in AI agents: diagnose the layer that's actually breaking

AI agents that work in pilots often fail in production. Usually a diagnostic failure, not a capability one — how to find the layer that's breaking.

bentolabs.ai

221

BentoLabs AI (YC P26)

BentoLabs AI (YC P26)

@BentoLabsAI

Jun 2

We ran our recursive learning layer on Terminal-Bench 2.0. Same agent. Same model. Same harness. Same budget. The result: Claude Sonnet went from 42.2% → 52.4%. A 10.2 percentage-point lift, significant at p < 0.05, with a 13:3 task-level win/loss ratio (internal). The only variable was a learning layer. We wrote a full technical breakdown on what changed, why it worked, and what this means for production AI agents. Read it here 👇 bentolabs.ai/blog/tb2-recurs…

290