We spent the first year at @GentraceAI helping customers run traditional LLM evaluations. But we realized late last year that agents changed everything.
Agents don't just generate text; they take actions, make decisions, and interact across multiple steps.
Traditional "hallucination" checks can't evaluate whether your agent booked the right flight or resolved the customer’s support ticket correctly.
Old LLM observability tools were built for simple input/output prompts. They break in an agentic world.
Over the last 9 months, we rebuilt Gentrace from first principles 🔥 Here's our new approach:
- Chat with your AI trace data using our agent to discover what's actually breaking
- Describe problems in plain English
- Our agent creates custom AI-powered monitoring columns that catch these failures across all future traces
Example: Instead of checking "factual accuracy," you can now evaluate complex behaviors:
- "Figure out if a user is frustrated with my AI agent in the customer support chat"
- "Detect when my agent takes more than 5 steps to answer simple queries."
We stopped measuring outputs and started measuring outcomes. Watch @dougsafreno break down how to solve agent observability with Gentrace. 👇
Agents are significantly more powerful than standalone LLM calls. But debugging them is a nightmare.
You can trace their reasoning and tool use, but traces get huge and are impossible to parse.
To solve this, we spent the last several months building Gentrace for Agents, which puts our own agents to work on yours.
In Gentrace for Agents, you can:
• Chat with AI to debug agent traces
• Create smart monitoring columns
• Build out tailored evaluations
It’s like a giant AI-powered spreadsheet over your trace data, with a Cursor-style chat sidecar. If it sounds a little meta, it is, but it's very powerful in practice.
We recorded this video to show you how it works. Take a look, and let me know what you think: