Thank you, Professor @Zhou_Yu_AI and @bklsummithouse, for the AI Agents in Action: Industry × Academia Exchange!
@rebeccatqian, our CTO, was on a panel with Vinay Rao (Advisor at @AnthropicAI), @ShunyuYao12 (Research Scientist at @OpenAI), Robert Parker (Founder of Perceptix), and @alsuhr (Professor at @Berkeley_EECS) discussing AI agents.
These are some takeaways:
Transitive closure is needed to keep multi-step agents in check.
* Prompt engineering and guardrailing outputs are short-term fixes.
* AgentOS may be right around the corner.
Agent evaluation, like self-driving car testing, is dynamic, tool-rich, and failure-propagating.
* It is not a one-size-fits-all approach that evaluates the end output.
* The performance primitives are compute, quality, and cost.
Evaluations should happen at the workflow level, not at a static checkpoint.
* Agent processes are dynamic with dozens of failure modes.
* Companies want to score workflows across latency, safety, and e2e success.
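To make workflow-level scoring concrete, here is a minimal sketch in Python. It is not how any particular product works: the `Step` record, the `score_workflow` helper, and the latency budget are all hypothetical names chosen for illustration. It scores a whole trace on latency, safety, and e2e success, and reports the first failing step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step in an agent trace (hypothetical schema for illustration)."""
    name: str
    latency_s: float
    safe: bool  # did the step pass a safety check?
    ok: bool    # did the step itself succeed?


def score_workflow(steps, latency_budget_s):
    """Score the whole trace instead of only the final output."""
    total_latency = sum(s.latency_s for s in steps)
    # The first failing step, if any; later steps may inherit its error.
    first_failure = next((s.name for s in steps if not s.ok), None)
    return {
        "latency_s": total_latency,
        "within_budget": total_latency <= latency_budget_s,
        "safe": all(s.safe for s in steps),
        "e2e_success": first_failure is None,
        "first_failure": first_failure,
    }


# Toy trace: the final step reports success, but an earlier tool call failed.
trace = [
    Step("plan", 0.5, True, True),
    Step("search_tool", 1.5, True, False),
    Step("answer", 1.0, True, True),
]
report = score_workflow(trace, latency_budget_s=5.0)
print(report)
```

A check on the last step alone would pass this trace; scoring every step surfaces the failed tool call that the final answer was built on.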
Peer-to-peer protocols can improve safety standards and accelerate integrations.
* Agreements to accept or deny calls protect the integrity of the system.
* MCPs need brakes to ensure and promote the future growth of AI applications.
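One way to picture an accept-or-deny agreement is a simple allowlist gate in front of tool calls. The sketch below is purely illustrative: `ALLOWED_CALLS`, `gate_call`, and the agent and tool names are hypothetical, and this is not part of the MCP specification or any real protocol API.

```python
# Hypothetical agreement: which caller may invoke which tool.
ALLOWED_CALLS = {
    ("planner", "search"),
    ("planner", "calculator"),
}


def gate_call(caller: str, tool: str) -> str:
    """Return "accept" only for (caller, tool) pairs both sides agreed on."""
    return "accept" if (caller, tool) in ALLOWED_CALLS else "deny"


print(gate_call("planner", "search"))        # agreed pair
print(gate_call("planner", "delete_files"))  # not in the agreement
```

Denying by default means a new tool has to be added to the agreement explicitly before any agent can call it.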
Benchmarks on domain-specific datasets can bridge industry and academia gaps.
* Community benchmarks help democratize knowledge and accelerate domain-specific progress.
* Collaborations with practicing industry professionals have made it possible to create industry-standard, domain-grounded datasets, e.g., FinanceBench.
We're excited to continue the conversation on Agentic AI and for what's next!
You can read more about our work on agentic evaluation here:
patronus.ai/percival