Evidently AI (@EvidentlyAI)

9 Dec 2024

3️⃣ 2️⃣ 1️⃣ Our free course on LLM evaluations for AI product teams starts today!
🎥 7 days of byte-sized videos delivered to your inbox
⭐️ Certificate upon completion
👩‍💻 No coding skills required
👩‍🎓 500 students have signed up
You can still join the course 👇 evidentlyai.com/llm-evaluati…

Evidently AI

Apr 24

How Zalando builds a search quality assurance framework with LLM-as-a-judge: engineering.zalando.com/post…

Zalando Engineering Blog - Search Quality Assurance with AI as a Judge

Deep dive into how Zalando builds a search quality assurance framework with LLM-as-a-judge to evaluate the search quality at scale with high coverage and multi-language support.

engineering.zalando.com

Evidently AI

Apr 4

📌 In case you missed it
How to evaluate an AI agent? Follow the tutorial as we:
1️⃣ Build an AI agent,
2️⃣ Create a test dataset,
3️⃣ Assess responses and tool choice,
4️⃣ Track the agent’s behaviour.
Follow the tutorial from our LLM evals course: youtube.com/watch?v=9KMmadw7…

7. Tutorial: Building and evaluating an AI agent

Code example: https://github.com/evidentlyai/community-examples/blo...

Evidently AI

Apr 3

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Uber improves driver availability at airports: Estimated time-to-request model, Earnings-per-hour prediction, and Driver-deficit forecasting. uber.com/en-GB/blog/forecast…

Evidently AI - ML and LLM system design: 800 case studies

How do top companies apply AI? A database of 800 case studies from 150 companies with practical ML use cases, LLM applications, and learnings from designing ML and LLM systems.

Evidently AI

Apr 1

🦾 More AI agents aren’t always better. Google evaluated 180 agent setups and found multi-agent systems help with parallel tasks but can hurt sequential ones. The work also proposes a model to predict optimal agentic designs. research.google/blog/towards…

Evidently AI

Mar 28

📌 In case you missed it
Let’s test your RAG system! Follow the tutorial as we:
1️⃣ Build a RAG system,
2️⃣ Generate test data,
3️⃣ Evaluate answers for correctness and faithfulness.
Watch the tutorial from our LLM evals course: youtube.com/watch?v=jckp5R09…

6.2. Tutorial: Building and evaluating a RAG system

Code example: https://github.com/evidentlyai/community-examples/blo...

Evidently AI

Mar 27

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How GoDaddy built Lighthouse, an internal AI analytics platform: prompt engineering framework, model orchestration, solution architecture, and use cases. godaddy.com/resources/news/h…

Evidently AI retweeted

Nnenna 👩🏽‍💻✨

@nnennahacks

Mar 24

(policyNIM oss tool) preflight command is working. when I provide a coding task, it kicks off a search through indexed policies to determine which rules are relevant for implementation. @nvidia for embedding w/ @OpenAI @lancedb for vector storage. eval command is also working. using @EvidentlyAI for running eval suite.

Evidently AI

Mar 24

🚦 Meta’s “Agents Rule of Two”
According to Meta, AI agents should satisfy at most two of these conditions per session to reduce prompt-injection risk:
- Handle untrusted inputs
- Access sensitive data
- Change state / act externally
ai.meta.com/blog/practical-a…

Agents Rule of Two: A Practical Approach to AI Agent Security

We've developed the Agents Rule of Two. When this framework is followed, the severity of security risks is deterministically reduced.

ai.meta.com
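The rule quoted above (at most two of the three conditions per session) can be sketched as a simple check. This is a minimal illustration, not Meta's implementation: the `SessionCapabilities` class, its field names, and `satisfies_rule_of_two` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionCapabilities:
    """Hypothetical description of what an agent may do in one session."""
    handles_untrusted_inputs: bool
    accesses_sensitive_data: bool
    changes_state_or_acts_externally: bool


def satisfies_rule_of_two(caps: SessionCapabilities) -> bool:
    """Return True if at most two of the three conditions are enabled."""
    enabled = sum([
        caps.handles_untrusted_inputs,
        caps.accesses_sensitive_data,
        caps.changes_state_or_acts_externally,
    ])
    return enabled <= 2


# An agent that reads untrusted web pages and private data, but cannot act.
read_only_agent = SessionCapabilities(True, True, False)
# An agent with all three capabilities at once violates the rule.
unrestricted_agent = SessionCapabilities(True, True, True)

print(satisfies_rule_of_two(read_only_agent))
print(satisfies_rule_of_two(unrestricted_agent))
```

A check like this could run when a session is configured, so that a setup enabling all three conditions is rejected or routed to human review before the agent starts.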

Evidently AI

Mar 21

📌 In case you missed it
How do you know if your RAG works? You need to check:
✅ Can it find the right information?
✅ Is the final answer complete, relevant, and free of hallucinations?
Watch the intro to RAG evaluation from our LLM evals course: youtube.com/watch?v=qI2qQfOG…

6.1 How to evaluate a RAG system: methods and metrics

00:03 Intro
00:24 What is RAG?
01:03 How to evaluate RAG? Look at b...

Evidently AI

Mar 20

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How DoorDash improves its RecSys using LLMs to bridge behavioral silos in multi-vertical recommendations. careersatdoordash.com/blog/d…

Evidently AI

Mar 18

💭 Can AI systems introspect? Anthropic’s new research suggests Claude models can sometimes identify and describe their own internal states. It’s still unreliable, but marks a step toward more transparent AI reasoning. anthropic.com/research/intro…

Emergent introspective awareness in large language models

Research from Anthropic on the ability of large language models to introspect

anthropic.com

Evidently AI

Mar 14

📌 In case you missed it
Can LLMs write engaging tech tweets? Follow the tutorial as we:
1️⃣ Build a tweet generator,
2️⃣ Score its outputs with custom LLM judges,
3️⃣ Improve the results with prompt iteration.
Watch the tutorial from our LLM evals course: youtube.com/watch?v=KhkiM9C0…

5. Tutorial: Evaluating LLMs on content generation tasks. Tracing and...

Code example: https://github.com/evidentlyai/community-examples/blo...

Evidently AI

Mar 13

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Shopify transformed its product classification system from basic categorization to an AI-driven framework using Vision Language Models. shopify.engineering/evolutio…

Evidently AI

Mar 10

📚 Context is everything.
OpenAI shares how it built an in-house data agent that answers complex questions in minutes. It uses 6 layers of context:
- Table metadata
- Human annotations
- Codex enrichment
- Company knowledge
- Memory
- Runtime context
openai.com/index/inside-our-…

Inside OpenAI’s in-house data agent

How OpenAI built an in-house AI data agent that uses GPT-5, Codex, and memory to reason over massive datasets and deliver reliable insights in minutes.

openai.com

Evidently AI

Mar 7

📌 In case you missed it Are LLMs good for classification tasks? We built an LLM-based classifier for a travel support chatbot and compared its performance to a classic ML model. Watch the tutorial from our LLM evals course: youtube.com/watch?v=Gl2X_o99…

4. Tutorial: Evaluating LLMs on classification tasks

Code example: https://github.com/evidentlyai/community-examples/blo...

Evidently AI

Mar 6

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Wayfair built Wilma, a customer service agent copilot: workflow, prompt templates, and the copilot’s evolution. aboutwayfair.com/careers/tec…

Evidently AI

Mar 4

🤖 How to develop and deploy chatbots at scale? DoorDash shares how they created a simulation platform and evaluation flywheel, allowing them to test chatbots with fast feedback loops and without production risk. careersatdoordash.com/blog/d…

A Simulation and Evaluation Flywheel to Develop LLM Chatbots

DoorDash scales LLM chatbots with a simulator and evaluation flywheel to automate evals, reduce hallucinations, and unblock testing.

careersatdoordash.com

Evidently AI

Feb 28

📌 In case you missed it
How to create an LLM judge that aligns with human labels:
- Define criteria
- Create test dataset
- Run evaluation prompt to see if the judge aligns with your labels
- Evaluate the judge
Watch the video from our LLM evals course: youtube.com/watch?v=kP_aaFnX…

3. Tutorial: How to create an LLM judge and align with human labels

Example notebook: https://github.com/evidentlyai/community-examples...
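The "evaluate the judge" step can be sketched as a comparison between human labels and judge labels on the same test dataset. This is a minimal, assumption-laden illustration and not the code from the course notebook: the function names and the sample labels are made up for this example, and raw agreement plus Cohen's kappa are just two common ways to measure alignment.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Share of examples where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Made-up labels for six responses, for illustration only.
human_labels = ["good", "bad", "good", "good", "bad", "bad"]
judge_labels = ["good", "bad", "bad", "good", "bad", "good"]

print(f"agreement: {agreement_rate(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

If agreement is low, the usual next move is to revise the criteria or the evaluation prompt and rerun the comparison on the same labeled set.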

Evidently AI

Feb 27

A Friday ML use case 📕 📚 From the database of 800 ML & LLM systems: cutt.ly/SwrZWL0g How Wayfair uses AI agents to automatically triage support tickets: agents vs. workflows and a hybrid approach. aboutwayfair.com/careers/tec…

Evidently AI