How do we train and evaluate Search Agents?
I am SUPER EXCITED to publish a new episode of the Weaviate Podcast with Nandan Thakur (@beirmug) on Search Agents!
Firstly, congratulations to Nandan, who has just completed his Ph.D. at the University of Waterloo, advised by Professor Jimmy Lin (@lintool)!
During this time, Nandan published several impactful works such as BEIR, MIRACL, FreshStack, and many more.
This podcast dives into his new work on ORBIT and the current state of Search Agents!
ORBIT contains 20K training examples, each one a complex, multi-hop question paired with a short verifiable answer. For example, "What was the runtime of the 2017 animated film set inside a smartphone, directed by..." (Answer: 86 minutes).
This dataset is used to train Search Agents on queries that require, say, 4 to 5 searches to answer.
The crazy part is that ORBIT was generated entirely without paid Web Search APIs! The entire pipeline runs on a 2018 Linux laptop driving DeepSeek's free chat interface!
Trained on ORBIT, Qwen3-4B beats InfoSeeker-4B by 4.3 EM and Search-R1-4B by 9.0 EM across 7 Wikipedia QA benchmarks.
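For anyone unfamiliar with EM (exact match): a prediction counts as correct only if it matches a gold answer after light normalization. Here is a minimal sketch, assuming the common SQuAD-style normalization (lowercase, strip punctuation and articles, collapse whitespace). The exact normalization behind the numbers above is not stated here, and the function names are my own for illustration.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("cannot score an empty prediction list")
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

With this kind of scoring, "86 Minutes." matches the gold answer "86 minutes", while "about 86 minutes" does not, which is why short verifiable answers matter for training and evaluation.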
A lot of interesting nuggets in this one! As always, I hope you find it useful, and I'm happy to discuss further!