Will GPT-4 label this paper as relevant to the query "Baby Yoda"?
In this study with @IR_oldie, @pt_ir and F. Scholer, we found that despite good performance in aggregate, e.g. human-like measures of Cohen’s κ and Krippendorff’s α, competitive LLMs are likely to be influenced by the presence of query words in the labelled passages, even if those passages are constructed from random words. (1/6)