Most are published on the basis of inflated metrics, break down when probed in realistic application scenarios, and are outperformed by simple baselines.
We built DrEval, a pipeline for unbiased evaluation of these models, to encourage more meaningful progress in the field.