ALT Spearman correlations between rankings produced by human-assessed quality facets (F1-F4), automated metrics (M1-M7), and combined pairwise system rankings (PW-Combined) on the Cochrane MSLR dataset. Rankings from automated metrics are highly correlated as a group, except for PIO-Overlap (A). PIO-Overlap rankings are strongly correlated with rankings from human-assessed facets, especially PIO agreement (B). The metrics most strongly associated with PW-Combined rankings are Delta-EI and PIO-Overlap (C). Rankings from commonly reported automated metrics like ROUGE and BERTScore are either uncorrelated or anti-correlated with human-assessed system rankings (D).
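For readers unfamiliar with the statistic in the figure: a Spearman correlation between two system rankings is the Pearson correlation of the rank positions. The sketch below is a minimal pure-Python illustration, not the authors' evaluation code. The function names and the example scores are hypothetical and chosen only to show a correlated and an anti-correlated case.

```python
def ranks(scores):
    """Return 1-based ranks (highest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of systems tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(scores_a, scores_b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(scores_a), ranks(scores_b))


# Hypothetical per-system scores (one entry per summarization system).
human_facet = [0.9, 0.7, 0.5, 0.2]   # made-up human-assessed scores
metric_same = [40.0, 31.0, 25.0, 10.0]  # made-up metric that orders systems the same way
metric_flip = [10.0, 25.0, 31.0, 40.0]  # made-up metric that orders systems in reverse

rho_same = spearman(human_facet, metric_same)
rho_flip = spearman(human_facet, metric_flip)
```

In this toy setup `rho_same` is close to 1 and `rho_flip` is close to -1, which correspond to the "strongly correlated" and "anti-correlated" ends of the scale described in the figure.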
MS^2 focuses on extraction and summarization in the review pipeline. We harvest 20K systematic reviews and 470K of their references from Semantic Scholar, identify summary targets, and experiment with multi-document summarization methods. 2/3
AI safety will be an important part of any system performing these tasks in the wild. There’s a lot of work to do to ensure the quality and reliability of model outputs. We encourage the community to work on these challenging and important problems! 3/3
1/ New work by Alican (@alicanb_) and Babak (@BabakEsmaeili10): "Evaluating Combinatorial Generalization in Variational Autoencoders" (arxiv.org/abs/1911.04594)
In this paper we ask the question: "To what extent do VAEs generalize to unseen combinations of features?" (thread)
#NLProc does not have a standard benchmark for interpretability. I am stoked to announce ERASER: the first-ever effort to unify and standardize NLP tasks with the goal of interpretability.
eraserbenchmark.com/