Briefly, we ran a systematic evaluation of three kinds of approaches (rule-based, feature-based ML, and end-to-end), using the task of generating referring expressions in context as an example.
We examined two datasets (WEBNLG and WSJ) and performed both automatic and human evaluations.