New blog: I am worried by NLP research culture
NLG and NLP are mostly much better in 2026 than when I got my PhD in 1990. Unfortunately research culture has gotten *worse* in this period, which really worries me as I retire.
ehudreiter.com/2026/06/08/nl…
Took lots of nice pictures at my Retroeval retirement symposium. One of my favourites was me with my first and last PhD students, Sandra Williams and Yujun Wang!
I'm looking forward to my retirement workshop, which starts on Monday 1 June!! Will be great to catch up with former students and colleagues, and also discuss NLG evaluation.
retroeval.github.io/
I remember seeing very dubious advice from OpenAI a few years ago on evaluation. So I was happy to see quite sensible recent advice from Anthropic on evaluation
anthropic.com/engineering/de…
Really interesting scoping review that points out numerous flaws in LLM-as-Judge evaluation in healthcare, including minimal human oversight, absent bias testing, model monoculture, ignored implicit eval components, no check for consistency over time (etc)
arxiv.org/abs/2604.25933
Someone asked me what were the highlights of my career, I responded with a list of papers which I was proud of. I did not mention grants, awards, jobs, etc. I know some people are proudest of their grants (etc), but for me it was always scientific outputs.
I wrote a paper on "NLG Evaluation: Past, Present, Future" for Retroeval. Eval has changed enormously over my career! In future, I expect more on stuff relevant to real-world usage, including impact, qualitative studies, safety in worst/adversarial case
arxiv.org/abs/2605.23715
New blog: Software engineering of prompts
Creating complex prompts for LLMs faces software engineering challenges similar to those of conventional software (requirements, design, testing, maintenance). We need to understand good software engineering for prompts.
ehudreiter.com/2026/05/20/so…
Congrats to my student Jawwad Baig for passing his PhD viva! Topic was “Data-to-Text NLG Feedback for Safer Driving”. Jawwad did his PhD part-time (ie, evenings and weekends while he worked full-time) and remote (lives in England), which is very tough, but he still completed
The Call for Papers for #INLG2026 is out!
🗓️ Submit by July 15 (AoE)
💍 ARR commit by August 5
🆕 Squibs welcomed (raising an issue without needing to solve it)
🆕 Non-archival track for WIP
📍Utrecht, NL — Oct 17–21, just before EMNLP
2026.inlgmeeting.org/calls.h… #NLProc #INLG
The penalty is a 1-year ban from arXiv followed by the requirement that subsequent arXiv submissions must first be accepted at a reputable peer-reviewed venue. 4/
Have now resigned as ARR (meta-)reviewer. I will continue to do some reviewing after retirement, but not ARR. I don't think mega-confs are the right way to present important research findings, and ARR reviewing is not enjoyable, eg I have no control over what I am asked to review
Asked about students cheating in CS using AI. Said I was not concerned about cheating distorting marks, but was very concerned that it demotivated students from learning. I gave an assessment which AI cannot do, and the failure rate skyrocketed compared to the previous year
ehudreiter.com/2026/05/05/ai…
New blog: AI and CS Teaching
How will AI impact CS teaching? Biggest challenge is adapting what we teach to a world where AI assistants are heavily used. We should also use AI tutors. Least important is making assessments more resistant to AI cheating
ehudreiter.com/2026/05/05/ai…
What % of the NLP papers measure their impact in the real world? This paper proposes an "impact evaluation" of NLP models or systems for real-world usage, changing the research culture of NLP to focus more on real-world impact and less on SOTA-chasing:
doi.org/10.1162/COLI.a.18
Our final year UG students turn in their honours projects today. Supervising projects is the nicest part of teaching for me - always learning something, and great to supervise students 1-1. Really nice projects this year on evaluating LLMs in the real world, and digital humanities.