The "most accurate" LLM for systematic-review screening silently discarded 63% of the relevant papers.
Our new open-access paper @ISTJrnal (Information and Software Technology) shows why standard metrics mislead — and what to use instead. 🧵
doi.org/10.1016/j.infsof.202…
LLM4SCREENLIT = recommendations for authors AND a one-page checklist for editors/reviewers, split by study type (benchmarking vs deployment). Validated on 9 LLMs × 24 SE secondary studies (34,528 articles). With Prof. Barbara Kitchenham & @ProfMShepperd.
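A minimal sketch of how a screener can score well on accuracy while missing most of the relevant papers. The confusion-matrix counts are hypothetical, chosen only so the miss rate matches the 63% quoted above; they are not data from the paper, and `screening_metrics` is an illustrative helper, not part of LLM4SCREENLIT.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, miss_rate) for a binary include/exclude screener.

    tp: relevant papers correctly included
    fp: irrelevant papers wrongly included
    fn: relevant papers wrongly excluded (silently lost to the review)
    tn: irrelevant papers correctly excluded
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    return accuracy, recall, 1 - recall


# Hypothetical corpus: 1,000 candidate papers, only 100 of them relevant.
accuracy, recall, miss_rate = screening_metrics(tp=37, fp=20, fn=63, tn=880)
print(f"accuracy={accuracy:.3f} recall={recall:.2f} missed={miss_rate:.0%}")
```

Because irrelevant papers dominate the corpus, correct exclusions swamp the accuracy figure, while recall exposes how many relevant papers were lost.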
Personalized Share Link to our new @ISTJrnal paper "Test case prioritization: A systematic review using snowballing and TCPFramework with approach combinators":
authors.elsevier.com/a/1me%7…
📊 How does it work?
The graphical abstract below presents a simplified view of our framework.
Test suites are passed through different combinations of simple models to produce a highly efficient test ordering—without the need for heavy computation.
👇
💡 The Results:
By integrating existing strategies, approach combinators consistently improve regression testing.
The ultimate takeaway? We achieve state-of-the-art TCP performance across diverse software projects! 🏆
#QA #CICD #AcademicTwitter
Tests based on p̂ always had better or equal power than tests based on Cliff’s d, and across all but one simulation condition, p̂ Type 1 error rates were less biased.
Conclusions: Using p̂ is a low-risk option for analysing and meta-analysing data from small sample-size SE randomized experiments. Parametric methods are only preferable if you have prior knowledge of the data distribution.
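For readers unfamiliar with the two effect sizes, here is a minimal sketch of how they are usually defined, assuming the standard formulas: p̂ estimates P(X > Y) + 0.5·P(X = Y) (the probability of superiority) and Cliff's d estimates P(X > Y) − P(X < Y), so that d = 2p̂ − 1. The sample values and function names are hypothetical, and the sketch computes only the effect sizes, not the significance tests compared in the paper.

```python
from itertools import product


def pairwise_counts(treatment, control):
    """Count treatment/control pairs where treatment is greater, equal, or less."""
    greater = ties = less = 0
    for x, y in product(treatment, control):
        if x > y:
            greater += 1
        elif x == y:
            ties += 1
        else:
            less += 1
    return greater, ties, less


def p_hat(treatment, control):
    """Probability of superiority: P(X > Y) + 0.5 * P(X = Y)."""
    greater, ties, less = pairwise_counts(treatment, control)
    return (greater + 0.5 * ties) / (greater + ties + less)


def cliffs_d(treatment, control):
    """Cliff's d: P(X > Y) - P(X < Y)."""
    greater, ties, less = pairwise_counts(treatment, control)
    return (greater - less) / (greater + ties + less)


# Hypothetical small-sample experiment (illustrative values only).
treatment = [5, 7, 8, 9]
control = [4, 5, 6, 7]
print(p_hat(treatment, control), cliffs_d(treatment, control))
```

Both statistics carry the same information about the samples; the post's claim concerns the tests built on them, which behave differently in small samples.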