This paper warns that using large language models for labeling text can often lead to wrong research conclusions.
The authors call the problem LLM hacking: the final statistical conclusion flips depending on which model, prompt, or settings are chosen, not on the underlying data.
Across 37 real research tasks and 18 different models, the conclusions drawn from LLM labels were incorrect in about 31% to 50% of cases.
These errors include missing real effects, inventing effects that are not there, reporting the wrong direction, or exaggerating the size of an effect.
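To make the four error types concrete, here is a minimal sketch that compares a conclusion drawn from LLM labels against a reference conclusion (for example, one drawn from human labels). The function name, the category strings, and the `exaggeration_ratio` cutoff of 2.0 are illustrative assumptions, not definitions taken from the paper.

```python
def classify_conclusion(ref_effect, ref_significant,
                        llm_effect, llm_significant,
                        exaggeration_ratio=2.0):
    """Label how an LLM-based conclusion differs from a reference one.

    All names and the exaggeration_ratio cutoff are hypothetical choices
    made for illustration only.
    """
    if ref_significant and not llm_significant:
        return "missed effect"        # a real effect goes undetected
    if not ref_significant and llm_significant:
        return "invented effect"      # an effect is reported that is not there
    if not ref_significant and not llm_significant:
        return "agree (no effect)"
    # Both analyses report a significant effect: compare sign, then size.
    if ref_effect * llm_effect < 0:
        return "wrong direction"
    if abs(llm_effect) > exaggeration_ratio * abs(ref_effect):
        return "exaggerated size"
    return "agree (effect)"


print(classify_conclusion(0.30, True, 0.05, False))
print(classify_conclusion(0.02, False, 0.25, True))
print(classify_conclusion(0.30, True, -0.20, True))
print(classify_conclusion(0.10, True, 0.45, True))
```

Each call above is constructed to land in a different one of the four error categories listed in the summary.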
The risk is especially high when results are near the usual significance cutoff, which is where many social science studies operate.
They also show that 100 human labels can be more reliable than 100K LLM labels, especially for avoiding false discoveries.
Correction methods that adjust results after the fact do not really solve the issue, since they reduce one type of error but increase another.
Finally, they show it is very easy for someone to deliberately game results by trying different models and prompts until they get the answer they want.
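The toy simulation below illustrates why this kind of searching works. It is not the paper's method: it assumes two groups with no true difference, and it stands in for each model or prompt configuration with an annotator that flips each true label independently with a fixed probability. All names and parameter values (`error_rate`, `configs`, `trials`, and so on) are assumptions chosen for illustration.

```python
import math
import random


def two_proportion_p_value(x1, x2, n):
    """Two-sided z-test p-value for equal proportions, n items per group."""
    pooled = (x1 + x2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (x1 / n - x2 / n) / se
    return math.erfc(abs(z) / math.sqrt(2))


def noisy_count(true_labels, error_rate, rng):
    """Count positive labels after flipping each label with prob. error_rate."""
    return sum(label ^ (rng.random() < error_rate) for label in true_labels)


def simulate(trials=200, n=100, configs=20, error_rate=0.2,
             alpha=0.05, seed=0):
    """Return (rate using one fixed config, rate using the best of many)."""
    rng = random.Random(seed)
    first_hits = 0
    any_hits = 0
    for _ in range(trials):
        # Both groups share the same true positive rate: no real effect.
        group_a = [int(rng.random() < 0.5) for _ in range(n)]
        group_b = [int(rng.random() < 0.5) for _ in range(n)]
        p_values = [
            two_proportion_p_value(
                noisy_count(group_a, error_rate, rng),
                noisy_count(group_b, error_rate, rng),
                n,
            )
            for _ in range(configs)
        ]
        first_hits += p_values[0] < alpha    # commit to one config up front
        any_hits += min(p_values) < alpha    # keep trying until one "works"
    return first_hits / trials, any_hits / trials


honest_rate, hacked_rate = simulate()
print(f"false discovery rate, one fixed config: {honest_rate:.2f}")
print(f"false discovery rate, best of 20 configs: {hacked_rate:.2f}")
```

Because the "best of many" result includes the first configuration, its false discovery rate can never be lower than the fixed-configuration rate, and in this setup it should come out noticeably higher even though no real effect exists.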
----
Paper – arxiv.org/abs/2509.08825
Paper Title: "LLM Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation"