Also what I've seen.
I think this image illustrates the capabilities of large language models very effectively.
LLMs are great at recombining existing knowledge. So, for questions outside your domain of expertise, or far from the frontier of knowledge, they are often much better than the average human. Here they can be incredibly helpful.
However, as you move closer to the frontier of knowledge, they get much worse.
Here, even an average human can do better.
I have seen this many times firsthand. When I work with an LLM at the frontier of knowledge, it often makes absurd mistakes that no intelligent person would make:
internal contradictions within a few lines, dramatic forgetting of what happened two interactions earlier, and so on.
This limitation is built into how the model works: it is trained to approximate the most likely continuation of the preceding input.
If there is enough relevant structure in the training data, it can perform very well. If it does not really know where to go, the output quickly becomes messy, and randomness takes over.
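To make the "most likely continuation" point concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the four candidate tokens and both sets of scores are made up and do not come from any real model, and a real LLM works over a vocabulary of many thousands of tokens. It only shows the general idea that a sharply peaked next-token distribution gives stable output, while a nearly flat one leaves the choice mostly to chance.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits: higher means a more uncertain distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sample_counts(tokens, probs, n, seed=0):
    """Draw n next tokens from the distribution and count each outcome."""
    rng = random.Random(seed)
    return Counter(rng.choices(tokens, weights=probs, k=n))


# Hypothetical candidate continuations and made-up scores.
tokens = ["A", "B", "C", "D"]
well_covered = softmax([8.0, 2.0, 1.0, 0.0])    # strong structure: one clear winner
poorly_covered = softmax([1.0, 0.9, 1.1, 1.0])  # little structure: nearly flat

for name, probs in [("well covered", well_covered),
                    ("poorly covered", poorly_covered)]:
    counts = sample_counts(tokens, probs, n=1000)
    print(f"{name}: entropy = {entropy_bits(probs):.2f} bits, "
          f"samples = {dict(counts)}")
```

In the first case almost every draw picks the same token. In the second case the draws spread across all four candidates, which is roughly what "randomness takes over" looks like at the level of a single token.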