Conclusions
We hope that the statistical lens in our paper clarifies the nature of hallucinations and pushes back on common misconceptions:
Claim: Hallucinations will be eliminated by improving accuracy because a 100% accurate model never hallucinates.
Finding: Accuracy will never reach 100% because, regardless of model size, search and reasoning capabilities, some real-world questions are inherently unanswerable.
Claim: Hallucinations are inevitable.
Finding: They are not, because language models can abstain when uncertain.
Claim: Avoiding hallucinations requires a degree of intelligence which is exclusively achievable with larger models.
Finding: It can be easier for a small model to know its limits. For example, when asked a question in Māori, a small model that knows no Māori can simply say “I don’t know”, whereas a model that knows some Māori has to determine its confidence. As discussed in the paper, being “calibrated” requires much less computation than being accurate.
Claim: Hallucinations are a mysterious glitch in modern language models.
Finding: We understand the statistical mechanisms through which hallucinations arise and are rewarded in evaluations.
Claim: To measure hallucinations, we just need a good hallucination eval.
Finding: Hallucination evals have been published. However, a good hallucination eval has little effect against hundreds of traditional accuracy-based evals that penalize humility and reward guessing. Instead, all of the primary eval metrics need to be reworked to reward expressions of uncertainty.
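One concrete way to read “rework the metrics to reward expressions of uncertainty”: score abstentions at zero and confident wrong answers at a negative value, so that guessing only pays off in expectation above some confidence threshold. Here is a minimal sketch of that kind of scorer in Python; the penalty value, the “I don’t know” marker, and the names are illustrative choices, not taken from the paper:

```python
from dataclasses import dataclass

IDK = "I don't know"  # abstention marker; any explicit "not sure" signal works

@dataclass
class Item:
    prediction: str  # model output, or IDK to abstain
    reference: str   # gold answer

def uncertainty_aware_score(items: list[Item], penalty: float = 2.0) -> float:
    """Correct: +1, abstain: 0, wrong: -penalty (averaged over items)."""
    total = 0.0
    for item in items:
        if item.prediction == IDK:
            continue                      # abstention earns 0, costs nothing
        if item.prediction == item.reference:
            total += 1.0                  # correct answer
        else:
            total -= penalty              # confident but wrong: a hallucination
    return total / len(items)

if __name__ == "__main__":
    items = [
        Item("Paris", "Paris"),          # knows the answer
        Item(IDK, "Wellington"),         # honest abstention
        Item("Auckland", "Wellington"),  # hallucinated guess
    ]
    print(f"score: {uncertainty_aware_score(items):.2f}")  # (1 + 0 - 2) / 3 ≈ -0.33
```

With a penalty of 2, answering is only worth it in expectation when the model is more than about 67% confident (p − 2(1 − p) > 0 requires p > 2/3); below that, saying “I don’t know” scores better than guessing, which is exactly the behavior plain accuracy metrics punish.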
This further points to the solution being smaller models that know less and are trained for narrower tasks, instead of gargantuan models that burn an insane amount of resources to answer easy questions. Route each query to a smaller, more specialized model based on what it’s asking. This was part of the motivation behind MoE models, but I think there are other architectures and paradigms to explore.
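For what it’s worth, the routing idea can be sketched very simply: a cheap classifier (or a learned gating network, as in MoE) decides which small specialist handles each query. Everything below is hypothetical stand-in code for that idea, not any particular system’s API:

```python
from typing import Callable

def classify_topic(query: str) -> str:
    """Toy router: keyword matching stands in for a learned classifier or gate."""
    q = query.lower()
    if any(k in q for k in ("integral", "solve", "equation", "prime")):
        return "math"
    if any(k in q for k in ("function", "compile", "traceback", "segfault")):
        return "code"
    return "general"

# Registry of small specialist "models"; plain callables stand in for real inference.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "math":    lambda q: f"[math specialist] {q}",
    "code":    lambda q: f"[code specialist] {q}",
    "general": lambda q: f"[general specialist] {q}",
}

def route(query: str) -> str:
    """Send the query to whichever small model the router picks."""
    return SPECIALISTS[classify_topic(query)](query)

if __name__ == "__main__":
    print(route("Solve the equation x^2 - 4 = 0"))
    print(route("Why does this function segfault?"))
```

The interesting design question is what the router itself is: keyword rules like this toy, an embedding lookup, or a gate trained jointly with the experts as in MoE.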
Finding: Accuracy will never reach 100% because, regardless of model size, search and reasoning capabilities, some real-world questions are inherently unanswerable.
Translation: PEBKAC. You asked the wrong question.
Took a look cause, as frustrating as it’d be, it’d still be a step in the right direction. But no, they’re still adamant that these are just a “quirk”.
Infuriating.
Got a link or a title I can google to find the full paper? I’d be really interested in reading it.
Translation: PEBKAC. You asked the wrong question.
Basically “You must be prompting it wrong!”