Abstract

Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about which claim to generate at the current step and which surface form to use. Our method, Claim Conditioned Probability (CCP), measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.

Video

Introduction

Large language models (LLMs) have become a ubiquitous and versatile tool for addressing a variety of natural language processing (NLP) tasks. People use these models for tasks including information search Sun et al. (2023b), asking medical questions Thirunavukarasu et al. (2023), and generating new content Sun et al. (2023a). Recently, there has been a notable shift in user behavior, indicating an increasing reliance on and trust in LLMs as primary information sources, often surpassing traditional channels. However, a significant challenge with the spread of these models is their tendency to produce «hallucinations», i.e., factually incorrect generations that contain misleading information Bang et al. (2023); Dale et al. (2023). This is a side-effect of the way modern LLMs are designed and trained Kalai and Vempala (2023).

LLM hallucinations are a major concern because the deceptive content can appear highly coherent and persuasive on the surface. Common examples include the creation of fictitious biographies or the assertion of unfounded claims. The danger is that a few occasional false claims might be easily obscured by a large number of factual statements, making it extremely hard for people to spot them. Since hallucinations in LLM outputs are hard to eliminate completely, users of such systems could at least be warned by highlighting potentially unreliable fragments in the text, and this is where our approach can help.

Fact-checking is a research direction that addresses this problem. It is usually approached using complex systems that leverage external knowledge sources Guo et al. (2022); Nakov et al. (2021); Wadden et al. (2020). This introduces problems related to the incomplete nature of such sources and notable overhead in terms of storing the knowledge. We argue that information about whether a generation is a hallucination is encapsulated in the model output itself, and can be extracted using uncertainty quantification (UQ) Gal et al. (2016); Kotelevskii et al. (2022); Vazhentsev et al. (2022, 2023a). This avoids implementing complex and expensive fact-checking systems that require additional computational overhead and rely on external resources.

Prior work has mainly focused on quantifying uncertainty for the whole generated text and has mostly been limited to tasks such as machine translation Malinin and Gales (2020), question answering Kuhn et al. (2023), and text summarization van der Poel et al. (2022). However, needing an uncertainty score for only a part of the generation substantially complicates the problem. We approach it by leveraging token-level uncertainty scores and aggregating them into claim-level scores. Moreover, we introduce a new token-level uncertainty score, claim-conditioned probability (CCP), which demonstrates consistent improvements over several baselines for seven LLMs and four languages.
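To make this concrete, below is a minimal Python sketch of the idea (an illustration, not the exact CCP formula from the paper). It assumes that for each generated token we can read off the model's probabilities for its top-k alternatives, and it relies on a hypothetical relation() helper (e.g., backed by an NLI model) that labels each alternative as expressing the same claim value, a competing value, or an unrelated continuation.

import math
from typing import Callable, Dict, List

def token_ccp(alternatives: Dict[str, float],
              generated: str,
              relation: Callable[[str, str], str]) -> float:
    """Claim-conditioned probability of a single generated token (illustrative).

    `alternatives` maps candidate tokens (including the generated one) to their
    model probabilities; `relation(candidate, generated)` is assumed to return
    "entail" (same claim value, possibly a different surface form), "contradict"
    (a competing value for the same claim), or "neutral" (the candidate would
    start a different claim altogether).
    """
    same, considered = 0.0, 0.0
    for candidate, prob in alternatives.items():
        rel = relation(candidate, generated)
        if rel == "entail":
            same += prob
            considered += prob
        elif rel == "contradict":
            considered += prob
        # "neutral" candidates reflect uncertainty about *what* to say next,
        # not about the value of this claim, so they are ignored
    return same / considered if considered > 0 else 1.0

def claim_uncertainty(token_ccp_values: List[float]) -> float:
    """Aggregate per-token values for the tokens of one claim into a single
    claim-level uncertainty score (higher = more likely to be unreliable)."""
    log_conf = sum(math.log(max(v, 1e-12)) for v in token_ccp_values)
    return 1.0 - math.exp(log_conf)

Because alternatives that merely start a different claim or rephrase the same value are factored out, the resulting score reflects uncertainty about the claim value itself; a claim can then be flagged when the aggregated uncertainty of its tokens exceeds a chosen threshold.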

To the best of our knowledge, no previous work has investigated the quality of claim-level UQ techniques for LLM generation. For this purpose, we construct a novel benchmark based on fact-checking of biographies of individuals generated by a range of LLMs. Note that different LLMs produce different outputs, which generally have higher variability than outputs in tasks such as machine translation or question answering. Therefore, we compare the predictions and uncertainty scores to the results of FactScore Min et al. (2023), an automatic external fact-checking system. Human evaluation verifies that our constructed benchmark based on FactScore can adequately evaluate the performance of the uncertainty scores.
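With such labels in hand, evaluating a claim-level UQ method reduces to measuring how well its scores rank non-factual claims above factual ones, which is what ROC-AUC captures. A minimal sketch with scikit-learn (toy numbers, assuming binary labels from FactScore or human annotators):

from sklearn.metrics import roc_auc_score

# One entry per annotated claim: an uncertainty score (higher = more suspicious)
# and a binary label (1 = non-factual, 0 = factual) from FactScore or a human annotator.
uncertainty = [0.91, 0.12, 0.45, 0.78, 0.05]   # toy values for illustration only
is_false    = [1,    0,    0,    1,    0]

# A good claim-level UQ method should rank non-factual claims above factual ones;
# ROC-AUC measures exactly this (0.5 = random ranking, 1.0 = perfect).
print(roc_auc_score(is_false, uncertainty))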

Our contributions are as follows:

  • We propose a novel framework for fact-checking LLM generations using token-level uncertainty quantification. We provide a procedure for efficiently estimating the uncertainty of atomic claims generated by a white-box model and for highlighting potentially deceptive fragments by mapping them back to the original response (a minimal sketch of this highlighting step follows the list).
  • We propose a novel method for token-level uncertainty quantification that outperforms baselines and can be used as a plug-in in a fact-checking framework.
  • We design a novel approach to evaluation of token-level UQ methods for white-box LLMs based on fact-checking, which can be applied to other white-box LLMs.
  • We provide an empirical and ablation analysis of the method for fact-checking of LLM generations, and find that the uncertainty scores we produce can help to spot claims with factual errors for seven LLMs over four languages: English, Chinese, Arabic, and Russian.
  • The method is implemented as a part of the LM-Polygraph library Fadeeva et al. (2023). All the code and data for the experiments are publicly available.
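The sketch below illustrates the highlighting step from the first contribution under simplifying assumptions: each atomic claim is assumed to arrive with the character spans of the response that express it and with an already computed uncertainty score (e.g., aggregated CCP). The Claim structure and the 0.5 threshold are illustrative and do not mirror the LM-Polygraph interface.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Claim:
    text: str                        # the atomic claim, e.g. "X was born in 1979"
    spans: List[Tuple[int, int]]     # character spans of the response that express it
    uncertainty: float               # claim-level score, e.g. aggregated CCP

def highlight_unreliable(response: str, claims: List[Claim], threshold: float = 0.5) -> str:
    """Wrap response fragments that belong to high-uncertainty claims in
    [[ ... ]] markers so a user can see which statements need double-checking."""
    flagged = sorted(
        (span for claim in claims if claim.uncertainty > threshold for span in claim.spans),
        reverse=True,
    )
    for start, end in flagged:       # go right-to-left so earlier offsets stay valid
        response = response[:start] + "[[" + response[start:end] + "]]" + response[end:]
    return response

In the actual pipeline the spans and scores come from the white-box model's token distributions; here they are plain inputs to keep the example self-contained.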


Charts

Comparison of token-level uncertainty quantification methods in terms of ROC-AUC, measured on the Chinese dataset. The results are split into bins that consider only the facts from the first 2 sentences, the first 5 sentences, and all sentences.

ROC-AUC of claim-level UQ methods with manual annotation as the ground truth.

Method          Yi 6b, Chinese    Jais 13b, Arabic    GPT-4, Arabic    Vikhr 7b, Russian
CCP (ours)      0.64 ± 0.03       0.66 ± 0.02         0.56 ± 0.05      0.68 ± 0.04
Maximum Prob.   0.52 ± 0.03       0.59 ± 0.02         0.55 ± 0.08      0.63 ± 0.04
Perplexity      0.51 ± 0.04       0.56 ± 0.02         0.54 ± 0.08      0.58 ± 0.04
Token Entropy   0.51 ± 0.04       0.56 ± 0.02         0.54 ± 0.08      0.58 ± 0.04
P(True)         0.52 ± 0.03       0.59 ± 0.02         0.55 ± 0.08      0.63 ± 0.04
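The ± values are error bars over the annotated claims. One common way to obtain such estimates is bootstrapping over claims, sketched below; this is an assumption for illustration, and the paper may compute the intervals differently.

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_std(labels, scores, n_boot: int = 1000, seed: int = 0):
    """ROC-AUC on the full set of claims plus a bootstrap standard deviation."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))   # resample claims with replacement
        if labels[idx].min() == labels[idx].max():        # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(labels[idx], scores[idx]))
    return roc_auc_score(labels, scores), float(np.std(aucs))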

Number of claims

Vicuna 13b, English: 100
Yi 6b, Chinese: 1,603
GPT-4, Arabic: 200
Vikhr 7b, Russian: 146

Datasets and Statistics

For Arabic, we use GPT-4 to generate 100 biographies of people randomly selected from the list of the most visited pages in Arabic Wikipedia. The Arabic prompt is a translation of: «Tell me the biography of {person name}». To extract claims, we prompt GPT-4 as follows: «Convert the following biography into Arabic atomic factual claims that can be verified, one claim per line. Biography is: {biography}». The Arabic biographies and claims are translated into English using Google Translate. It is worth mentioning that almost one-third of the names in the list are foreign.

For the Jais 13b experiments, we use the same prompts as for GPT-4. We notice that the biographies generated by Jais 13b are much shorter than the ones generated by GPT-4 (roughly half the length). Similarly, we use GPT-4 to extract claims from the generated biographies. On average, biographies generated by Jais 13b contain nine claims. Jais 13b generates empty biographies for seven names (out of 100), with response messages like: «I am sorry! I cannot provide information about {name}», or «What do you want to know exactly?». Two random claims from each biography are verified manually (186 claims in total).

Since FactScore only supports English, for Arabic, Chinese, and Russian we generate biographies of well-known people and annotate them manually. We also manually annotate English claims generated by Vicuna 13b. The statistics for the annotated datasets are presented in Table 7.

For Chinese, we first prompt ChatGPT to generate a list of famous people. We then follow the same procedure as for Arabic, but with the prompts translated into Chinese, to obtain biographies and claims. We use Yi 6b to generate the texts and GPT-4 to split them into atomic claims.

For Russian, we follow a similar approach: we prompt the model to generate a list of 100 famous people and manually check the result to obtain representative personalities from different areas, such as science, sport, literature, art, public activity, cinematography, heroes, etc.
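For all languages, the recipe above boils down to two chat calls per person: one call to generate a biography (with GPT-4 for the Arabic GPT-4 dataset, or with the evaluated open model otherwise) and one GPT-4 call to split the text into atomic claims. A minimal sketch of the GPT-4 path, assuming the openai Python client and reusing the prompts quoted above (the exact prompts and decoding settings in the paper may differ):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_biography(name: str) -> str:
    # In practice the prompt is translated into the target language.
    prompt = f"Tell me the biography of {name}"
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def extract_claims(biography: str, language: str = "Arabic") -> list[str]:
    prompt = (
        f"Convert the following biography into {language} atomic factual claims "
        f"that can be verified, one claim per line. Biography is: {biography}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # One claim per line, as requested in the prompt.
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]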

