Reliable uncertainty estimation is essential for deploying LLMs in high-stakes domains. Prior work has focused on aleatoric uncertainty (AU), measuring a model's internal confidence through response consistency or verbalized scores. However, this approach fails when models are confidently wrong, producing the same incorrect answer repeatedly. Estimating epistemic uncertainty (EU), which reflects uncertainty in the model itself, offers a solution but traditionally requires costly model training.
To address this, the paper leverages a small ensemble of open-weight LLMs to estimate EU from cross-model semantic disagreement without additional training, combining it with AU to form a more robust total uncertainty (TU) metric.
A Framework for Total Predictive Uncertainty
AU captures the inherent randomness in a model's responses: when a model produces semantically diverse outputs for the same input, AU is high, whereas when responses are consistent, AU is low. It is measured by sampling multiple responses from the same model and calculating their semantic similarity.
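The sampling-based estimate described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the toy token-overlap `semantic_similarity` function is a stand-in assumption for whatever semantic scorer (e.g., an NLI model or embedding similarity) a real system would use.

```python
from itertools import combinations

def semantic_similarity(a: str, b: str) -> float:
    # Toy token-overlap (Jaccard) score in [0, 1]; a real system would
    # use an NLI model or embedding cosine similarity instead
    # (this stand-in is an assumption, not the paper's scorer).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def aleatoric_uncertainty(samples: list[str]) -> float:
    # AU as mean pairwise semantic dissimilarity among responses
    # sampled from the same model: consistent answers give low AU,
    # semantically diverse answers give high AU.
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - semantic_similarity(a, b) for a, b in pairs) / len(pairs)

consistent = ["Paris is the capital of France."] * 3
diverse = ["Paris is the capital of France.",
           "The capital is Lyon.",
           "Marseille, I believe."]
print(aleatoric_uncertainty(consistent))  # 0.0 — identical answers
print(aleatoric_uncertainty(diverse) > aleatoric_uncertainty(consistent))  # True
```

Any pairwise scorer in [0, 1] can be dropped in for the toy one; the averaging structure is what matters.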
EU, however, addresses a different question: whether the chosen model is actually the right one for the task. Even when a model is internally consistent, meaning it has low AU, it can still be confidently wrong. The authors measure EU by comparing the reference model’s responses with those from an ensemble of other models. When the reference model’s responses are semantically similar to the ensemble’s, EU is low; when they diverge, EU is high. Because an ideal “perfect” model is not available, the ensemble serves as a practical proxy.
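The cross-model comparison can be sketched the same way: score the reference model's responses against responses drawn from the auxiliary ensemble. Again, the overlap-based `similarity` function is a toy assumption standing in for a real semantic scorer.

```python
def similarity(a: str, b: str) -> float:
    # Toy token-overlap score in [0, 1], standing in for a semantic
    # similarity model (an illustrative assumption, not the paper's scorer).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def epistemic_uncertainty(reference_samples: list[str],
                          ensemble_samples: list[str]) -> float:
    # EU as mean dissimilarity between the reference model's responses
    # and responses from the auxiliary ensemble, which serves as a
    # practical proxy for the unavailable "perfect" model.
    scores = [1.0 - similarity(r, e)
              for r in reference_samples
              for e in ensemble_samples]
    return sum(scores) / len(scores)

# A confidently wrong model: internally consistent (so AU is low)
# but diverging from the ensemble (so EU is high).
reference = ["The answer is 7.", "The answer is 7."]
ensemble = ["The answer is 12.", "It equals 12.", "The answer is 12."]
print(epistemic_uncertainty(reference, ensemble))
```

Note that the reference model's self-consistency plays no role here; only its divergence from the ensemble does, which is exactly what lets EU catch confidently wrong answers.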
TU is then defined as the sum of AU and EU. Empirically, the authors estimate these metrics by sampling responses: two per model for TU across five models, and ten responses from the reference model for AU alone, to keep sampling budgets comparable. The experimental setup uses five instruction-tuned 7–9 billion-parameter models as the auxiliary ensemble and evaluates performance across ten diverse tasks, including question answering, math reasoning, translation, and summarization. Correctness is determined using a larger language model as a judge, while uncertainty quality is assessed using the area under the receiver operating characteristic curve (AUROC) and selective prediction metrics such as risk-coverage curves.
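The TU combination and the AUROC-based evaluation can be illustrated with a short sketch. The per-question scores below are hypothetical, and the AUROC here is the standard rank-based (Mann–Whitney) formulation, not any code from the paper.

```python
def auroc(uncertainty: list[float], is_incorrect: list[int]) -> float:
    # AUROC of an uncertainty score as a detector of incorrect answers:
    # the probability that a randomly chosen incorrect answer receives
    # a higher uncertainty than a randomly chosen correct one
    # (ties count as 0.5).
    pos = [u for u, y in zip(uncertainty, is_incorrect) if y]
    neg = [u for u, y in zip(uncertainty, is_incorrect) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-question scores; TU is simply AU + EU.
au = [0.1, 0.1, 0.6, 0.2]
eu = [0.2, 0.8, 0.3, 0.1]
tu = [a + e for a, e in zip(au, eu)]
wrong = [0, 1, 1, 0]  # 1 = judged incorrect by the LLM judge

# Question 2 is confidently wrong: low AU but high EU, so AU alone
# ranks it as "safe" while TU correctly ranks it as risky.
print(auroc(au, wrong))
print(auroc(tu, wrong))  # higher: TU separates correct from incorrect better
```

In this toy setup the second question mirrors the overconfident-failure case the paper targets: adding EU moves it to the top of the uncertainty ranking, lifting the AUROC.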
Complementary Strengths of EU and AU
The results demonstrate that EU effectively complements AU by identifying cases where models are confidently wrong. When analyzing an aggregated dataset, the authors found that in the low-AU regime, where models appear internally confident, incorrect responses consistently showed higher EU than correct ones. This confirms that EU flags overconfident failures that AU alone misses, challenging prior assumptions that low-AU predictions are inherently reliable.
The effectiveness of EU varies depending on the characteristics of the task. The analysis shows that EU performs best on tasks with a single correct answer, where models tend to produce similar phrasing when correct but diverge when uncertain. In settings with high model agreement and low complementarity, such as translation and conversational question answering (QA), EU provides strong discrimination.
By contrast, EU is less informative for tasks that allow multiple valid responses, such as summarization, where variation across model outputs is expected rather than indicative of uncertainty.
TU consistently improves correctness calibration across all benchmarks compared to AU alone. The largest gains occur on complex reasoning tasks like HotpotQA and high-accuracy tasks like translation, with AUROC improvements of up to 0.15. TU also outperforms multiple baselines, including semantic entropy and self-consistency scores.
In selective prediction experiments, where models abstain from answering when uncertain, TU achieves lower risk across all coverage levels compared to AU. TU improves selective accuracy at fixed coverage rates and reduces the area under the risk-coverage curve by over 20% on certain tasks, confirming that combining both uncertainty types leads to more reliable abstention decisions.
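The selective prediction evaluation can be sketched as follows: answer questions in order of increasing uncertainty, and at each coverage level measure the risk (error rate) among the answers retained so far. The discrete area-under-the-curve estimate below is a simple mean of per-step risks, an illustrative simplification rather than the paper's exact metric.

```python
def risk_coverage_curve(uncertainty: list[float],
                        is_incorrect: list[int]) -> list[tuple[float, float]]:
    # Answer the most confident questions first; at coverage k/n the
    # risk is the error rate among the k retained (lowest-uncertainty)
    # answers. A good uncertainty score keeps risk low at low coverage.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    errors, curve = 0, []
    for k, i in enumerate(order, start=1):
        errors += is_incorrect[i]
        curve.append((k / len(order), errors / k))
    return curve

def area_under_rc(curve: list[tuple[float, float]]) -> float:
    # Discrete AURC estimate: mean of per-step risks (lower is better).
    return sum(risk for _, risk in curve) / len(curve)

uncertainty = [0.9, 0.1, 0.7, 0.2]  # hypothetical TU scores
wrong = [1, 0, 1, 0]
curve = risk_coverage_curve(uncertainty, wrong)
print(curve)  # risk stays 0 until the uncertain, wrong answers enter
print(area_under_rc(curve))
```

Here the wrong answers also carry the highest uncertainty, so risk is zero at low coverage and the AURC is small; an uninformative score would mix errors in at every coverage level and inflate the area.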
What This Means for Real-World LLM Deployment
In conclusion, this paper demonstrates that AU and EU capture complementary failure modes in language models. While self-consistency methods reveal data ambiguity, cross-model semantic disagreement uncovers uncertainty arising from model limitations. By combining both into TU using only black-box access to model outputs, the proposed approach consistently outperforms self-consistency-based methods across diverse models and tasks. However, the method has limitations. It struggles with tasks with multiple valid responses, relies on the quality of the model ensemble, and depends on a correctness judge. Future work should explore combining this approach with other uncertainty estimators and extending it to broader applications.
Sources:
Journal Reference
Hamidieh, K., Thost, V., Gerych, W., Yurochkin, M., & Ghassemi, M. (2026). Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification. OpenReview.
https://openreview.net/forum?id=lOoRJo8xWy
Zewe, A. (2026, March). A better method for identifying overconfident large language models. MIT News | Massachusetts Institute of Technology.
https://news.mit.edu/2026/better-method-identifying-overconfident-large-language-models-0319