Reliable uncertainty estimation is essential for deploying LLMs in high-stakes domains. Prior work has focused on aleatoric uncertainty (AU), measuring a model's internal confidence through response consistency or verbalized scores. However, this approach fails when models are confidently wrong, producing the same incorrect answer repeatedly. Estimating epistemic uncertainty (EU), which reflects uncertainty in the model itself, offers a solution but traditionally requires costly model training.
To address this, the paper leverages a small ensemble of open-weight LLMs to estimate EU from cross-model semantic disagreement without additional training, combining it with AU to form a more robust total uncertainty (TU) metric.
A Framework for Total Predictive Uncertainty
AU captures the inherent randomness in a model's responses: when a model produces semantically diverse outputs for the same input, AU is high, whereas when responses are consistent, AU is low. It is measured by sampling multiple responses from the same model and calculating their semantic similarity.
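The sampling-based estimate described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the toy token-overlap `semantic_similarity` function is a stand-in assumption for whatever semantic scorer (e.g., an NLI model or embedding similarity) a real system would use.

```python
from itertools import combinations

def semantic_similarity(a: str, b: str) -> float:
    # Toy token-overlap (Jaccard) score in [0, 1]; a real system would
    # use an NLI model or embedding cosine similarity instead
    # (this stand-in is an assumption, not the paper's scorer).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def aleatoric_uncertainty(samples: list[str]) -> float:
    # AU as mean pairwise semantic dissimilarity among responses
    # sampled from the same model: consistent answers give low AU,
    # semantically diverse answers give high AU.
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - semantic_similarity(a, b) for a, b in pairs) / len(pairs)

consistent = ["Paris is the capital of France."] * 3
diverse = ["Paris is the capital of France.",
           "The capital is Lyon.",
           "Marseille, I believe."]
print(aleatoric_uncertainty(consistent))  # 0.0 — identical answers
print(aleatoric_uncertainty(diverse) > aleatoric_uncertainty(consistent))  # True
```

Any pairwise scorer in [0, 1] can be dropped in for the toy one; the averaging structure is what matters.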
EU, however, addresses a different question: whether the chosen model is actually the right one for the task. Even when a model is internally consistent, meaning it has low AU, it can still be confidently wrong. The authors measure EU by comparing the reference model’s responses with those from an ensemble of other models. When the reference model’s responses are semantically similar to the ensemble’s, EU is low; when they diverge, EU is high. Because an ideal “perfect” model is not available, the ensemble serves as a practical proxy.
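The cross-model comparison can be sketched the same way: score the reference model's responses against responses drawn from the auxiliary ensemble. Again, the overlap-based `similarity` function is a toy assumption standing in for a real semantic scorer.

```python
def similarity(a: str, b: str) -> float:
    # Toy token-overlap score in [0, 1], standing in for a semantic
    # similarity model (an illustrative assumption, not the paper's scorer).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def epistemic_uncertainty(reference_samples: list[str],
                          ensemble_samples: list[str]) -> float:
    # EU as mean dissimilarity between the reference model's responses
    # and responses from the auxiliary ensemble, which serves as a
    # practical proxy for the unavailable "perfect" model.
    scores = [1.0 - similarity(r, e)
              for r in reference_samples
              for e in ensemble_samples]
    return sum(scores) / len(scores)

# A confidently wrong model: internally consistent (so AU is low)
# but diverging from the ensemble (so EU is high).
reference = ["The answer is 7.", "The answer is 7."]
ensemble = ["The answer is 12.", "It equals 12.", "The answer is 12."]
print(epistemic_uncertainty(reference, ensemble))
```

Note that the reference model's self-consistency plays no role here; only its divergence from the ensemble does, which is exactly what lets EU catch confidently wrong answers.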
TU is then defined as the sum of AU and EU. Empirically, the authors estimate these metrics by sampling responses: two per model for TU across five models, and ten responses from the reference model for AU alone, to keep sampling budgets comparable. The experimental setup uses five instruction-tuned 7–9 billion-parameter models as the auxiliary ensemble and evaluates performance across ten diverse tasks, including question answering, math reasoning, translation, and summarization. Correctness is determined using a larger language model as a judge, while uncertainty quality is assessed using the area under the receiver operating characteristic curve (AUROC) and selective prediction metrics such as risk-coverage curves.
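The TU combination and the AUROC-based evaluation can be illustrated with a short sketch. The per-question scores below are hypothetical, and the AUROC here is the standard rank-based (Mann–Whitney) formulation, not any code from the paper.

```python
def auroc(uncertainty: list[float], is_incorrect: list[int]) -> float:
    # AUROC of an uncertainty score as a detector of incorrect answers:
    # the probability that a randomly chosen incorrect answer receives
    # a higher uncertainty than a randomly chosen correct one
    # (ties count as 0.5).
    pos = [u for u, y in zip(uncertainty, is_incorrect) if y]
    neg = [u for u, y in zip(uncertainty, is_incorrect) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-question scores; TU is simply AU + EU.
au = [0.1, 0.1, 0.6, 0.2]
eu = [0.2, 0.8, 0.3, 0.1]
tu = [a + e for a, e in zip(au, eu)]
wrong = [0, 1, 1, 0]  # 1 = judged incorrect by the LLM judge

# Question 2 is confidently wrong: low AU but high EU, so AU alone
# ranks it as "safe" while TU correctly ranks it as risky.
print(auroc(au, wrong))
print(auroc(tu, wrong))  # higher: TU separates correct from incorrect better
```

In this toy setup the second question mirrors the overconfident-failure case the paper targets: adding EU moves it to the top of the uncertainty ranking, lifting the AUROC.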
Complementary Strengths of EU and AU
The results demonstrate that EU effectively complements AU by identifying cases where models are confidently wrong. When analyzing an aggregated dataset, the authors found that in the low-AU regime, where models appear internally confident, incorrect responses consistently showed higher EU than correct ones. This confirms that EU flags overconfident failures that AU alone misses, challenging prior assumptions that low-AU predictions are inherently reliable.
The effectiveness of EU varies depending on the characteristics of the task. The analysis shows that EU performs best on tasks with a single correct answer, where models tend to produce similar phrasing when correct but diverge when uncertain. In settings with high model agreement and low complementarity, such as translation and conversational question answering (QA), EU provides strong discrimination.
By contrast, EU is less informative for tasks that allow multiple valid responses, such as summarization, where variation across model outputs is expected rather than indicative of uncertainty.
TU consistently improves correctness calibration across all benchmarks compared to AU alone. The largest gains occur on complex reasoning tasks like HotpotQA and high-accuracy tasks like translation, with AUROC improvements of up to 0.15. TU also outperforms multiple baselines, including semantic entropy and self-consistency scores.
In selective prediction experiments, where models abstain from answering when uncertain, TU achieves lower risk across all coverage levels compared to AU. TU improves selective accuracy at fixed coverage rates and reduces the area under the risk-coverage curve by over 20% on certain tasks, confirming that combining both uncertainty types leads to more reliable abstention decisions.
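The selective prediction evaluation can be sketched as follows: answer questions in order of increasing uncertainty, and at each coverage level measure the risk (error rate) among the answers retained so far. The discrete area-under-the-curve estimate below is a simple mean of per-step risks, an illustrative simplification rather than the paper's exact metric.

```python
def risk_coverage_curve(uncertainty: list[float],
                        is_incorrect: list[int]) -> list[tuple[float, float]]:
    # Answer the most confident questions first; at coverage k/n the
    # risk is the error rate among the k retained (lowest-uncertainty)
    # answers. A good uncertainty score keeps risk low at low coverage.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    errors, curve = 0, []
    for k, i in enumerate(order, start=1):
        errors += is_incorrect[i]
        curve.append((k / len(order), errors / k))
    return curve

def area_under_rc(curve: list[tuple[float, float]]) -> float:
    # Discrete AURC estimate: mean of per-step risks (lower is better).
    return sum(risk for _, risk in curve) / len(curve)

uncertainty = [0.9, 0.1, 0.7, 0.2]  # hypothetical TU scores
wrong = [1, 0, 1, 0]
curve = risk_coverage_curve(uncertainty, wrong)
print(curve)  # risk stays 0 until the uncertain, wrong answers enter
print(area_under_rc(curve))
```

Here the wrong answers also carry the highest uncertainty, so risk is zero at low coverage and the AURC is small; an uninformative score would mix errors in at every coverage level and inflate the area.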
What This Means for Real-World LLM Deployment
In conclusion, this paper demonstrates that AU and EU capture complementary failure modes in language models. While self-consistency methods reveal data ambiguity, cross-model semantic disagreement uncovers uncertainty arising from model limitations. By combining both into TU using only black-box access to model outputs, the proposed approach consistently outperforms self-consistency-based methods across diverse models and tasks. However, the method has limitations. It struggles with tasks with multiple valid responses, relies on the quality of the model ensemble, and depends on a correctness judge. Future work should explore combining this approach with other uncertainty estimators and extending it to broader applications.
Sources:
Journal Reference
Hamidieh, K., Thost, V., Gerych, W., Yurochkin, M., & Ghassemi, M. (2026). Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification. OpenReview.
https://openreview.net/forum?id=lOoRJo8xWy
Zewe, A. (2026, March). A better method for identifying overconfident large language models. MIT News | Massachusetts Institute of Technology.
https://news.mit.edu/2026/better-method-identifying-overconfident-large-language-models-0319