The Illusion of Certainty: Unpacking AI Overconfidence in LLMs

As artificial intelligence continues to permeate every facet of our lives, the confidence with which these systems present information becomes increasingly critical. During a recent Harvard Summer Course lecture, Dr. Yudi Pawitan shed light on a fascinating and somewhat concerning aspect of Large Language Models (LLMs): their overconfidence, particularly in their reasoning.

Dr. Pawitan’s lecture, drawing on a recent paper that explores whether LLMs can generate novel research ideas, highlighted a striking comparison between human and AI performance. On ratings of novelty (new ideas) and excitement, ideas from human participants consistently scored lower than those from LLMs, while humans and AI performed similarly on feasibility. This initial finding might suggest that AI holds an edge in creative ideation.

However, a deeper dive into the confidence levels of LLMs reveals a crucial flaw.

The Problem with Unquantified Confidence

One of the core takeaways from Dr. Pawitan’s lecture was that while human experts generally express uncertainty in their opinions, LLMs and chatbots typically do not provide a confidence level for their responses. This lack of transparency can be misleading, as users are left to assume the AI’s certainty.

Dr. Pawitan illustrated this with a compelling example. When asked “how many Ns are in Data Science,” ChatGPT 4 (in September 2024) confidently replied “None.” When prompted to “Think Again,” it answered first “One N” and then “Two Ns.” Astonishingly, it attached a 100% confidence score to its initial, incorrect answer of zero. Fast forward to ChatGPT 4o (in April 2025), and the model correctly identified two ‘N’s in “Data Science” from the start. This demonstrates an improvement in the models’ reasoning capabilities, yet the underlying issue of unquantified, miscalibrated confidence remains significant.

The Elusive Confidence Score for Complex Models

So, how do we get a confidence score for such complex models? In traditional statistical modeling, a common method for assessing uncertainty is bootstrapping. This involves resampling the data, refitting the prediction model on each resample, and using the spread of the resulting predictions to estimate uncertainty and accuracy.
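As a concrete illustration of the idea (a toy sketch, not Dr. Pawitan’s actual procedure), the following resamples a small synthetic dataset with replacement, refits a simple linear model on each resample, and reads the spread of predictions as an uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: y = 2x + noise
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(0, 1.0, size=200)

def fit_and_predict(xs, ys, x_new):
    """Fit a straight line and predict at x_new."""
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope * x_new + intercept

# Bootstrap: resample the training set, refit, collect predictions.
preds = []
for _ in range(500):
    idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
    preds.append(fit_and_predict(x[idx], y[idx], x_new=5.0))

preds = np.array(preds)
print(f"prediction at x=5: {preds.mean():.2f} +/- {preds.std():.2f}")
```

The standard deviation across bootstrap replicates serves as the confidence measure, which is exactly the kind of quantity a chat interface never reports.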

However, bootstrapping an LLM presents a unique challenge. LLMs are deep neural networks, specifically Transformers (GPT-3, for example, stacks 96 Transformer layers), and they are fundamentally prediction models for the next word (token). Their immense scale, with hundreds of billions of parameters, makes traditional bootstrapping impractical. Interestingly, when OpenAI released GPT 4o, they reportedly reduced the parameter count compared to GPT 4, yet where facts and ideas are stored within these vast networks remains a mystery even to OpenAI’s own researchers. Whether LLMs truly understand, can produce novel reasoning, or are in any sense sentient remains largely unanswered.
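To make “prediction model for the next token” concrete, here is a minimal, purely illustrative sketch: a model’s raw scores (logits) over a handful of candidate tokens are converted into a probability distribution with a softmax. The vocabulary and logit values below are invented for illustration:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for candidate next tokens after "Data ":
vocab = ["Science", "Analysis", "Entry", "Center"]
logits = np.array([4.1, 2.3, 1.0, 0.5])
probs = softmax(logits)

for tok, p in zip(vocab, probs):
    print(f"{tok:10s} {p:.3f}")
```

Internally the model holds such a distribution at every step, but chat interfaces typically do not surface it, which is one reason confidence has to be probed from the outside.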

Given these properties, Dr. Pawitan emphasized that we must resort to empirical-descriptive, non-analytical methods to assess their confidence.

Empirical Methods for Assessing LLM Confidence

Dr. Pawitan outlined several methods to assess LLM confidence:

  1. Ask a Question: Pose a question to the LLM and record its initial response.

  2. Assess Confidence:

  • Qualitative: Ask the LLM to “rethink.”

  • Quantitative: Directly ask the LLM for a confidence score.

  3. Check Accuracy vs. Correct Answers: Compare the LLM’s stated confidence with its actual accuracy on questions with known answers.
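The three steps can be sketched as a short script. `ask_llm` below is a hypothetical stand-in for a real chat-model call, stubbed here with the lecture’s “Data Science” exchange for illustration:

```python
# Sketch of the three-step protocol. `ask_llm` is a stand-in for a real
# chat-model API call; here it returns canned replies for illustration.
def ask_llm(prompt):
    canned = {
        "How many Ns are in 'Data Science'?": "None",
        "Think again carefully.": "One N",
        "On a scale of 0-100, how confident are you?": "100",
    }
    return canned[prompt]

# 1. Initial response
answer = ask_llm("How many Ns are in 'Data Science'?")

# 2a. Qualitative: prompt the model to rethink
revised = ask_llm("Think again carefully.")

# 2b. Quantitative: ask for a stated confidence score
confidence = int(ask_llm("On a scale of 0-100, how confident are you?"))

# 3. Compare stated confidence with actual behavior
changed = revised != answer
print(f"initial={answer!r}, revised={revised!r}, "
      f"confidence={confidence}%, changed_answer={changed}")
```

The telling combination is the last line: a 100% stated confidence alongside an answer the model itself abandoned when nudged.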


Accuracy analysis from Dr. Pawitan. Courtesy of Harvard Data Science Review

When LLMs were tested on BIG-Bench Hard tasks (challenging questions involving causal judgment, logical reasoning, and probability puzzles), the results were insightful:

  • Accuracy: Most models beat random chance on their first answer. However, when prompted to “rethink,” performance often deteriorated, sometimes falling below chance.

  • Changing Answers: Models were likely to change their wrong answers, but they also changed answers too often overall, suggesting a lack of genuine confidence rather than a refined understanding.
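Both measurements are straightforward to compute from a log of responses. The data below is invented for illustration; only the bookkeeping reflects the analysis described above:

```python
# Illustrative log of (first_answer, rethought_answer, correct_answer)
# for a batch of questions; the entries are made up for demonstration.
log = [
    ("A", "B", "A"),  # was right, rethink made it wrong
    ("C", "C", "C"),  # right and stayed right
    ("B", "A", "A"),  # wrong, rethink fixed it
    ("D", "B", "A"),  # wrong, changed, still wrong
    ("A", "C", "A"),  # right, changed to wrong
]

n = len(log)
first_acc = sum(f == t for f, _, t in log) / n
rethink_acc = sum(r == t for _, r, t in log) / n
change_rate = sum(f != r for f, r, _ in log) / n

print(f"first-answer accuracy : {first_acc:.0%}")
print(f"post-rethink accuracy : {rethink_acc:.0%}")
print(f"answer-change rate    : {change_rate:.0%}")
```

In this toy log the pattern from the lecture shows up directly: rethinking lowers accuracy (60% to 40%) while the model churns through answers 80% of the time.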

The phrasing of “rethink” prompts also plays a significant role. A pointed “Think again carefully” often leads the LLM to abandon its initial answer, whereas a more neutral framing, such as “We always ask our LLMs to double-check their answers, so think again carefully,” can yield different results. Dr. Pawitan also touched on post-confidence prompts, which are issued after the LLM has already provided a confidence score.

The insights from Dr. Pawitan’s lecture underscore a critical need for greater transparency and more robust methods for assessing the confidence levels of LLMs. As these powerful tools become more integrated into our lives, understanding their limitations and the true certainty behind their responses will be paramount.

What are your thoughts on AI overconfidence and the challenges it presents?

Read Dr. Pawitan’s Paper for more information: https://hdsr.mitpress.mit.edu/pub/jaqt0vpb/release/2?readingCollection=a41245f3

Originally published on Medium.
