In Search of Hallucinations, Part 1

After reading the title I just wrote, I chuckled. If you use LLMs at all, you know that hallucinations find you; you don’t have to look for them! As powerful as these models have become–and they get more powerful all the time–they still find ways to make up information, leave out key items, twist the truth, and bend reality. (And convince me to use the two-em dashes “–” they love so much.) LLMs are probabilistic machines, spitting out one token at a time based upon previous tokens. They don’t “think,” so they don’t know when they have gone astray. But are there any internal clues that might tip us off before the damage is done?

If there were any obvious clues, researchers would have found them already. I didn’t expect to find a smoking gun, but I was too curious to leave this alone. There are open-source LLMs that provide access to their internals, published papers on this topic, and, of course, helpful LLMs to guide me through the journey. So I decided to investigate a little for myself.

Why Care About Hallucinations?

When large language models generate hallucinations, they become unreliable. A chatbot that confidently asserts incorrect medical information or a code assistant that generates subtly broken implementations creates real risks for users who trust these outputs. I recently had an LLM guide suggest a git branching process that would have lost the entire branch.

Current research indicates that hallucinations are mathematically inevitable in any system attempting to estimate probability distributions from finite training data. This means we cannot eliminate hallucinations entirely through better training or architecture changes. They are a fundamental property of these systems, so we have to learn to deal with them.

The vast majority of work addressing hallucinations has focused on post-generation detection: attempting to catch false outputs before presenting them to users (RAG verification systems, consistency checking methods, LLM-as-judge approaches). These methods work on the final text output, checking it against external knowledge bases or evaluating it with separate models. It’s like patching a parachute after you’ve already jumped; you’re better off inspecting it before you commit to the jump.

Recently, some research has begun exploring detection of hallucinations through analysis of internal transformer states during inference. Rather than analyzing only the output text, these approaches examine what happens inside the model as it generates—looking at attention patterns, hidden state activations, and other internal representations.

This approach is fascinating because if we could detect hallucinations as they form rather than after they appear in output, we might enable real-time intervention or at least provide confidence scores for individual generated tokens.

I. Investigation Ideas

Hallucination Severity

LLM hallucinations have been categorized into several distinct types based upon factors such as whether they contradict training data or context data. The most practically valuable but hardest to detect are subtle factual errors: plausible-sounding statements that contain small but important inaccuracies (sometimes called “confabulations” or “plausible hallucinations” in the literature). Examples could include: “Bell invented the telegraph” instead of the telephone, or “Einstein won the Nobel Prize for relativity” instead of the photoelectric effect.

However, for initial investigations, I was interested in whether I could find signs of any types of hallucinations, even in more obvious or less valuable scenarios. My reasoning: if you can’t detect clear, obvious failures, you’ll never detect the subtle ones. Start with cases where the model completely breaks down and see if anything is measurably different internally.

Are There Any Clues?

LLMs learn language patterns through attention mechanisms and massive amounts of training data. We understand some aspects of this process, such as how attention weights are computed and how gradients flow during training, but very little about how knowledge ends up being represented internally.

We have evidence that transformers learn linguistic structures like subject-verb agreement, syntactic dependencies, semantic relationships, and even some reasoning capabilities. Some of the best evidence for this internal knowledge is simply the high-quality, human-like text these models generate.

So might there be evidence that an LLM is unsure of its work?

Ideally, evidence of uncertainty would be as simple as lower probability outputs for false statements. Of course, an easy answer like this would have been noticed immediately. But there could be subtler clues in how probabilities are distributed across tokens—perhaps false statements show flatter probability distributions (higher entropy) while true statements have more peaked distributions (lower entropy), or perhaps patterns in how probabilities change across layers.

The question: when a model generates possibly made-up/unsure/false information, does the internal representation reflect a different state than when generating true information?

Testing With GPT-2 in 2025/2026?

Hosted models like GPT-4 and Claude are effectively black boxes for internal analysis: they offer superior performance, but their APIs don’t expose the internal mechanisms we need to study hallucination detection. Open-weight models like Llama do expose their internals, but they are far larger and more expensive to run. GPT-2 was released in 2019, so it performs terribly by modern standards. But it is smaller, faster, and readily available to test with. And it has one great advantage: it is fully observable on modest hardware. By setting attn_implementation="eager" in the code, we can extract complete attention patterns alongside the hidden states and logits during inference. That is all the data that might give us a clue as to what is going on!

This investigation accepted a fundamental trade-off: working with an older, less capable model in exchange for complete observability. Since the goal is to see if internal model states during inference might reveal detectable patterns when hallucinations occur, I needed to be able to analyze the data. Any patterns detected might not work for all types of hallucinations, might not appear in all cases, and the specific thresholds almost certainly wouldn’t transfer to modern models. But if I could find any systematic internal signature of hallucination in an observable model, it would demonstrate that such detection is theoretically possible and possibly give insight into more capable systems.

One more point both in favor of and against GPT-2: it hallucinates often enough that I was certain I would be able to witness hallucinations, and thus test for them, at some point. But the rate of hallucinations is so high (as you well know if you used it back in the day) that it may provide a poor signal-to-noise ratio.

Initial Testing Approach

The plan was to start examining the various values exposed during inference (generation of answers) to see if any patterns emerged through systematic data analysis. I would generate a variety of tests and see if I noticed any patterns across a range of analyses.

The investigation would focus on the attention heads. If you are familiar with the way transformers use attention, this will all make more sense. The TL;DR explanation: attention heads are the mechanism LLMs use to weigh relationships between words in a sequence. Each head has parameters that learn these patterns during training, and at inference time it produces attention weights that react to the patterns in the input. These weights are one of the most direct ways to see what the LLM is “thinking”.

Potential pattern 1: Short-range vs long-range attention conflict. Perhaps short-range attention patterns (focusing on adjacent or nearby tokens) conflict with long-range patterns (attending across longer stretches of the sequence). For example, if “Einstein discovered” and “discovered penicillin” are common bigrams (two-word pairs) in training data, short-range heads might favor completing them with high-frequency continuations. Meanwhile, long-range heads checking consistency against “Einstein” at the beginning might detect that “penicillin” is semantically inconsistent with the subject. If short-range patterns “win” the generation despite long-range heads signaling inconsistency, those losing consistency signals might remain visible in the attention patterns.
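One way to make this idea measurable is to split a head’s attention into short-range and long-range mass. The sketch below is only an illustration: the function name range_split, the window size of 2, and the toy attention rows are all assumptions, not values from the investigation.

```python
def range_split(attn_row, window=2):
    """Split one query token's attention into short- and long-range mass.

    attn_row[j] is the attention paid to position j by the query token,
    which is assumed to sit at the last index of the row. Positions within
    `window` steps of the query (including the query itself) count as short-range.
    """
    query_pos = len(attn_row) - 1
    short = sum(w for j, w in enumerate(attn_row) if query_pos - j <= window)
    long_ = sum(w for j, w in enumerate(attn_row) if query_pos - j > window)
    return short, long_

# Made-up attention rows for the final token of a 5-token sequence.
local_head = [0.05, 0.05, 0.10, 0.30, 0.50]   # mostly looks at nearby tokens
global_head = [0.60, 0.10, 0.10, 0.10, 0.10]  # mostly looks at the first token

short_local, long_local = range_split(local_head)
short_global, long_global = range_split(global_head)
print(short_local, long_local)
print(short_global, long_global)
```

Comparing these two numbers per head for a factual and a false statement would show whether long-range mass shifts when the object does not fit the subject.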

Potential pattern 2: Attention entropy differences. When the model is uncertain or conflicted, attention might become more diffuse (high entropy) rather than focused (low entropy). True statements might show confident, concentrated attention while false statements show scattered, uncertain patterns.

Entropy? In information theory, entropy measures the “spread” or uncertainty of a probability distribution. Low entropy (near 0) means almost all probability is concentrated on one outcome (certain). High entropy means probability is spread across many outcomes (uncertain). For attention patterns, high entropy means the model is attending broadly across many tokens rather than focusing on specific ones.
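A small worked example may help here. This is a minimal sketch of Shannon entropy in bits; the three distributions are invented to show the two extremes and one case in between.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution that sums to 1."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return sum(-p * math.log2(p) for p in probs if p > 0)

focused = [1.0, 0.0, 0.0, 0.0]      # all attention on one token
mixed = [0.7, 0.1, 0.1, 0.1]        # mostly one token, some spread
diffuse = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over four tokens

for name, dist in [("focused", focused), ("mixed", mixed), ("diffuse", diffuse)]:
    print(name, shannon_entropy(dist))
```

Each row of an attention matrix is a probability distribution over earlier tokens, so the same function can be applied to one row at a time.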

Potential pattern 3: Hidden state variance. Internal activations might show more variance or instability when processing inconsistent information compared to coherent, factual content.
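“Variance” can be read in several ways. The sketch below takes one possible reading, the spread of activation values within a token’s hidden vector at each layer. The function name and the tiny 4-dimensional vectors are assumptions for illustration only (real GPT-2 Medium vectors have 1024 dimensions).

```python
from statistics import pvariance

def activation_variance_by_layer(hidden_states):
    """Population variance of the activations in each layer's vector.

    hidden_states[layer] is the hidden vector for a single token at that layer.
    """
    return [pvariance(vector) for vector in hidden_states]

# Made-up two-layer examples for one token.
steady = [[0.5, 0.5, 0.5, 0.5], [0.4, 0.6, 0.4, 0.6]]
jumpy = [[0.5, 0.5, 0.5, 0.5], [0.0, 2.0, 0.0, 2.0]]

steady_var = activation_variance_by_layer(steady)
jumpy_var = activation_variance_by_layer(jumpy)
print(steady_var)
print(jumpy_var)
```

Another reasonable reading would track how much the vector changes from one layer to the next; the same per-layer loop would apply.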

Potential pattern 4: Head disagreement. Different attention heads might “disagree” more when processing false information: some heads attending to conflicting information sources while others focus elsewhere.
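Disagreement between heads needs a concrete distance measure. One candidate, chosen here as an assumption rather than taken from the investigation, is the average pairwise Jensen-Shannon divergence between the heads’ attention distributions for the same token.

```python
import math
from itertools import combinations

def kl_divergence(p, q):
    """KL divergence in bits; q must be positive wherever p is positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)

def head_disagreement(head_rows):
    """Mean pairwise JS divergence across the attention rows of several heads."""
    pairs = list(combinations(head_rows, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)

# Made-up attention rows over two earlier tokens.
agreeing = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
opposed = [[1.0, 0.0], [0.0, 1.0]]

print(head_disagreement(agreeing))
print(head_disagreement(opposed))
```

A higher score for false statements than for their factual twins would be the kind of signal this pattern predicts.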

My initial hypothesis wasn’t specific about which pattern would appear or whether any would–I really didn’t know. Instead, the approach was exploratory: instrument the model to extract all available internal state information, create test cases with known true and false statements, and analyze whether any systematic differences emerged.

First Test Design: Paired Factual/False Statements

Why this dataset: My initial idea for testing whether internal patterns differ between true and false statements was to create pairs that are structurally identical, differing only in their factual accuracy. This minimal difference isolates any changes in internal processing from confounding factors like sentence structure, word frequency, or semantic complexity.

What I tested: Can any measurable aspect of the model’s internal state distinguish factual statements from false ones when everything else is held constant?

The first test case was: “Einstein discovered relativity” vs “Einstein discovered penicillin.”

Both sentences have identical structure: [Subject] [verb] [object]. The only difference is whether the final object is factually consistent with the subject. Both completions are plausible English sentences—there’s nothing syntactically or semantically strange about the false version. The model has seen both “Einstein” and “penicillin” many times in training, and “discovered penicillin” is a reasonable phrase (Fleming discovered penicillin). The question is whether the model’s internal processing reveals that it “recognizes” at some level that Einstein and penicillin don’t belong together.
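The paired design can be captured in a small data structure so that both versions of a statement are always built from the same subject and verb. The class name StatementPair and its fields are hypothetical; the two pairs reuse examples already mentioned in this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StatementPair:
    """A factual statement and a false twin that differ only in the object."""
    subject: str
    verb: str
    true_object: str
    false_object: str

    @property
    def factual(self) -> str:
        return f"{self.subject} {self.verb} {self.true_object}"

    @property
    def false(self) -> str:
        return f"{self.subject} {self.verb} {self.false_object}"

pairs = [
    StatementPair("Einstein", "discovered", "relativity", "penicillin"),
    StatementPair("Bell", "invented", "the telephone", "the telegraph"),
]

for pair in pairs:
    print(pair.factual, "|", pair.false)
```

Keeping the pair together also makes it easy to check that both versions tokenize to the same length before comparing attention position by position.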

Technical Setup

Using GPT-2 Medium (345M parameters) in Python with attention extraction enabled led to straightforward code:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import numpy as np

# Load model and tokenizer
model_name = "gpt2-medium"  # 24 layers, 16 heads; other sizes: "gpt2", "gpt2-large"
model = GPT2LMHeadModel.from_pretrained(model_name, attn_implementation="eager")
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# IMPORTANT: Configure the model to output attentions and hidden states
model.config.output_attentions = True
model.config.output_hidden_states = True

# Set to evaluation mode
model.eval()

This configuration allows extraction of:

Attention weights from all 24 layers, each with 16 heads

  • At each layer, the model computes attention scores showing how much each token “pays attention to” each other token. With 16 heads per layer, each head can learn to focus on different patterns (some might track subject-verb relationships, others might track semantic coherence, etc.)

Hidden state activations after each layer

  • These are the internal vector representations after each layer processes the input. They capture what the model has “learned” about the text at that point in processing. Changes in these representations across layers show how understanding evolves.

Output logits for next-token prediction

  • The raw scores for each possible next token before they’re converted to probabilities. Higher logits indicate tokens the model considers more likely. The distribution of these scores (concentrated vs spread out) might indicate confidence.
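To make the step from logits to probabilities concrete, here is a minimal softmax sketch. The two four-entry logit lists are invented; a real GPT-2 logit vector has one entry per vocabulary token.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

peaked_logits = [8.0, 1.0, 0.5, 0.0]  # one token scores far above the rest
flat_logits = [1.2, 1.0, 0.9, 1.1]    # all tokens score about the same

peaked = softmax(peaked_logits)
flat = softmax(flat_logits)
print(max(peaked), max(flat))
```

The top probability is one crude confidence measure: it is close to 1 for the peaked case and close to uniform for the flat one.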

Although I looked at various layers over time, my analysis eventually focused on layer 22 of 24, based upon the assumption that early layers tend to process syntax and structure while late layers integrate semantic meaning. I measured attention from the final token to earlier tokens in the sequence. This direction was critical due to causal masking: GPT-2 can only attend backward in the sequence, so we examine how the final token (“relativity” or “penicillin”) attends back to “Einstein.”
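The effect of causal masking on which rows are worth reading can be seen in a toy example. This sketch assumes uniform raw scores for a 4-token sequence and is not GPT-2’s actual attention code; it only shows that row i can spread weight over positions 0 through i.

```python
import math

def causal_attention(scores):
    """Row-wise softmax over scores[i][j], with future positions (j > i) masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # token i may only look at itself and earlier tokens
        largest = max(visible)
        exps = [math.exp(s - largest) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores for 4 tokens
attn = causal_attention(scores)
first_row, last_row = attn[0], attn[-1]
print(first_row)
print(last_row)
```

Only the last row has a nonzero entry for every position, which is why the final token is the natural place to look for attention back to the subject.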

I ran the code in Kaggle online Jupyter notebooks. I am running the free version of Kaggle, which provides me with access to GPT-2 and the ability to run appropriate Python code. Testing inference is relatively easy. I could feed the strings into the already-trained LLM and then examine the data that it generated for the test text.

# Statement to analyze (run once for each statement in the pair)
text = "Einstein discovered relativity"

# Tokenize
inputs = tokenizer(text, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Get model outputs with attention
with torch.no_grad():
    outputs = model(**inputs)

# Extract attention weights
# outputs.attentions is a tuple of length num_layers
# Each element has shape [batch_size, num_heads, seq_len, seq_len]
attention_weights = torch.stack(outputs.attentions)
# Drop the batch dimension -> [num_layers, num_heads, seq_len, seq_len]
attention_weights = attention_weights[:, 0, :, :, :]

First Results

The factual and false statements showed measurably different attention patterns:

  • Overall attention difference: 1.1% (averaged across all heads and positions)

This might seem like a tiny difference, but attention weights are normalized to sum to 1.0 across all positions. A 1.1% shift means attention is being redistributed across the sequence—some positions getting more weight, others less. In a 5-token sequence, this could represent meaningful reallocation of focus.

  • Maximum head difference: 15.5% (single attention head showing largest divergence)

This is more interesting: one specific attention head showed a 15.5% difference in how much it attended from the final word back to “Einstein.” This suggests that at least some heads are processing these statements quite differently, even though the overall average difference is small. This head specialization—where individual heads show large differences even when the average is modest—might be the signal we’re looking for.

This seemed promising. Despite the sentences being structurally identical and differing only in the final word, the model showed systematic attention pattern differences. Specifically, some attention heads showed substantially different weights when attending from “penicillin” back to “Einstein” compared to from “relativity” back to “Einstein.”
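For readers who want to compute comparable numbers, here is a sketch of two such metrics: the mean absolute difference over all heads and positions, and the largest per-head difference in attention to one target position. It is a guess at a reasonable definition, not the code behind the 1.1% and 15.5% figures. The function name, the target_pos parameter, and the toy values are assumptions, and both statements are assumed to tokenize to the same length.

```python
def attention_differences(rows_a, rows_b, target_pos=0):
    """Compare final-token attention rows from two statements, head by head.

    rows_a[h][j] and rows_b[h][j] are the attention head h pays to position j.
    Returns (mean absolute difference over all heads and positions,
             largest absolute difference at target_pos for any single head,
             index of that head).
    """
    diffs = [[abs(a - b) for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(rows_a, rows_b)]
    flat = [d for row in diffs for d in row]
    mean_diff = sum(flat) / len(flat)
    per_head = [row[target_pos] for row in diffs]
    max_head = max(range(len(per_head)), key=per_head.__getitem__)
    return mean_diff, per_head[max_head], max_head

# Made-up attention rows for two heads over three positions.
factual_rows = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
false_rows = [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2]]

mean_diff, max_diff, max_head = attention_differences(factual_rows, false_rows)
print(mean_diff, max_diff, max_head)
```

In this toy case one head does not change at all while the other shifts sharply, so the average is small even though the maximum is large, which mirrors the shape of the result described above.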

The question: was this a real signal of hallucination detection, or just noise?  

So Many Things Can Go Wrong

Whether you’re coding or trying to test an idea, you know that so very many things can go wrong. You could have a bad initial idea (hypothesis) which could skew your approach. Your test data could be bad. Your approach (methodology) could be incorrect. Your implementation (code, in this case) could be buggy. Your analysis could be off.

My investigation was a fun and interesting attempt to dig into LLMs, and increase my understanding of them, but I was well aware from the start that it would be extremely difficult to get reliable, useful results. Like coding, investigations require good planning, careful execution, and thorough, correct testing. And a bit of luck.

The first results were interesting but I needed to verify the approach and replicate the results. In the next post I dig further into the tests I tried and the results I found. Coming soon!