In Search of Hallucinations, Part 2

In the last post I set up my investigation into hallucination detection. That was just the start. I wanted to follow up on the initial results to see whether they meant anything, and to try new tests that might either validate those early results or open new paths to pursue.

Moving Beyond a Single Example

The 15.5% attention difference I saw in one true/false pair of sentences was intriguing, but far from conclusive. A single example could easily be coincidence—maybe something about the specific words “relativity” and “penicillin” created this difference that had nothing to do with truth vs falsehood. To determine if this was a real signal of hallucination detection, I needed to test across multiple examples with different subjects, verbs, and objects. I had proven that I could get data from the system, and was ready to expand my tests.

Are Obvious Errors Too Obvious?

Why test plausible falsehoods rather than obvious ones? I had started with obviously false statements to see if I could get any signal at all, but had I gone too far? If I tested absurd statements like “Einstein discovered ice cream” or “Shakespeare wrote smartphone apps,” I might only be detecting the model’s recognition of semantic incongruity or anachronism rather than factual errors. Some hallucinations look like this, but more often they are plausible: they sound reasonable but are factually wrong.

I wanted to see whether the model’s internal measurements responded differently to pairs of statements that are nearly identical, except that one is true and the other is false.

Stepping Back to Explain the Setup

Let me step back. I wrote that I am testing the system with various sentences, but never described how I am doing that, so I should explain before continuing. Once I have a test idea I’d like to try, there are two steps: generating the test data and feeding it into the system.

Generating Test Data

To get various test sentences I, of course, use an LLM. I specify the type of test sentence (or sentence pair) and ask for as many as I like; the LLM will happily generate them. One gotcha with this process: the LLMs I’ve worked with will confidently hallucinate false sentences, or sentences that fail to meet the criteria!

When asked to generate true/true control pairs, Claude produced “Fleming discovered penicillin in 1927” paired with “Fleming discovered penicillin in 1928” — which is impossible since a discovery happens once. I caught this immediately.
In the same batch, “Darwin sailed on HMS Challenger” was presented as a valid true statement — it’s false. He famously sailed on the Beagle.

The irony here is that I am trying to find a way to detect hallucinations while fighting them off myself. I had to review every generated sentence carefully, and look up many of the facts, before I could trust my test data.

Feeding It To the System

Normal LLM usage involves giving the model a prompt and letting it generate tokens one at a time. For these first tests, that’s not what I’m doing. In order to control the LLM responses, I feed in complete sentences: “Einstein discovered penicillin”. The model processes the entire sequence token by token, generating internal responses, and I intercept the data it produces along the way: the attention weights, hidden states, and output logits. I’ll loosely call these “the parameters,” although strictly speaking they are values computed while the model runs, not the model’s trained weights. As the model processes the data I peek at these values to see if there are any giveaways I can detect. Think of it like watching someone’s facial expression as they read a sentence, rather than asking them to write one.

This animation shows the transformer processing each token through 24 layers, and the single point where I tap the internal state to extract my metrics. (You may need to go directly to YouTube to view this. I’m still learning their systems and don’t have it working smoothly yet.)

Setting up the Test

I had the idea for testing plausible pairs, so now I had to set up the test.

Design principle: Each pair needed to be structurally similar and semantically plausible. The false statement couldn’t be nonsensical; it had to be the kind of mistake a person might actually make or believe.

The dataset included pairs like:

“Bell invented the telephone” vs “Bell invented the telegraph”
- Both are 19th-century inventions, and both are communication technologies. The false version is plausible, since Bell did work on telegraphy, but it is incorrect.
“Marie Curie discovered radium” vs “Marie Curie discovered DNA”
- Both are major scientific discoveries by famous scientists. DNA is real science, making the false version plausible to someone without detailed knowledge.
“The Nile is the longest river” vs “The Amazon is the longest river”
- This one is interesting because it’s actually debated depending on measurement criteria, but by most standards, the Nile is considered longer.

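Each pair maintained structural similarity while varying factual accuracy. This held sentence structure constant and kept both versions plausible, so that factual accuracy was the main thing that changed. It does not fully control for word frequency, though: the swapped words (“telephone” vs “telegraph”, “radium” vs “DNA”) are not equally common, which matters when interpreting the results.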
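The “structurally similar” requirement can be checked mechanically before any sentence reaches the model. The sketch below is only an illustration: `StatementPair` and `differing_positions` are hypothetical names, and it compares whitespace-separated words, which is an assumption (the model’s tokenizer may split words differently).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StatementPair:
    """One true statement and its plausible-but-false counterpart (hypothetical helper)."""
    true_text: str
    false_text: str

    def differing_positions(self) -> List[int]:
        """Return the word positions where the two statements differ.

        Assumption: words are split on whitespace, not by the model's tokenizer.
        """
        true_words = self.true_text.split()
        false_words = self.false_text.split()
        if len(true_words) != len(false_words):
            raise ValueError("pair is not structurally aligned: word counts differ")
        return [i for i, (t, f) in enumerate(zip(true_words, false_words)) if t != f]


# The three example pairs quoted above.
PAIRS = [
    StatementPair("Bell invented the telephone", "Bell invented the telegraph"),
    StatementPair("Marie Curie discovered radium", "Marie Curie discovered DNA"),
    StatementPair("The Nile is the longest river", "The Amazon is the longest river"),
]

for pair in PAIRS:
    # A well-formed pair differs in exactly one word position.
    assert len(pair.differing_positions()) == 1
```

A check like this only confirms the structure. Whether the “true” half is actually true still has to be verified by hand.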

Expanding the Analysis: Multiple Structural Metrics

Finding a difference in one metric could be coincidence, a statistical fluke, or a spurious correlation specific to how I calculated that particular measure. Testing multiple independent measures of internal model state would increase confidence that any signals were real and not artifacts of a single measurement approach.

Beyond simple attention pattern comparisons, I expanded the code to consider several different ways to characterize what was happening inside the model:

Attention entropy: Shannon entropy measuring how concentrated vs dispersed the attention distribution is. The formula is H = -Σ(p × log p), where p is the attention weight for each position. Low entropy means attention is focused on a few tokens; high entropy means attention is spread widely across many tokens.
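A minimal sketch of the entropy formula for a single attention distribution. The function name is hypothetical, the natural logarithm is an assumption (the formula above does not fix a base), and this is not necessarily how the experiment’s own code computes it.

```python
import math


def attention_entropy(weights):
    """Shannon entropy H = -sum(p * log p) of one attention distribution.

    `weights` is a list of non-negative attention weights over token positions.
    They are renormalised so they sum to 1; zero weights contribute nothing.
    Assumption: natural log, so the result is in nats.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


focused = [0.97, 0.01, 0.01, 0.01]   # made-up example: almost all attention on one token
spread = [0.25, 0.25, 0.25, 0.25]    # made-up example: attention spread evenly

# The focused distribution has lower entropy than the evenly spread one.
assert attention_entropy(focused) < attention_entropy(spread)
```

For an even spread over n tokens the entropy reaches its maximum of log(n), so values from sentences of different lengths are not directly comparable without normalising.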

Attention concentration (Gini coefficient): Borrowed from economics where it measures wealth inequality, the Gini coefficient quantifies how unequally attention is distributed. A value near 0 means attention is evenly distributed; near 1 means attention is highly concentrated on just a few tokens. This provides a different perspective on concentration than entropy—they’re related but not identical measures.
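As an illustration, here is one standard way to compute a Gini coefficient, applied to a single attention distribution. The function name is hypothetical, and the experiment’s code may use a different (equivalent or normalised) formula.

```python
def gini_coefficient(weights):
    """Gini coefficient of a list of non-negative attention weights.

    Uses the sorted-values formula:
        G = (2 * sum(i * x_i)) / (n * sum(x)) - (n + 1) / n
    with x sorted ascending and i running from 1 to n.
    """
    values = sorted(weights)
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one weight and a positive sum")
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


even = [0.25, 0.25, 0.25, 0.25]      # made-up example: perfectly even attention
one_hot = [0.0, 0.0, 0.0, 1.0]       # made-up example: all attention on one token

assert gini_coefficient(even) < gini_coefficient(one_hot)
```

With this formula the maximum for n tokens is (n - 1) / n, so a fully concentrated distribution over four tokens scores 0.75, and the value only approaches 1 for longer sequences.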

Head disagreement: Variance between different attention heads within the same layer. If all 16 heads in a layer show similar attention patterns, variance is low (agreement). If heads attend to very different tokens, variance is high (disagreement). The hypothesis: false statements might cause heads to “disagree” more as different heads detect different inconsistencies.
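“Variance between heads” can be defined in more than one way. The sketch below shows one plausible reading, assumed for illustration: for each token position, take the variance of the attention weight across heads, then average over positions. The function name and the toy inputs are hypothetical.

```python
from statistics import pvariance


def head_disagreement(head_weights):
    """Mean across-head variance of attention weights.

    `head_weights` is a list with one entry per head; each entry is that head's
    attention distribution over the same token positions.
    Assumption: disagreement = per-position variance across heads, averaged.
    """
    n_positions = len(head_weights[0])
    if any(len(head) != n_positions for head in head_weights):
        raise ValueError("all heads must cover the same token positions")
    per_position = [
        pvariance([head[j] for head in head_weights])
        for j in range(n_positions)
    ]
    return sum(per_position) / n_positions


# Made-up examples with three heads over three tokens.
agreeing = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]
split = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

assert head_disagreement(agreeing) < head_disagreement(split)
```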

Hidden state variance: Measuring the variance of activation values across the hidden dimensions. High variance might indicate internal instability or conflict—the network activations fluctuating more when processing inconsistent information compared to coherent, factual content.
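A short sketch of this measurement, assuming the variance is taken across the hidden dimensions of each token’s vector and then averaged over tokens. That averaging step, like the function name, is an assumption made for illustration.

```python
from statistics import mean, pvariance


def hidden_state_variance(hidden_states):
    """Average, over tokens, of the variance across hidden dimensions.

    `hidden_states` is a list with one activation vector (list of floats) per token.
    Assumption: population variance per token, then a plain mean over tokens.
    """
    if not hidden_states:
        raise ValueError("need at least one token vector")
    return mean(pvariance(vector) for vector in hidden_states)


# Made-up activations: a flat vector versus one that swings between extremes.
flat = [[0.5, 0.5, 0.5, 0.5]]
swinging = [[1.0, -1.0, 1.0, -1.0]]

assert hidden_state_variance(flat) < hidden_state_variance(swinging)
```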

Each metric captured a different aspect of internal processing, providing multiple views into whether true and false statements created different signatures.

Help! Those Metrics are Confusing!

I tried to come up with various ways of explaining these metrics. It’s not easy to describe something deeply technical such as how attention varies as tokens are processed, but here is the best analogy I’ve come up with. If the system is a jury trying to look at the evidence, here is how the metrics would compare:

Original attention patterns: Which specific pieces of evidence each juror keeps returning to — juror 4 keeps picking up the photograph, juror 9 keeps rereading the timeline, juror 2 keeps glancing at the defendant’s alibi document.

Entropy: Whether each juror has a clear conviction forming or is still all over the map after deliberation.

Gini coefficient: Whether one overwhelming piece of evidence is driving the verdict, or twelve small factors are each nudging it slightly.

Head disagreement: Whether the jury is reaching consensus or deeply split — unanimous vs. hung.

Hidden state variance: How much individual jurors’ positions are fluctuating as new arguments come in — settled or still shifting.

Hopefully you can see that all of the different measurements provide insight, but none of them is the whole picture. As you read on through the tests I conducted you will find that sometimes these measurements agreed, but sometimes they disagreed, leading to challenges in determining what the LLM is ‘thinking’.

These metrics are easier to understand visually — the animation below walks through each one. (Again, you may need to go to YouTube to view this.)

Results: A Surprising Split

When I ran the analysis across the plausible false pairs, something unexpected emerged. The results didn’t uniformly show signals or uniformly show nothing—they split in an interesting way:

Structural signals were statistically significant, though what they actually mean is not yet well supported:

Attention concentration (Gini coefficient): p < 0.001
Hidden state variance: p < 0.001

These highly significant p-values (much less than 0.05) suggested that the structural properties of how the model processed information differed systematically between true and false statements.

But attention patterns showed no signal:

Direct attention pattern comparison: p = 0.994

This was surprising. The specific patterns of which tokens attended to which other tokens, the original focus of my testing on short-range vs long-range attention conflicts, didn’t reliably distinguish true from false statements at all. A p-value near 1.0 means the test found no evidence of a difference. That is not proof that no difference exists, but in this data the two groups looked essentially the same.
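The post does not say which statistical test produced these p-values. For paired data like this, one common and assumption-light option is a sign-flip permutation test, sketched below. The function name is hypothetical and the numbers in the example are made up. They are not the experiment’s measurements.

```python
import random


def paired_permutation_p(true_vals, false_vals, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test for paired measurements.

    Each pair contributes one difference (false - true). Under the null
    hypothesis the sign of each difference is arbitrary, so signs are flipped
    at random and the resulting mean differences are compared with the
    observed one. Returns an estimated p-value.
    """
    if len(true_vals) != len(false_vals) or not true_vals:
        raise ValueError("need two equal-length, non-empty lists")
    diffs = [f - t for t, f in zip(true_vals, false_vals)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance so ties are not lost to floating-point rounding.
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up metric values for 12 hypothetical pairs: the false version is
# consistently a little higher than the true version.
true_metric = [0.30, 0.32, 0.28, 0.35, 0.31, 0.29, 0.33, 0.30, 0.34, 0.27, 0.31, 0.32]
false_metric = [t + 0.05 for t in true_metric]

p_value = paired_permutation_p(true_metric, false_metric)
assert p_value < 0.01
```

A small p-value from a test like this says the difference between the paired groups is unlikely to be a chance arrangement of signs. It says nothing about why the groups differ.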

Back to the Jury

Going back to the jury analogy one more time, what the data actually showed was this: in false statements, the verdict seemed to be driven by one overwhelming piece of evidence rather than many small ones (Gini), and individual jurors were shifting their positions more as deliberation continued (hidden state variance). Something was detectably different. But when I looked at which specific documents each juror kept returning to — juror 4 and the photograph, juror 9 and the timeline — that pattern looked essentially identical between true and false statements. The jury was behaving differently, but not in the way I had expected or been watching for.

Making Sense of the Contradiction

How could structural properties differ significantly while attention patterns didn’t? This required rethinking what the model was actually doing.

The structural signals (concentration and variance) might be detecting “something unusual in processing” but not specifically “something false.” When the model processed the false statements, something changed in how concentrated the attention was and how much the hidden states varied, but this didn’t manifest as specific attention heads pointing to specific inconsistencies.

Think of it this way: if this were all about attention conflicts, I would have seen something like “in false statements, Head 3 attends strongly to the subject while Head 7 attends strongly to the object, but in true statements they agree.” That’s not what happened. Instead, what I saw was more like “in false statements, attention is overall slightly more concentrated and hidden states vary more, but I can’t point to specific heads or specific attention patterns that are different.”

This suggests the model processes these statements differently at a high level, but the differences aren’t captured in the specific pattern of which tokens attend to which other tokens. The signals might exist, but not where I originally thought they would.

Detecting Differences

The structural signals indicated the model was possibly processing false statements differently—perhaps with slightly more internal “effort” or slightly different information flow patterns. But this could mean many things:

The model recognizes factual inconsistency
The model is processing less common word combinations
The model has weaker training signal for this particular statement
The model is uncertain for reasons unrelated to truth value

If the signal just detects “unusual,” it might fire on rare-but-true statements as much as false ones. For hallucination detection to be most useful, it needs to specifically identify false content, not just uncommon content.

To this point I’d encountered interesting but somewhat contradictory results. They were enough to encourage me to keep searching to see if I could make anything of all of this. I will continue the investigation in the next post.