Testing Failure Modes
If you read through the first couple of posts (Part 1, Part 2) on this topic, you know that I was curious to see if there were any hints in an LLM when it began hallucinating. On the one hand, since it is a probability machine, it seemed like there should be some clue in the data that a hallucination was less supported than a true conclusion. On the other hand, if such evidence were so easy to find, it would have already been discovered. Still, I was full of ideas I wanted to try and had a system I could use to experiment with, so I continued forward.
The plausible pairs experiment showed structural differences but did not show the attention conflict I hoped for. I wanted to test several other ways that hallucinations might manifest in internal model states. This post documents the experiments that didn’t produce significant results, but even the dead ends offered some insight.
Before digging into the specifics of testing, I want to share a little bit about using an LLM as a test partner–I could easily produce a post on just this topic, but here are my initial impressions.
A Talented but Flawed Research Partner
I was using a combination of LLMs (mostly Claude) to assist me with this work, with mixed results. For instance, the various experiments often required contrasting sentence pairs that I could use to compare differences in internal LLM responses. LLMs can easily generate sentence pairs, but they still struggle to produce factually correct ones reliably. Since test data absolutely must be correct, this left quite a bit of validation work for me.
One general point that continues to surprise me: LLMs can generate correct code to perform analyses and calculations far more reliably than they can perform those same operations themselves! For example, I used Claude to generate analysis code to calculate values like entropy and Gini score, because when I fed the raw data directly into the LLM it could not consistently produce reliable results. Of course, I verified that the LLM-generated algorithms were correct by feeding in test data with known results.
I would posit this as a best practice with LLMs at this point in time: for many repetitive data-processing tasks, you are better off using the LLM to generate appropriate code than relying on the LLM to process the data itself. Test the generated code to verify correctness. Traditional software performs reliably and consistently; LLMs generate tokens probabilistically.
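To make the validation step concrete, here is a minimal sketch of the kind of analysis code this workflow produces. The function names and the sanity-check distributions are my own illustration, not the original experiment code; the point is that both metrics can be verified against distributions with known answers before touching real data.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini_impurity(probs):
    """Gini impurity: 0 for a one-hot (certain) distribution,
    approaching 1 as probability mass spreads across many tokens."""
    return 1.0 - sum(p * p for p in probs)

# Sanity checks against known results, mirroring the validation step above:
uniform = [0.25, 0.25, 0.25, 0.25]   # maximum uncertainty over 4 tokens
one_hot = [1.0, 0.0, 0.0, 0.0]       # complete certainty

assert abs(shannon_entropy(uniform) - 2.0) < 1e-9   # log2(4) = 2 bits
assert abs(gini_impurity(uniform) - 0.75) < 1e-9    # 1 - 4 * 0.25^2
assert shannon_entropy(one_hot) == 0.0
assert gini_impurity(one_hot) == 0.0
```

If the generated code passes checks like these, you can trust it on thousands of real distributions in a way you could never trust the LLM doing the arithmetic token by token.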
As you read through this post you’ll see that I tried many different types of tests. I found it quite useful to have a combination of LLMs analyze and discuss the results; it forced me to consider multiple possible interpretations.
All of this leads into an entire category of issues with LLMs: They are guided to be so helpful and positive that they rarely provide unbiased feedback.
Reining in the AI
By now, most people who use LLMs extensively are aware that a ‘System Prompt’ influences the model to be helpful and enthusiastic. LLM behavior is also pushed in this direction by a combination of Supervised Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF), resulting in a sycophantic, overly enthusiastic, verbose assistant.
During these experiments, this manifested as the most optimistic research partner possible. With every successful test, Claude would gush, using italics and exclamation points, about the importance of my tiny experiments and urge me to quickly publish the results. With every failure, Claude would excuse the results and suggest another test that would surely confirm previous successes and put me back on track for the Turing Award.
For everyday use, this is a minor frustration; for research, it is actively counterproductive. I spent significant time trying to counteract this behavior through the few levers available.
Claude.ai offers a global Personal Preferences setting that persists across conversations and is supposed to influence Claude’s responses in all of them. A few of mine are:
- Keep your responses fairly brief and to the point
- Don’t flatter and praise
- Critique my input reasonably
Claude Projects
While global preferences help at the margins—producing slightly shorter, more skeptical responses—the underlying enthusiasm for mediocre results proved hard to fully suppress. Beyond these static settings, I also relied on Project Contexts to provide the model with the necessary background and constraints for this specific research.
Claude uses Project Contexts as a place where users can store common information for a related series of threads. Given that LLM conversations are always limited by context size, this helps to maintain consistency and share background material. Project Contexts can also be used to provide additional behavioral guidance directly to the LLM in a given project. For instance, these are some of the instructions I provided in the Project Instructions:
- Don’t jump to conclusions, try to be reasoned and logical
- When results are bad, think of alternate ways to approach an issue
- Be relatively concise in your responses
However, large project contexts can also create confusion: as the project context grew, Claude would sometimes pull in tangentially related material, misidentify which document it should use, and either ignore or fail to find key information. Getting the right level of detail into the project context proved to be a moving target.
Ultimately, the most reliable approach was in-conversation prompting: “be concise,” “don’t generate code yet,” and “we need more data before drawing conclusions.” This was effective but frustrating: you should not have to repeatedly tell a research partner to be skeptical of thin results.
Moving Forward with Experiments
Guided by the ideas I wanted to explore and my overly exuberant AI, here are some of the avenues I pursued. I’ve summarized them concisely–if you’ve read the previous posts, you should have a good idea of how I went about these tests.
If you get inspired and want to test out variants of these ideas or some of your own, check out my GitHub repository containing the source for these experiments.
The Tests
These first two tests came from the same intuition: if the model “knows” something is wrong, then that conflict should leave a trace somewhere in the internal states. The approaches differed, but the underlying hope was the same.
Forced False Completions
- The test: Force the model to complete a sentence with a factually wrong token (“Einstein discovered penicillin”) and look for uncertainty signals in the internal states that wouldn’t appear with a correct completion.
- The results: No signal. The model processed the forced wrong token the same way it processed a correct one. No detectable conflict in attention patterns, entropy, or hidden states.
This failed for a straightforward reason: the model is an autoregressive pattern matcher. Given a sequence of tokens, it processes that sequence — it has no mechanism to flag that one of those tokens was externally imposed and factually wrong. There is no “I know this is wrong” signal to find.
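For readers who want to see what “forcing” a completion looks like mechanically, here is a hedged sketch using Hugging Face Transformers. The model choice (gpt2), the sentence pair, and the function name are my own illustrative assumptions, not the original experiment code. The full sequence, including the imposed wrong token, is fed to the model in a single teacher-forced pass, and an internal signal (per-position next-token entropy) is read back for comparison.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small open model used purely for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def per_position_entropy(text):
    """Entropy (bits) of the model's next-token distribution at each position,
    computed from one teacher-forced pass over the full text."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                    # (1, seq_len, vocab)
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log2(probs.clamp_min(1e-12))).sum(dim=-1)
    return ent.squeeze(0).tolist()

true_ent = per_position_entropy("Alexander Fleming discovered penicillin")
false_ent = per_position_entropy("Albert Einstein discovered penicillin")
# In these experiments the two profiles looked essentially the same:
# the model processes an imposed wrong token like any other context token.
print(true_ent[-1], false_ent[-1])
```

The same pass also exposes attention maps and hidden states (via `output_attentions=True` and `output_hidden_states=True`), which is where the other null comparisons came from.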
Continuation Cascade Testing
- The test: Start with a false premise, let the model continue generating, and look for cumulative degradation — if maintaining a false premise across multiple tokens compounds internal uncertainty, later tokens should show it.
- The results: No cascading effects. Subsequent tokens showed the same internal patterns regardless of whether the premise was true or false.
The more interesting takeaway here is what the model actually does: it treats the false premise as given context and generates coherently from it. “Einstein discovered penicillin. This breakthrough in…” proceeds smoothly because the model is optimizing for coherence, not truth. That’s not a bug in these tests; it’s an illustration of how the architecture works: the model constructs likely tokens from the provided context, whether that context is factually accurate or not.
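The cascade hypothesis reduces to a simple statistical question: does per-token uncertainty trend upward over a continuation from a false premise? A minimal sketch of that check (the function name and the series values are hypothetical, for illustration) is a least-squares slope over the uncertainty series:

```python
def uncertainty_slope(values):
    """Least-squares slope of a per-token uncertainty series.
    A genuine 'cascade' from a false premise would show up as a
    consistently positive slope across many continuations."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

rising = [1.0, 1.2, 1.5, 1.9, 2.4]   # hypothetical compounding uncertainty
flat   = [1.4, 1.3, 1.5, 1.4, 1.4]   # hypothetical no-trend series

assert uncertainty_slope(rising) > 0.2
assert abs(uncertainty_slope(flat)) < 0.1
```

In practice the observed series looked like the flat case regardless of whether the premise was true or false.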
The next two tests shifted from manipulating inputs to analyzing outputs. If internal manipulation wasn’t revealing anything, maybe the model’s own probability outputs would.
The Probability-Confidence Assumption
- The test: Compare the model’s output probability for true vs. false statements. The intuition: the model should be more confident about correct completions than incorrect ones.
- The results: 35% of false statements had higher model probability than their true counterparts. Confidence and correctness were not reliably correlated.
This was expected — if output probability reliably detected hallucinations, someone would have found it already. But it’s worth documenting because it has an uncomfortable implication: the most dangerous hallucinations may be the high-confidence ones, and probability thresholds won’t catch them.
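The comparison itself is straightforward to sketch. Each statement is scored by its total log-probability under teacher forcing; the per-token probabilities below are hypothetical numbers, chosen to show the failure mode described above, where a fluent falsehood out-scores a clumsier truth:

```python
import math

def sentence_log_prob(token_probs):
    """Total log-probability of a sentence, given the probability the
    model assigned to each of its tokens during teacher-forced scoring."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for one contrasting pair:
true_stmt  = [0.60, 0.30, 0.45, 0.20]   # the factually true statement
false_stmt = [0.70, 0.50, 0.55, 0.40]   # the fluent false counterpart

# The false statement scores higher — confidence is not correctness.
assert sentence_log_prob(false_stmt) > sentence_log_prob(true_stmt)
```

Pairs like this hypothetical one made up roughly a third of the test set, which is exactly why a simple probability threshold cannot serve as a hallucination detector.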
Testing Probability Distribution Shape
- The test: Rather than the top probability value, examine the shape of the whole distribution. Maybe false statements show flatter distributions (uncertainty spread across many tokens) while true ones show sharp peaks.
- The results: No consistent pattern. True and false statements produced overlapping distributions with no reliable distinguishing shape.
The likely reason: in this specific case, the distribution shape reflects uncertainty about the next token, not about whether the current content is accurate. Those are different questions.
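The flat-versus-peaked hypothesis can be quantified with a couple of simple shape statistics. This sketch (function name and example distributions are my own, for illustration) computes normalized entropy, which runs from 0 for a single sharp peak to 1 for a perfectly flat distribution, plus the margin between the two most likely tokens:

```python
import math

def shape_metrics(probs):
    """Shape statistics for a next-token distribution:
    - norm_entropy: 0 (one sharp peak) .. 1 (perfectly flat)
    - margin: gap between the two most likely tokens"""
    ent = -sum(p * math.log2(p) for p in probs if p > 0)
    norm_ent = ent / math.log2(len(probs))
    top = sorted(probs, reverse=True)
    return {"norm_entropy": norm_ent, "margin": top[0] - top[1]}

peaked = [0.85, 0.05, 0.05, 0.05]   # sharp peak: low entropy, large margin
flat   = [0.25, 0.25, 0.25, 0.25]   # flat: maximal entropy, zero margin

assert shape_metrics(flat)["norm_entropy"] == 1.0
assert shape_metrics(flat)["margin"] == 0.0
assert shape_metrics(peaked)["norm_entropy"] < 0.6
assert shape_metrics(peaked)["margin"] > 0.7
```

Both true and false statements produced the full range of these values, which is what “overlapping distributions with no reliable distinguishing shape” meant in practice.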
Paraphrasing Tests
- The test: Run the same factual and false content through multiple phrasings to check whether earlier structural signals were detecting meaning or just sentence structure.
- The results: Inconclusive. Signals varied with phrasing, suggesting at least partial sensitivity to structure rather than content. The sample size was too small to say more.
This one mattered less as a standalone result and more as a caution about the earlier plausible pairs findings — phrasing differences between true and false pairs may have been a confounding factor all along.
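One way to sketch this confound check: compare the gap between true and false versions of a statement against the spread of the same signal across paraphrases of a single version. All values below are hypothetical, for illustration; if the paraphrase spread swamps the true/false gap, the signal is tracking phrasing rather than content.

```python
from statistics import mean, pstdev

def structure_sensitivity(true_scores, false_scores):
    """Compare the true/false gap to the spread across paraphrases.
    Returns (gap, spread); a spread comparable to or larger than the
    gap suggests the signal tracks sentence structure, not truth."""
    gap = abs(mean(true_scores) - mean(false_scores))
    spread = max(pstdev(true_scores), pstdev(false_scores))
    return gap, spread

# Hypothetical signal values across paraphrases of one true and one false statement:
true_paraphrases  = [0.42, 0.58, 0.35, 0.61]
false_paraphrases = [0.47, 0.52, 0.39, 0.55]

gap, spread = structure_sensitivity(true_paraphrases, false_paraphrases)
assert spread > gap   # phrasing moves the signal more than truth does
```

With only a handful of paraphrase sets, this check could flag the confound but not settle it, which is why the result stays filed under “inconclusive.”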
Do Artificial Scenarios Produce Genuine Hallucinations?
These experiments shared a fundamental assumption: that model uncertainty could be created by forcing the model to generate or process falsehoods.
- In forced completions, the model wasn’t actually uncertain — it simply processed imposed tokens normally.
- In continuation cascades, it treated the false premise as given context rather than flagging it as an error.
- The probability analysis measured prediction confidence, not factual accuracy.
- And the paraphrasing results may have been confounded by structural differences between sentence pairs.
None of these tests captured what happens when a model organically struggles with uncertain information and generates false content as a result of that struggle. They were all examining scenarios where:
- I imposed false content on the model.
- The model processed that false content as ordinary context.
- I measured output probabilities that don’t actually reflect whether current content is accurate.
This realization led to a new approach: instead of analyzing how the model processes false statements that sound plausible, I wanted to find tasks where the model would genuinely fail—where it would generate wrong answers not because false patterns were forced into it, but because it simply cannot compute the correct answer.
That’s what led me to switch to math problems in the next phase. LLMs struggle to perform even relatively simple operations accurately, such as arithmetic on six-digit numbers, yet they consistently claim that their results are correct. Keep an eye open for part 4, coming in the next couple of weeks.