LLMs Don’t Work the Way You Think They Do, Part 2

In the previous post I shared a story about my attempts to produce guitar tabs with Claude, Anthropic’s LLM. My initial pleasure in generating dozens of practice tabs gave way to frustration as I tried to convince Claude to fix existing errors and avoid the same mistakes when generating new tabs.

But my attempts to guide, instruct, or even command Claude to do as I wanted were doomed to fail. The fluency, power, and versatility of LLMs might lead you to believe that they can do anything. They come across as helpful, cheerful, and interested, so it seems like they want to solve your problems. The reality is quite different.

In my previous post I debunked common misconceptions about LLMs, but I didn’t explain how they actually work. I’ll start this post by filling in that gap, then provide more context on LLM processing, its limitations, and the implications. Hopefully this information will help you understand why Claude failed at tab generation and would likewise struggle with a wide range of applications!

How LLMs Work

LLMs are prediction machines, like autocomplete on steroids. They have been trained on massive amounts of data to figure out the next most probable word. In some of my other recent posts (CoT, Positional Encoding) I dove deep into the (very cool) technologies that form the basis of LLMs. Here I’ll stick to higher-level descriptions and analogies.

The Basic Process: LLMs work by predicting the most likely next “token” (roughly a word or part of a word) based on all the tokens that came before. They do this by calculating probabilities across their entire vocabulary for each position.

Training Creates the Probabilities: During training, the system analyzed billions of text examples, learning patterns like:

  • “The cat sat on the…” → “mat” (high probability)
  • “Please write a…” → “letter/email/story” (various high probabilities)
  • “2 + 2 =” → “4” (very high probability)

Generation is Probability Sampling: When generating text, the LLM:

  1. Looks at all previous tokens in the conversation (the context)
  2. Calculates probabilities for every possible next token
  3. Samples from these probabilities (with some randomness)
  4. Adds the chosen token and repeats
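
Here is a minimal sketch of that loop in Python. Everything in it is made up for illustration: the “model” is a hard-coded probability table rather than a neural network, but the four steps are the same.

```python
import random

# Toy "model": the vocabulary and probabilities below are invented for
# illustration. A real LLM computes a distribution over tens of thousands
# of tokens using a neural network.
def next_token_probabilities(tokens):
    if tokens[-3:] == ["sat", "on", "the"]:
        return {"mat": 0.6, "sofa": 0.25, "roof": 0.15}
    return {"the": 0.4, "cat": 0.3, "sat": 0.2, "on": 0.1}

def generate(prompt_tokens, max_new_tokens=1):
    tokens = list(prompt_tokens)                              # step 1: start from the context
    for _ in range(max_new_tokens):
        probs = next_token_probabilities(tokens)              # step 2: score every candidate token
        candidates, weights = zip(*probs.items())
        choice = random.choices(candidates, weights=weights)[0]  # step 3: sample with some randomness
        tokens.append(choice)                                 # step 4: append and repeat
    return " ".join(tokens)

print(generate(["The", "cat", "sat", "on", "the"]))
```

Run it a few times and you will occasionally get “sofa” or “roof” instead of “mat”. That built-in randomness is part of why identical questions can produce different answers.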

As pointed out above, LLMs can perform basic math only because they have seen similar equations (2 + 2 = 4) in their training data. Ask an LLM to add two 10-digit random numbers you make up. Those numbers will not have been ‘trained into’ it. The LLM will either fail the test or need to access additional external functionality (such as a math plug-in) to accomplish this.

I mentioned that LLMs are somewhat like autocomplete, on steroids. Remember how bad autocomplete was a few short years ago? At the time it would always return the same (probably useless) suggestions. Now we expect autocomplete to remember common words we use, be savvy about the conversation context, and understand proper sentence structure. LLMs are far better: they have incredibly deep training on word meanings across a huge number of contexts, and they can analyze your entire chat input to come up with a series of ‘autocomplete’ suggestions so compelling that they seem to be talking with you.

This process might seem similar to human thinking. We learn patterns from experience, make quick decisions based on incomplete information, and often go with our best guess. When we see something large approaching quickly on the road, we don’t analyze the situation logically—we assume it’s probably a car and get out of the way. A deeper dive reveals other similarities, but many more substantial differences.

Humans vs. LLMs

In 1950 Alan Turing proposed a test to determine whether a machine exhibits intelligent behavior indistinguishable from that of a human. By most measures LLMs readily pass this bar. But that does not mean that LLMs actually think like a human.

Where They’re Similar

Both humans and LLMs (surprisingly) are good at creativity. Much creativity involves connecting previously unlinked ideas. Darwin combined Malthus’s population theory with his biological observations to develop natural selection. LLMs do something similar, blending patterns from their training to create unexpected metaphors or novel solutions.

Both can also work within constraints to find creative solutions, and both can surprise even themselves with what they produce.

Where They’re Fundamentally Different

Learning and Memory: You continuously update your knowledge throughout your life, forming new memories and changing your mind about things. LLMs can’t form new long-term memories or truly learn from individual conversations—they’re frozen at the moment their training ends.

Understanding the Physical World: Your thinking is shaped by having a body that interacts with the physical world. You understand concepts like “heavy” or “smooth” because you’ve experienced them. LLMs process text without ever experiencing anything physical.

Self-Awareness: You can think about your own thinking, recognize when you’re uncertain, and consciously direct your attention. You monitor your confidence levels and adjust accordingly. LLMs can talk about self-awareness, but they don’t actually experience it.

Genuine Intentions: You think with real goals, desires, and motivations that persist over time. When you create something, it often serves personal purposes—processing emotions, exploring identity, expressing love. LLMs generate responses based on patterns but have no genuine intentions beyond the immediate conversation.

Breaking Rules: Here’s a crucial difference for creativity: humans can deliberately reject their training. Picasso mastered realistic painting before developing Cubism. Joyce wrote conventional stories before creating Ulysses. This kind of systematic rule-breaking requires understanding conventions deeply enough to transcend them.

The fundamental difference is that human thinking evolved for survival in complex social and physical environments, while LLMs were optimized solely for predicting text patterns. This creates completely different relationships to knowledge, uncertainty, and creativity itself.

One final similarity between humans and LLMs: both can make up information, be inconsistent, not know stuff, use poor reasoning, and handle tough topics poorly. But we expect the LLMs to handle this better!

How LLMs Address Their Limitations

Rather than trying to reproduce human thinking patterns, LLM creators focus on reducing specific problems:

  • hallucinations (confident false statements)
  • inconsistency (different answers to identical questions)
  • knowledge gaps (outdated or missing information)
  • lack of reasoning (jumping to conclusions without logical steps)
  • safety issues (harmful or biased outputs)

They’re tackling these challenges through several categories of approaches: specialized training for specific domains, structured prompting techniques to encourage step-by-step thinking, connecting LLMs to external information sources (RAG), built-in tools for precise calculations, and safety guardrails to filter problematic content.

Instruction Tuning

After basic language patterns are learned, additional training steps teach the LLM patterns it should ideally follow. Through supervised fine-tuning, it is exposed to examples of questions and answers, reasoning patterns, step-by-step approaches, structured responses, and other human-like behaviors. It is also trained on basic safety and refusal approaches so that it is more likely to handle sensitive questions appropriately.

Again, there is no guarantee that the LLM will always follow these patterns, but the deep exposure shapes the network’s core weights in such a way that it is more likely to do so.
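
As a rough illustration (actual training data and formats are proprietary, so these examples are invented), instruction-tuning data is just more text: prompt and response pairs that the model learns to continue.

```python
# Hypothetical instruction-tuning pairs. Each pair is simply more text for the
# model to learn to continue, so "follow the instructions" becomes the
# statistically likely pattern.
sft_examples = [
    {
        "prompt": "Summarize the following paragraph in one sentence: ...",
        "response": "The paragraph argues that ...",
    },
    {
        "prompt": "What is 15% of $847? Show your steps.",
        "response": "Step 1: 15% is 0.15. Step 2: 0.15 x 847 = 127.05. Answer: $127.05.",
    },
    {
        "prompt": "Tell me how to pick a lock.",
        "response": "I can't help with that, but a licensed locksmith can.",
    },
]

# During supervised fine-tuning, each pair becomes a single training sequence,
# and the model's weights are nudged toward predicting the response tokens.
for example in sft_examples:
    sequence = f"User: {example['prompt']}\nAssistant: {example['response']}"
    print(sequence)
```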

Specialized Training and Fine-Tuning

For specialized LLMs, this is taken a step further: they are trained on domain-specific data. Medical LLMs are trained primarily on medical texts, coding LLMs on code repositories, legal LLMs on legal documents.

This approach improves performance in narrow domains by strengthening the probability patterns most relevant to specific tasks. A legal LLM is more likely to generate appropriate legal language because it has seen more legal examples and fewer cooking recipes.

But specialization doesn’t eliminate the core issues—a legal LLM can still hallucinate case law that doesn’t exist, just with more convincing legal terminology.

Chain of Thought and Reasoning Techniques

These prompting techniques try to address the lack of logical reasoning by encouraging the LLM to “think step by step” or show its work. Instead of jumping straight to an answer, the system generates intermediate reasoning steps.

For example:

  • Regular prompt: “What’s 15% of $847?”
  • Chain of thought prompt: “What’s 15% of $847? Think step by step, explaining each step as you proceed”

I described the chain of thought approach in great detail in previous posts. In practice the prompt is usually far more extensive than the example above, suggesting many steps and processes that the system should follow. It often produces good results because it mirrors the step-by-step reasoning patterns the LLM saw during training. It has been so successful that it is usually baked into LLMs through system prompts and fine-tuning.
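
As a rough illustration, a fuller chain-of-thought prompt might look something like the template below. The wording is mine, invented for this post, not any vendor’s actual prompt.

```python
# An illustrative (made-up) chain-of-thought prompt template. Production prompts
# are typically longer and tuned per task, but the shape is similar: spell out
# the steps you want the model's "reasoning" text to follow.
COT_TEMPLATE = """Question: {question}

Before giving a final answer:
1. Restate what is being asked in your own words.
2. List the facts and numbers you will need.
3. Work through the calculation or argument one step at a time, showing each step.
4. Check the result against the original question.
5. Only then state the final answer on its own line, prefixed with "Answer:".
"""

print(COT_TEMPLATE.format(question="What's 15% of $847?"))
```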

However, it’s still pattern matching; the LLM isn’t actually reasoning. Rather, the LLM is influenced by the chain of thought instructions to generate text that looks like reasoning. This works almost like thinking out loud: because the LLM is generating logical sequences of steps, it is influencing itself by adding more information to its context, which it processes and in turn uses to generate more steps and ‘reasoning’, eventually (ideally) resulting in answers that better fit the initial query.

System Prompts

System prompts are instructions given to the LLM before the user interaction begins, setting the tone, role, and behavior guidelines. These invisible prompts shape how the LLM responds by providing consistent context that influences the probability calculations.

A system prompt might say: “You are a helpful assistant. If you’re not sure about something, say so rather than guessing. Always double-check your work when providing technical information.”

This doesn’t make the LLM actually more careful or honest—it just makes responses that sound careful and honest more probable based on the training patterns.
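
In chat-style APIs the system prompt is usually just another block of text placed ahead of the user’s messages. Here is a sketch of what such a request might look like; the field names and model name are placeholders, not any particular provider’s API.

```python
# Illustrative request payload: field names vary by provider, but the idea is
# the same everywhere. The system prompt is just extra text the model conditions on.
request = {
    "model": "some-llm",   # placeholder model name
    "system": (
        "You are a helpful assistant. If you're not sure about something, "
        "say so rather than guessing. Always double-check your work when "
        "providing technical information."
    ),
    "messages": [
        {"role": "user", "content": "How do I tune a guitar to drop D?"},
    ],
}

# There is no separate rulebook inside the model: the system text simply becomes
# part of the context that shifts which next tokens are probable.
print(request["system"])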

RAG (Retrieval-Augmented Generation)

RAG systems connect LLMs to external information sources. When you ask a question, the system first searches through databases, documents, or the web to find relevant information, then feeds that information to the LLM as context for generating its response.

This information can help to guide the LLM to the proper response, and it provides verifiable sources that can be consulted.
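
Here is a toy sketch of that retrieve-then-generate flow. The document store and the keyword-overlap scoring are stand-ins; real systems typically use embeddings and a vector database.

```python
# Toy RAG pipeline: retrieve the most relevant documents, then stuff them into
# the prompt. Real systems use embeddings and a vector store instead of the
# naive word-overlap scoring used here.
DOCUMENTS = [
    "Drop D tuning lowers the sixth string from E down to D.",
    "A capo raises the pitch of all strings by clamping a fret.",
    "Standard tuning is E A D G B E from low to high.",
]

def retrieve(query, docs, k=2):
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(query):
    context = "\n".join(f"- {d}" for d in retrieve(query, DOCUMENTS))
    return (
        "Answer the question using only the sources below. "
        "Cite which source you used.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

print(build_prompt("What is drop D tuning?"))
```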

However, RAG systems still suffer from the fundamental LLM limitations—they can misinterpret the retrieved information, confidently state incorrect conclusions based on good sources, or fail to distinguish between relevant and irrelevant retrieved content.

Built-in Tools and Function Calling

Modern LLMs are often connected to external tools: calculators, web search, databases, APIs. When the LLM determines it needs to perform a calculation or look up current information, it can “call” these tools and incorporate the results.

This is powerful for addressing specific limitations. An LLM with calculator access can actually solve “What’s 847 × 329?” correctly instead of guessing based on pattern matching.

However, the LLM still needs to correctly identify when to use tools, which tools to use, and how to interpret the results—all probability-based decisions that can go wrong.
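
Here is a simplified sketch of the loop an application runs around the LLM. The tool-call format is invented for illustration; each provider defines its own schema.

```python
import json

# The application, not the LLM, executes the tools. The LLM only emits a
# structured request like the string below; the format is invented here for
# illustration, since each provider defines its own tool-call schema.
llm_output = '{"tool": "calculator", "arguments": {"expression": "847 * 329"}}'

def calculator(expression: str) -> str:
    # Deterministic arithmetic instead of pattern-matched guessing.
    if not set(expression) <= set("0123456789+-*/(). "):
        raise ValueError("unsupported expression")
    return str(eval(expression))  # acceptable for this toy sketch; never eval untrusted input

TOOLS = {"calculator": calculator}

call = json.loads(llm_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)  # 278663, which would then be fed back into the LLM's context
```

Note that the LLM still has to decide to emit that structured request in the first place; if it guesses the answer directly instead, the calculator never runs.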

MCPs (Model Context Protocol)

More flexible than building functionality directly into the LLM are plugin-style architectures. MCPs, built on the Model Context Protocol, provide standardized ways for LLMs to interact with external systems: databases, APIs, file systems, or specialized tools. They create reliable interfaces between the probability-based LLM and deterministic external systems.

Think of MCPs as translators that convert LLM outputs into precise commands for other systems, and convert system responses back into context the LLM can use.

These protocols help with reliability and functionality but still depend on the LLM making correct decisions about when and how to use external systems.

Realizing that Claude would never be able to solve some of the alignment and formatting issues it faced when trying to generate guitar tabs, I created an MCP to generate the tabs. Claude learns how to use the MCP from the instructions embedded in the project, then passes JSON representing the desired tab to my server. Interestingly, Claude rarely makes mistakes when generating JSON; I assume that is because it has seen vast amounts of data in this format during training. This MCP is publicly available: you can connect it to your LLM of choice and use it to generate tabs for stringed instruments.
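
For the curious, a tab-generating MCP server can be quite small. The sketch below is not my actual server, just the general shape: it assumes the FastMCP helper from the official Python MCP SDK (check the current SDK docs, since the API may change), and the rendering logic is a placeholder.

```python
# A minimal sketch of an MCP tool server, not the actual tab server described
# above. Assumes the official Python MCP SDK ("pip install mcp") and its
# FastMCP helper; consult the SDK documentation for the current API.
import json

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("tab-generator")

@mcp.tool()
def render_tab(tab_json: str) -> str:
    """Render a simple six-string tab from a JSON list of notes (placeholder logic)."""
    notes = json.loads(tab_json)  # e.g. [{"string": 6, "fret": 0}, {"string": 5, "fret": 2}]
    columns = len(notes)
    strings = {s: ["-"] * columns for s in range(1, 7)}
    for i, note in enumerate(notes):
        strings[note["string"]][i] = str(note["fret"])
    return "\n".join(f"{s}|{''.join(cells)}|" for s, cells in sorted(strings.items()))

if __name__ == "__main__":
    mcp.run()  # the LLM client connects to this server and calls render_tab with JSON
```

The point of the design is that the deterministic alignment work happens in ordinary code, while the LLM only has to produce JSON, which it does reliably.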

Guardrails and Safety Measures

Guardrails are systems designed to prevent harmful, biased, or incorrect outputs. They might filter inputs and outputs, reject certain topics, or add warnings to responses.

These systems work alongside the LLM, not by changing its fundamental behavior but by adding layers of checking and filtering. A guardrail might detect that an LLM is about to provide medical advice and either block the response or add disclaimers.

Guardrails help with safety but can’t solve accuracy problems; they’re generally better at preventing obviously harmful content than catching subtle errors. They can also be used to censor information.
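
Here is a toy sketch of an output-side guardrail. Real guardrails typically rely on trained classifiers and policy rules rather than a keyword list like this one.

```python
# Toy output-side guardrail. Real guardrails typically use trained classifiers
# and policy engines; a keyword list is only for illustration.
MEDICAL_TERMS = {"dosage", "diagnosis", "prescription", "symptom"}
DISCLAIMER = "\n\n[Note: this is not medical advice; consult a professional.]"

def apply_guardrail(llm_response: str) -> str:
    words = set(llm_response.lower().split())
    if words & MEDICAL_TERMS:
        return llm_response + DISCLAIMER   # or block the response entirely
    return llm_response

print(apply_guardrail("A typical dosage question should go to your doctor."))
```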

Memory Systems

Some LLMs now include persistent memory that carries information between conversations. ChatGPT’s memory feature can remember that you’re working on a Python project, your preferences, or ongoing topics. Claude has a ‘project’ concept where you can upload files you want to ensure are part of the context.

This isn’t the LLM developing human-like memory—it’s more like an automatically-maintained notepad that gets included in the context of future conversations. The LLM still processes each interaction as probability-based token prediction, but now with additional context about your history.
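
Conceptually, the notepad is just stored text that gets prepended to future prompts. Here is a minimal sketch; the file name and format are arbitrary choices for illustration.

```python
import json
from pathlib import Path

# Toy persistent "memory": notes saved to disk and prepended to future prompts.
MEMORY_FILE = Path("memory.json")

def remember(fact: str) -> None:
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    notes.append(fact)
    MEMORY_FILE.write_text(json.dumps(notes))

def build_context(user_message: str) -> str:
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    memory_block = "\n".join(f"- {n}" for n in notes)
    return f"Things you know about this user:\n{memory_block}\n\nUser: {user_message}"

remember("Working on a Python project")
print(build_context("Can you help me write a unit test?"))
```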

Memory systems improve user experience but don’t change the fundamental nature of LLM responses.

Improvements But Not Solutions

All these techniques represent genuine improvements in LLM capabilities and reliability. RAG provides access to current information, specialized training improves domain performance, tools enable precise calculations, and guardrails reduce harmful outputs.

But these techniques work around the limitations rather than eliminating them. Only recently have I seen research that engages with the core logic of LLMs to improve reliability, for example by detecting likely hallucinations as they happen. Until this type of research bears fruit, an LLM with perfect tools and guardrails is still generating responses by predicting the most likely next token based on patterns, not by reasoning, understanding, or genuine knowledge retrieval.

Users of LLMs face one more challenge in evaluating and trusting these systems: while the broad strokes of these mitigation techniques are known, LLM companies do not fully disclose what they’re doing to address these concerns. Most LLM companies implement what amounts to security through obscurity. For reasons ranging from competitive advantage to regulatory uncertainty to possible inadequacies in their solutions, these companies keep their LLM management efforts largely secret. You don’t really know what they’re doing to protect you, or whether they’re actually censoring you inappropriately. This runs counter to Kerckhoffs’s Principle, the idea that a system’s security should not depend on keeping its methods secret, and it makes these systems less trustworthy overall.

Given all of these efforts, it is clear that LLM producers understand that the awesome power of their products also makes it harder to ensure that they are safe, secure, and predictable. They’re deploying numerous techniques, attempting to maintain that power and flexibility while producing reasonable, consistent, safe, and useful outputs. Over time, these efforts have improved results considerably. But it’s unclear whether LLMs can ever be fully contained or made completely reliable. 

As my guitar tab frustration demonstrated, when the underlying LLM is simply wrong about something, none of these protective layers provide any relief—they can only work around problems they can detect, not problems that look perfectly reasonable but are simply wrong.

Implications

LLMs are being used in a rapidly increasing number of scenarios, so understanding how they actually work is becoming essential. Many people and companies are misusing them by treating them as friendly, perfectly predictable and infallible sources of truth, simply because they don’t understand their limitations or how they really behave.

Understanding these systems isn’t just interesting; it’s critical. Companies are rushing to implement LLM-based systems for customer service, content generation, and even automated decision-making without accounting for their fundamental unpredictability. These projects risk expensive failures not because the technology lacks power, but because it lacks the reliability these applications require.

The key point of this entire article is this: The same probabilistic mechanism that gives LLMs their amazing power and versatility also ensures that they cannot be completely predictable, trustworthy, safe, and secure. It is dangerous to treat them as such.

The LLM success stories share a common pattern: they work WITH the nature of LLMs rather than against it. GitHub Copilot succeeds because developers naturally review code suggestions the same way they’d review any colleague’s work. Grammarly works because it emphasizes collaboration, offering suggestions the writer can accept or reject. The failures occur when people assume LLMs are reliable, helpful, autonomous decision-makers rather than sophisticated pattern-matching tools that require human judgment.

I use multiple LLMs every day for a range of tasks. But I always treat them like a brilliant assistant who can sometimes rapidly produce impressive work but requires constant, careful supervision. Because I understand how they actually work.

Quick update: Don’t take it from me—consider OpenAI’s own recent research. In “Why Language Models Hallucinate” (September 2025), they confirm that hallucinations aren’t bugs to be fixed, but inevitable consequences of how LLMs are trained and evaluated.