This is the third post in a series exploring how well large language models (LLMs) can write real-world software. In Part 1, I explained the motivation behind this investigation and outlined the development process I would follow (inspired by Harper Reed). My application of choice: a program that automatically extracts statistical data from academic PDF papers. In Part 2 I collaborated with the LLM to define the full scope of the application, refined the generated documentation, and received a set of structured prompts to use for code generation.
The Code
Now it was finally time to work with the actual code. I fed each code-generation prompt into the models (ChatGPT and Claude), then compiled, ran, and tested the resulting code.
You can find the full output of this exercise on GitHub:
Prompt 1: Basic Setup
I handled the initial project setup manually, making a few small adjustments for my Mac-based environment.
The first generated prompt focused on establishing the directory structure and initializing the parser. This step was a good sanity check—it confirmed that the environment was properly configured and that dependencies were in place before diving into more complex functionality.
The initial code snippet worked correctly on the first try, which was encouraging. That said, I noticed some differences in quality between the models. ChatGPT’s code was functional but barebones—missing elements like comments, tests, and logging. Claude, on the other hand, took a more thoughtful approach and included foundational pieces like basic logging and docstrings right from the start.
Prompt 2: Text Extraction
With the environment in place, the second prompt began implementing real functionality—starting with PDF text extraction.
Both ChatGPT and Claude opted to use PyMuPDF, imported as fitz, to extract text from PDFs. I hadn’t used the package before, so I was impressed that both models selected and implemented it without any prompting. In hindsight, I should have asked the models to explain why they chose this package and what tradeoffs it might entail—a best practice when introducing new dependencies.
Historically, I’ve struggled to get even simple model-generated code running, so I was pleasantly surprised to see this step work without issue.
Claude’s output was again more complete. It included basic exception handling and logging, while ChatGPT offered a minimal implementation. Still, the tests provided by both models remained superficial. For example, Claude verified behavior with missing files, but neither model attempted to simulate the messy edge cases you’d expect from real-world academic PDFs. The overall feel was that of a capable intern who had run a smoke test but hadn’t thought to tackle the tougher scenarios.
Prompt 3: Cleaning the Text
At this stage, the divergence between ChatGPT and Claude became more pronounced.
ChatGPT followed the prompt instructions in a literal, almost minimalist way. Its clean_text() function removed extra spaces, stripped newlines, and normalized UTF-8 characters—but not much more. Claude’s implementation was far more thorough, applying around eighteen distinct cleaning steps, including:
- Handling a range of whitespace issues
- Removing encoding artifacts and control characters
- Fixing possible spacing issues in numbers and punctuation
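To give a flavor of that style of cleaning, here is a minimal regex-based clean_text() sketch. The specific rules are illustrative stand-ins, not either model's exact code:

```python
import re
import unicodedata


def clean_text(text):
    """Apply a few representative cleanup passes to extracted PDF text."""
    # Normalize Unicode (ligatures, compatibility forms) to a canonical form
    text = unicodedata.normalize("NFKC", text)
    # Strip control characters, keeping newlines and tabs
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Rejoin digits split by a stray space, e.g. "0. 05" -> "0.05"
    text = re.sub(r"(\d)\.\s+(\d)", r"\1.\2", text)
    return text.strip()
```

Each rule looks harmless in isolation; the trouble is knowing whether the set as a whole is complete, and whether any rule silently corrupts text it wasn't designed for.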
Both models leaned heavily on regular expressions (regexes), attempting to anticipate and clean up common patterns found in academic texts. But this approach has its limits. Regex-based logic can be fragile—especially when parsing semi-structured documents where formatting conventions vary widely. What works for one paper might completely break on the next.
It would take real research to verify whether these cleaning steps are both necessary and sufficient, and whether the regexes are reliable across varied source material. Unfortunately, the tests generated by both models were too shallow to provide that confidence.
Verifying the Code
This raises one of the central challenges of LLM-generated code: even if it looks plausible, is it actually correct, complete, and robust?
If I were preparing this code for production, this is where I’d pause. I would ask the author, whether human or AI, to walk me through each decision: Why these cleaning steps? What edge cases were considered? What was left out?
At a minimum, I’d expect them to:
- Generate both positive and negative test cases for each cleaning rule
- Justify each regex and transformation
- Perform basic regression testing on a few known example documents
Compare this generated code to another source of software: traditional open-source libraries. When I use a mature library, I can consult documentation, GitHub issues, Stack Overflow posts, and community reviews. I can assess maintenance activity, test coverage, and even contribute back fixes if needed.
By contrast, AI-generated code is more like grabbing a dependency from an obscure repo with no README. It might be brilliant—or it might silently fail in subtle, hard-to-detect ways. Either way, you are now the first person responsible for vetting it.
Prompt 4: Detecting Sections
So far, I’ve said that Claude produces more sophisticated results than ChatGPT—but left it up to you to explore the differences on GitHub. Let me now show you a concrete example.
The fourth prompt asked both models to:
“Detect section headings heuristically (e.g., ‘METHODS’, ‘RESULTS’).”
Here’s how each model approached the task:
ChatGPT:
if re.match(r'^(METHODS|RESULTS|DISCUSSION|CONCLUSION|REFERENCES)$', line, re.IGNORECASE):
Claude:
COMMON_SECTIONS = {
    'abstract': ['abstract', 'summary'],
    'introduction': ['introduction', 'background'],
    'methods': ['methods', 'methodology', 'materials and methods', 'experimental'],
    'results': ['results', 'findings', 'observations'],
    'discussion': ['discussion', 'analysis', 'interpretation'],
    'conclusion': ['conclusion', 'conclusions', 'concluding remarks'],
    'references': ['references', 'bibliography', 'works cited']
}
Claude’s approach clearly reflects a more robust design. It accounts for semantic variety—recognizing that “Materials and Methods” might serve the same purpose as “Methods”, or that “Findings” might align with “Results.” ChatGPT, by contrast, uses a much narrower match: if the line doesn’t exactly say “RESULTS” or “METHODS,” it’s ignored.
This is a great illustration of what I’ve seen repeatedly: Claude often tries to generalize and interpret, while ChatGPT often just matches and repeats.
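Claude's lookup table also lends itself to a matcher that tolerates numbering and case. The sketch below is mine, using a trimmed version of the table above, to show how the lookup would actually be consulted:

```python
import re

# Trimmed version of Claude's lookup table, for illustration
COMMON_SECTIONS = {
    'methods': ['methods', 'methodology', 'materials and methods'],
    'results': ['results', 'findings'],
    'discussion': ['discussion', 'analysis'],
}


def detect_section(line):
    """Return the canonical section name for a heading line, or None."""
    # Strip leading numbering such as "3." or "IV." before comparing
    heading = re.sub(r'^\s*(\d+\.?|[IVX]+\.)\s*', '', line).strip().lower()
    for canonical, variants in COMMON_SECTIONS.items():
        if heading in variants:
            return canonical
    return None
```

Even this more flexible version is still a fixed vocabulary: "Experimental Setup" or "Related Work" would fall through unless someone thinks to add them.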
Why Not Use an LLM?
One thing stood out to me here: neither model suggested using an LLM to parse the section headers, even though this is exactly the sort of fuzzy, context-rich task that LLMs excel at.
Instead, both relied entirely on regexes and string matching, a strategy more at home in a 1990s script than in a 2025 ML-powered toolchain. It works fine as long as the formatting is predictable and consistent, but struggles when structure is looser or terminology varies.
I was a little surprised. You’d think this would be a perfect use case for an LLM-based subcomponent—especially since we already have one in the loop! Parsing natural-language sections to detect headings, extract statistics, and understand flow is where LLMs shine.
What if I redid the implementation, but specifically directed the models to use LLMs to process the text? I suspect the results would be far more flexible and would capture more data. But they would also be slower, since blocks of text would need to be sent to a backend model for inference.
It would also introduce complexity around testing. The current model-generated code is deterministic—given the same input, it always returns the same output. But LLM-in-the-loop systems can behave differently depending on model version, context window, or even temperature (randomness) settings. You might also have to deal with hallucinations—imaginary headings or values invented by the model based on inference rather than facts.
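For what it's worth, an LLM-in-the-loop version might look something like the sketch below. The `complete` callable is hypothetical, a stand-in for whatever API call sends a prompt to a backend model, and the JSON answer format is my own invention:

```python
import json


def classify_heading(line, complete):
    """Ask a backend model whether a line is a section heading.

    `complete` is a hypothetical callable: prompt string in, model text out.
    """
    prompt = (
        'Is the following line a section heading from an academic paper? '
        'Answer only with JSON like {"is_heading": true, "section": "methods"}.\n'
        f'Line: {line!r}'
    )
    try:
        answer = json.loads(complete(prompt))
    except (ValueError, TypeError):
        return None  # unparseable response -- a real hallucination risk
    return answer.get("section") if answer.get("is_heading") else None
```

Note how much of the code is defensive: parsing and validating the model's reply is exactly the nondeterminism tax described above.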
Prompt 5: Extracting P-Values
At this point, the prompt instructed the models to generate code to extract what appeared to be p-values, a common statistical metric found in scientific research. The prompt explicitly specified the use of regular expressions, and with that, all the limitations of regex-based approaches applied.
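The regex approach for this step looks roughly like the following; the pattern is my simplified composite of what the models produced, not their exact code:

```python
import re

# Matches forms such as "p = 0.03", "p < .001", or "P<0.05".
# Note the alternation order: "<=" must precede "<" to match fully.
P_VALUE_RE = re.compile(
    r"\bp\s*(=|<=|>=|<|>)\s*(0?\.\d+(?:e-?\d+)?)",
    re.IGNORECASE,
)


def extract_p_values(text):
    """Return (operator, value) pairs for every p-value-like match."""
    return [(m.group(1), float(m.group(2))) for m in P_VALUE_RE.finditer(text)]
```

Even here the fragility shows: phrasings like "a p-value of 0.03", typeset exponents such as "×10⁻³", or thresholds written in words all slip through the net.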
Even as the application grew more complex with each prompt, the code remained largely well-formed and executable. As a senior engineer reviewing the output, I suggested improvements to enhance algorithmic flexibility, modular code structure, documentation, logging, and testing.
The LLMs responded reasonably well to this interactive feedback. Most of the time, they incorporated my suggestions correctly. But, as I’ve seen in previous experiments, they occasionally forgot what they’d already done. For instance, I once asked ChatGPT to add logging, which it did. But in the process, it removed a necessary import, resulting in code that no longer compiled. This kind of regression is a recurring issue: LLMs sometimes think locally, applying your last instruction without reliably maintaining global coherence.
Prompt 6: Extracting Statistical Values
This prompt aimed to extract various statistical metrics from the documents, once again using regular expressions.
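As a concrete illustration of what "regexes for statistics" means in practice, here is a sketch that pulls APA-style test statistics such as "t(24) = 2.53" or "F(2, 45) = 3.10". The pattern is mine, not the models' exact code:

```python
import re

# Matches APA-style test statistics: a t or F label, degrees of freedom
# in parentheses, and the test value, e.g. "t(24) = 2.53", "F(2, 45) = 3.10"
STAT_RE = re.compile(
    r"\b([tF])\s*\(\s*(\d+)(?:\s*,\s*(\d+))?\s*\)\s*=\s*(-?\d+(?:\.\d+)?)"
)


def extract_stats(text):
    """Return a dict per detected test statistic."""
    out = []
    for m in STAT_RE.finditer(text):
        out.append({
            "test": m.group(1),
            "df": tuple(int(d) for d in m.groups()[1:3] if d),
            "value": float(m.group(4)),
        })
    return out
```

A pattern like this captures only papers that follow the convention exactly; a stray subscript, an italicized label lost in extraction, or a non-APA journal style defeats it entirely.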
Claude continued to generate basic tests, but as its code became more intricate, the testing failed to keep pace. At this point, the implementation was too complex for a quick visual scan to determine correctness, so I created my own tests using real-world inputs: a handful of scientific papers pulled from the web.
The results were spotty.
Scientific papers are inconsistent in structure and content. They don’t follow a uniform layout, and they don’t always include the kinds of data the apps were designed to find. As anticipated, the regexes the models relied on were too rigid and simplistic, often capturing only the most obvious pieces of information.
Despite this, the models continued to confidently churn out code and accept my feedback, oblivious to the fact that the results were, in practice, not particularly useful.
As I’ve touched on before, this is a fundamental limitation of current LLMs: they can mimic understanding, but they don’t possess intent or contextual awareness in the human sense. They don’t step back and ask, “Wait—is this solving the actual problem?” They just keep following instructions.
You can prompt, prod, or rephrase your goals, and sometimes, if you find the right wording, they’ll inch closer to your intent. But other times they’ll veer off course, hallucinate logic, or forget what they’re doing. It feels like working with a disengaged contractor—they’ll do what you say, but without ever deeply investing in the project’s purpose.
Prompt 7: Outputting JSON
The seventh prompt was straightforward: format the extracted results as structured JSON output.
No surprises here. Both ChatGPT and Claude handled the task smoothly, producing clean, well-structured JSON. This was a welcome change of pace.
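The output stage really is as simple as it sounds. A minimal version looks like this; the field names here are my own, not the generated schema:

```python
import json


def format_results(source, p_values):
    """Serialize extracted values into a structured JSON string."""
    record = {
        "source": source,
        "p_values": [{"operator": op, "value": v} for op, v in p_values],
    }
    return json.dumps(record, indent=2)
```

Tasks like this, with a well-defined input, a well-defined output, and a standard-library solution, are exactly where both models are at their most reliable.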
Sidebar: Returning to the Session
One small but relevant note: after running a marathon session with ChatGPT, I eventually had to sleep. When I returned a few days later, I was worried I’d have to re-feed all the previous context for the model to pick up where we left off.
When I asked ChatGPT about this, it said:
I don’t have memory of past conversations, so I don’t have the context from your previous prompt. But if you give me a quick refresher on the PDF extraction spec, I can pick up where we left off!
This answer is technically true for the model itself—but it’s no longer the whole story.
ChatGPT now includes a sidebar that stores your previous threads, so I was able to click on the “PDF Stats Extraction Spec” conversation and jump right back in with the full context restored. It’s a subtle but incredibly helpful UI feature, one that makes iterative, multi-session development with LLMs much more feasible.
The Final Prompt
The so-called Final Prompt didn’t introduce much new functionality. For ChatGPT, it finally added logging, rather late in the process.
At this point, I ran my own test papers through the ChatGPT application, and nothing came out.
This led to a not-so-surprising realization: it’s much harder to debug code you didn’t write. While I had been reviewing the code throughout, this was the first time I needed to seriously dive in and troubleshoot.
I started with the most obvious next step: asking the model to diagnose the issue. Its suggestion? Add print() statements at key locations to inspect intermediate results. Technically valid—but also a rather old-school approach.
I then asked whether the model could guide me through using a debugger instead. It responded that it couldn’t interact directly with a debugger, but could walk me through the steps required to use one.
What followed was a familiar back-and-forth: I’d try a suggestion, report back, and it would suggest the next move. In the end, much like my earlier LLM coding attempts, I abandoned the model and quickly tracked down the bug myself.
I remain unimpressed by current LLM debugging support. The model can offer help, but it’s procedural and shallow, lacking intuition or initiative.
Enhancement Attempts
After getting the application to a working state, I began requesting refinements. Unfortunately, the results were mediocre.
For example, the print() statements suggested during debugging probably should have been structured as logging calls from the beginning. When I asked the model to convert those into proper logs and improve the logging overall, it complied—but only updated the areas I explicitly mentioned. It added no additional logging where it might logically belong. Again, it behaved like a junior engineer: doing what was asked, but not taking initiative to improve the broader system.
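The kind of conversion I had in mind is small but systematic: replace ad-hoc print() calls with leveled, named loggers throughout the codebase, for example:

```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("pdf_stats")  # hypothetical module name


def report_extraction(path, values):
    # Before: print(f"found {len(values)} p-values in {path}")
    # After: leveled and filterable, and easily silenced in production
    logger.debug("found %d p-values in %s", len(values), path)
```

The point isn't any one call site; it's that a human reviewer would apply the pattern everywhere it fits, while the model only touched the lines I named.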
I also asked for better comments throughout the code. After a few iterations, the quality did improve, but not dramatically. The final comments were technically accurate, but bland, and didn’t offer much insight into why the code was structured the way it was. I could have more quickly produced better comments myself.
Like the logging, the model did a decent job with very specific tasks, but it struggled to generalize feedback into holistic improvements.
End of Part 3
At this point, I’ve walked through the full process—from initial idea to a working (if imperfect) implementation, using only LLMs as development partners. Along the way, I encountered moments of real progress, plenty of frustration, and a few unexpected lessons.
In the next installment, I’ll take a step back and try to answer the big questions:
Where do LLMs work best for coding? Where do they fall short? Could an LLM replace a professional developer, or are they just tools to help developers?
I’ll unpack what worked, what didn’t, and what it all might mean for the future of coding with AI.