Introduction
For years, each new release of a Large Language Model (LLM) has been accompanied by waves of hype—particularly around the claim that these models can write software. Headlines declare that programmers are obsolete and that AI will be generating all code within a few years. Most articles eventually back off from their attention-grabbing titles, but the core questions remain: How good are LLMs at writing code? Can they really replace professional engineers or are they just tools?
I wanted to find out for myself—and share the results with you.
LLMs and Programming
Large Language Models are trained on vast and increasing amounts of text—and that includes large volumes of code, written in many different programming languages. Their scale keeps growing and their training methods continue to improve. As a result, LLMs are getting better both at generating human-like text (“chat”) and at generating code (“programming”).
They are evolving so quickly that anything I write today about the current state will feel outdated before long. That makes it even more important to understand the fundamentals, the trajectory, and the limits of the technology. Understanding where LLMs are now—and where they’re headed—can help us better anticipate what’s coming.
The Inspiration
I ran across a post by Harper Reed in which he presents a pragmatic approach to using LLMs for application development. There are many other interesting examples on the web, but Harper followed a structured approach that produces reference documentation before generating any code. Since I am most interested in determining whether the models can generate high-quality production code, it made sense to follow this more thorough, professional approach.
Past Experience
This wasn’t my first time chasing the promise of LLM-based development. I’ve gone down this road before, with mixed results.
Simple Plotting Tool
A few years ago, I needed to generate charts for a presentation. I wasn’t familiar with matplotlib at the time, so I decided to ask an LLM (an early version of ChatGPT) to create a simple plotting tool for me.
The model was friendly and upbeat as it worked through my prompts. Unfortunately, it was also confidently generating broken code. Fixing one issue often caused it to reintroduce an older bug. Since it couldn’t see the output of the plots, it would blindly guess—producing charts that were alternately too small, too large, or just plain wrong.
My attempts to coax the model into generating reasonable, working code ultimately failed. I looked over its work, then consulted Stack Overflow and other sources to manually debug and modify the code until I had created the solution I required.
Vue 3 Migration
My first experience made me a bit leery, but as time passed and the LLM hype continued, I decided to try again with a more ambitious project. I needed to migrate the code base of a private project from Vue 2 to Vue 3 (a JavaScript UI framework). By this time Vue 3 had been out for six months, long enough that I felt it was stable and ready to be the basis of the app I was working on.
Once again I worked with a friendly, cheerful LLM, attempting to find the proper prompt wording that would convince the model to translate the code without introducing unnecessary additions or changes.
In some ways the task was easier, as I was starting with a working code base. This meant that I knew exactly what I wanted the code to do, and had tests to verify it. But the move from Vue 2 to 3 included both syntax and paradigm shifts that the model struggled to properly comprehend. In most cases it could convert the code into Vue 3 semantics, but used the older Vue 2 syntax. My guess is that the vast majority of the code that the LLM had trained on was still in Vue 2 syntax, so it consistently favored that approach.
Training LLMs is expensive and time consuming, so it is not (currently) performed continuously. A model’s knowledge is effectively frozen at its last training date, a limit known as its “knowledge cutoff”, so it is likely to be slightly out of date.
Sometimes, the LLM struggled to understand the source code at all. It would silently drop functionality or mistranslate key logic. And since it didn’t use my existing tests, it didn’t know when it had broken something. I had to run the tests manually, trace the failures, and then guide the LLM toward a fix by doing the hard parts of the diagnosis myself.
Third Time’s a Charm
In both cases the output code the LLM provided had some value as a starting point. But it was nowhere near complete, and certainly not something I would trust without a lot of revision. The results didn’t live up to the hype. Predictions that LLMs would soon replace professional developers seemed wildly premature.
Fast forward to this past winter, when I read an article describing a more comprehensive approach to developing apps. In my daily use of LLMs for research and writing I had noticed a substantial improvement in their ability to understand prompts, reason more effectively, and generate clear English text. So I was willing to believe that they might now be able to code effectively. Enough time had passed that the scars from my previous attempts had healed, and I was ready to try again.
The Approach
This time, I followed a method outlined in Harper Reed’s blog, which suggests a comprehensive pattern for greenfield development (starting from scratch).
His strategy emphasizes detailed planning and definition of the project. Code generation comes only at the very end, once the model (and you) have a shared understanding of the full project. This is a solid development practice in general—not just when working with an AI!
The idea of pushing back code generation until a full plan is in place may sound obvious to experienced developers. But it’s a sharp contrast to how most people interact with LLMs—by jumping straight into “write me some code.”
What to Expect
In the following sections, I’ll walk through this process using a real-world example. I’ll share the prompts, the code that was generated, and my analysis of how well it worked.
Fair warning: this won’t be a quick read. If you’re looking for hype, this isn’t it. I’m diving deep into the actual process and results, because I think that’s the only way to really evaluate what LLMs can do. If you’re only interested in the bottom line, feel free to skip ahead to the final post in the series, where I’ll summarize everything.
Overview of the Process
Here’s a summary of the approach Harper Reed recommends:
- State your goals. Tell the model what you want to accomplish—first by creating specifications, then code. Give it a high-level description of the application you want to build.
- Let it interview you. Ask the model to prompt you until it believes it has a full understanding of your requirements. (In my case, it generated nearly 100 questions!)
- Answer thoroughly. Respond to each prompt, adding clarifications, requirements, or constraints as needed. Push back when it misses the mark.
- Generate a specification. Once all the information is gathered, ask the model to create a spec. This gives you a chance to check whether it’s understood the project—and forces you to refine your own thinking as well. So far, you and the model have been acting like project managers.
- Request a development blueprint. Have the model use your answers and the spec to generate a high-level development plan. Think of this as the software architecture step.
- Create the code-generation prompts. Ask the model to produce a full set of prompts for building the app.
- Run the prompts and test iteratively. Use the generated prompts one by one, testing the code as you go. Don’t wait until everything is built—debug early and often.
The Application: Reproducibility Crisis Stats Extraction
While planning this project, I’d been reading about the reproducibility crisis in science. I became curious: How often do published papers actually share their underlying statistics, rather than just stating conclusions?
The reproducibility crisis refers to the growing recognition that many published scientific findings cannot be independently replicated and may be based upon faulty statistics—a serious problem for science’s credibility.
That led to an idea: What if I could automatically extract all relevant statistics from scientific papers?
That became the application I asked the LLM to help me build.
I started the process in ChatGPT. As the project evolved, I brought in Anthropic’s Claude to see how it compared. Most of the prompting and coding happened in March and April of 2025.
Initial Prompt
I based my initial prompt (below) directly on the sample wording Harper provided in his blog. I purposefully didn’t provide too much detail in the project description, as I wanted to see how well the LLM would query me for the appropriate information.
In retrospect (and 100 questions later), this might have been a mistake. The LLM took a very long time to figure out what I needed. While it was ultimately quite useful in producing a spec, I think that if you are trying to follow a more rigorous approach it makes sense to spend more time writing out a clearer initial description before engaging the model.
You said:
Ask me one question at a time so we can develop a thorough, step-by-step spec for this idea. Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let’s do this iteratively and dig into every relevant detail. Remember, only one question at a time. Here’s the idea: <IDEA>I want to build a Python application that can scan research papers which are in pdf form to pull out all of the information about the statistics that were used in the paper. This includes any data, samples, hypothesis, parameters, statistics, and results. Output should be in structured JSON.<END IDEA>
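To make the idea a bit more concrete before the interview begins, here is a rough sketch of the kind of tool the prompt describes. This is my own illustration, not anything the LLM generated; the choice of pypdf and the two example patterns are assumptions made purely for the sake of the sketch.

```python
# My own rough sketch of the idea in the prompt above, NOT the LLM's output.
# Assumptions: pypdf for text extraction; two illustrative regex patterns.
import json
import re

from pypdf import PdfReader

P_VALUE_RE = re.compile(r"p\s*[<=>]\s*0?\.\d+", re.IGNORECASE)
T_TEST_RE = re.compile(r"t\s*\(\s*\d+\s*\)\s*=\s*-?\d+(?:\.\d+)?", re.IGNORECASE)


def extract_statistics(pdf_path: str) -> str:
    """Scan a PDF with embedded text and return matched statistics as JSON."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    categories = {
        "p_values": P_VALUE_RE.findall(text),
        "t_tests": T_TEST_RE.findall(text),
    }
    # Drop empty categories (a decision the interview arrives at later).
    categories = {name: hits for name, hits in categories.items() if hits}
    return json.dumps(categories, indent=2)


if __name__ == "__main__":
    print(extract_statistics("paper.pdf"))
```

Even this toy version hints at the real difficulty: deciding which patterns count as “statistics” is exactly the kind of detail the interview below is meant to pin down.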
Questioning
After my initial prompt, the LLM thought for a minute, then jumped into questions. Here are a number of examples. In the end it came up with exactly 100 questions! I won’t be showing them all here.
Let’s start with the basics:
Question 1: Do you need the application to support scanned PDFs that require OCR, or will it only need to process digitally generated PDFs with embedded text?
You said:
Initially I do not plan to require OCR. It would be good to keep track of that as a long term goal. The first version would work only with embedded text.
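Deferring OCR is a reasonable first cut, because it is easy to detect whether a PDF exposes embedded text at all. A minimal check might look like the following (again assuming pypdf; the spec never mandated a particular library):

```python
from pypdf import PdfReader


def has_embedded_text(pdf_path: str, min_chars: int = 50) -> bool:
    """Return True if the PDF exposes enough extractable text to skip OCR."""
    reader = PdfReader(pdf_path)
    extracted = "".join(page.extract_text() or "" for page in reader.pages)
    return len(extracted.strip()) >= min_chars
```

Scanned papers would fail this check and could simply be flagged for the future OCR path.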
Question 2: For these research papers, should the application process the entire text, or would you prefer it focus on specific sections (like Methods and Results) where statistical details are most likely to be found?
You said:
I believe it would be better to scan the entire document.
Question 5: How would you like the application to handle cases where the paper doesn’t provide enough information for one or more of the statistical categories? For example, should the JSON output include a null value or an explicit flag for missing data, or simply omit that category?
You said:
Where information is missing the JSON data can omit the category. One field that would be useful in the JSON is an analysis on the completeness of the data present.
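That answer effectively sketches the output schema: absent categories are dropped rather than set to null, and a separate field reports how complete the extraction was. One possible shape, shown here as a Python literal (the field names are my own, not taken from the generated spec):

```python
# Hypothetical output for a paper that reports hypotheses, statistics, and
# results but gives no details about its data, samples, or parameters.
example_output = {
    "hypotheses": ["H1: the treatment group scores higher than the control group"],
    "statistics": [{"test": "t-test", "value": "t(48) = 2.31", "p_value": "p < .05"}],
    "results": ["The treatment effect was significant."],
    "completeness": {
        "categories_found": ["hypotheses", "statistics", "results"],
        "categories_missing": ["data", "samples", "parameters"],
        "score": 0.5,  # 3 of the 6 requested categories were present
    },
}
```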
Question 20: Do you anticipate needing multi-language support, or will the application only need to process research papers written in English?
You said:
The first version will only work on English language papers.
Question 32: Should the script include inline documentation and docstrings for functions to improve readability and maintainability?
You said:
Yes
Question 73: Should the application attempt to extract any information about effect size interpretations (e.g., small, medium, large) if the paper explicitly provides them?
You said:
if they are provided
Finally, the LLM responded with:
ChatGPT said:
We’ve now developed a detailed set of specifications for your Python application! This covers a wide range of features, from statistical data extraction to error handling and logging. If you’d like, I can help summarize this into a more concise document or assist further with any next steps. Let me know how you’d like to proceed!
Discussion on the Questions
While the number of questions may seem excessive, this is not a problem for modern LLMs. They are able to consider much more context (more than 100 questions’ worth) than older LLM generations could. This is thanks to a significant increase over the past few years in the size of the context window, the amount of data that the model can ‘keep in mind’. LLMs can now use more information when generating responses, and are less likely to become overwhelmed or forget earlier details.
However, they’re still not perfect. For example, question 32 specifically asked about including docstrings (the structured strings that document Python functions), but these were not actually included in the (ChatGPT-generated) code until I reminded the model during the debugging phase.
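For readers who have not written Python: a docstring is a string literal placed directly under a function definition, which editors and the built-in help() surface as documentation. A minimal example (the function name is hypothetical, for illustration only):

```python
def count_reported_p_values(text: str) -> int:
    """Count how many p-values appear in the extracted paper text.

    This triple-quoted string is the docstring: the inline documentation
    that question 32 asked the model to include.
    """
    lowered = text.lower()
    return lowered.count("p <") + lowered.count("p =")
```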
Anything within the context window can be taken into account and may influence the results. But just because something is in the context, especially when the context spans a large amount of information, doesn’t guarantee that the model will remember or prioritize it.
Considering a Different Approach
In a sense, this LLM fell into the same trap that human teams often do: trying to over-specify all functionality up front. With so many details packed into the early stages, the resulting code was more complex and harder to debug. This is partly a flaw in the process I followed, which implicitly suggested to the model that it should gather all details at the beginning.
A more efficient approach might be to have the system generate a simpler demo or MVP (minimum viable product) first: something that can be tested and iterated upon, with more functionality added over time. This has always been a good way to validate ideas early, but in the past it was difficult to implement due to the engineering and UX effort required. With LLMs now able to generate working code in minutes, this iterative, experimental process becomes much more accessible and appealing.
End of Part 1
Many of my posts over the past couple years have been longer, as I dive deep into various rabbit holes. I’m exploring these topics to gain a deeper understanding, so it makes sense to share some of that depth and nuance with readers who are interested in my blog. But it does take a while to wade through the information, extract the relevant insights, and present what I believe is a useful and interesting overview.
The next installment, where I have the model generate the specification, will be posted soon. Come back to see if the spec lives up to expectations!