Are LLMs Replacements for Developers or Productivity Tools? (Part 2)

The hype surrounding LLMs suggests that professional developers may soon be obsolete—that tools like ChatGPT and Anthropic’s Claude can now handle all your coding needs with just a few prompts. Obviously, that’s an exaggeration. But what is the real state of things? How good are LLMs at writing code today? Can they truly replace professional engineers, or are they better thought of as productivity tools?

I recently set out to investigate this again, following what I hoped would be a more rigorous and ultimately more successful approach than some of my previous attempts (the process described in Harper Reed’s blog on this topic). In Part 1, I shared an overview of this proposed process, described the project idea, and walked through the 100 questions ChatGPT asked to clarify the requirements.

The Specification

With the questioning phase complete, it was time to generate the project specification. Once again, I followed Harper Reed’s suggested format when crafting my prompt:

You said:

Now that we’ve wrapped up the brainstorming process, can you compile our findings into a comprehensive, developer-ready specification? Include all relevant requirements, architecture choices, data handling details, error handling strategies, and a testing plan so a developer can immediately begin implementation.

I’m including the full specification below, since it’s far more concise than the preceding question-and-answer session. In addition, the blogs and videos I’ve looked at tend to skip this step. I believe that generating specifications and documentation is a good use case for LLMs, so it is worthwhile to see how they perform.

If this were a spec produced by a product management team, I would point out that it includes both general requirements (good) and specific implementation details (bad: that is the job of the developers). I prefer to keep these two kinds of information separate, since they are generally owned by different groups (Product Management vs. Engineering).

But this specification was generated by an LLM and will be interpreted by that same LLM. It’s not clear that models struggle with the kinds of organizational boundaries or role distinctions that humans do. In fact, I suspect that a more structured and consistent spec, possibly one that blends planning and technical guidance, may actually help the model generate better, more complete code.

This is a key takeaway when working with LLMs: While they can produce surprisingly impressive output, the quality of that output still depends heavily on the completeness, clarity, structure, and specificity of your prompt. 

Specification Output

The LLM responded with the following document. I’ve slightly compressed the text to keep it manageable, but for the most part, the model did a good job organizing and formatting the material.

ChatGPT said:

Sure! Here’s a comprehensive developer-ready specification based on our discussion:

Python Application for Extracting Statistical Data from Research Papers (PDFs)

Objective:

Develop a Python application that extracts statistical information from research papers in PDF format, structures the data into a hierarchical JSON format, and logs the extraction process for analysis and debugging.

Key Functional Requirements:

    • Input and Output:
      • Input: Path to a single PDF file.
      • Output: A JSON file with the same name as the input file but with a .json extension.
      • JSON file should be human-readable (indented format).
      • The JSON will contain extracted statistical data organized in a structured format.
    • Data to be Extracted:
      • Statistical Tests: Type of test (e.g., t-test, ANOVA), p-values, effect sizes, confidence intervals, degrees of freedom, etc.
      • Data Summary: Sample sizes, means, standard deviations, standard errors, and any other relevant data.
      • Regression Analysis: Coefficients, R² values, model significance, etc.
      • Effect Sizes: Including Cohen’s d, odds ratios, and any other effect sizes mentioned.
      • Bayesian Analyses: Bayes factors, credible intervals.
      • Missing Data: Handling and methods for missing data (e.g., imputation).
      • Outlier Handling: Any mention of outliers and how they were handled (e.g., removal, winsorizing).
      • Test Assumptions: Assumptions such as normality, homogeneity of variance, etc.
      • Data Availability: Information regarding data sharing or availability if mentioned.
      • Statistical Software/Packages: Any tools or software used for analysis.
      • Statistical Results Context: Indicate whether results are exact values or thresholds (e.g., p < 0.05).
      • Notes on Linkage: If statistical data is linked across multiple sentences or sections, a note is included.
    • JSON Structure:
      • Each type of statistical test or analysis should have a dedicated section in the output (e.g., “statistical_tests”: [], “regression_analysis”: [], “effect_sizes”: []).
      • The output should also contain:
        • “summary”: A section summarizing the extracted data, including counts of key statistics and any missing values.
        • “data_completeness_analysis”: A report on missing data, key omissions, and potential inconsistencies.
        • “outlier_handling”, “missing_data_handling”, “test_assumptions”, and other relevant sections for specific methods or handling.
Architecture and Design Choices:
      • Libraries and Tools:
        • PDF Text Extraction: Use PyMuPDF (also known as fitz) for extracting text from PDFs.
        • Data Parsing: Regular expressions (regex) will be used to extract specific statistics, such as p-values, effect sizes, and sample sizes.
        • NLP Techniques: Use basic NLP to identify and link related pieces of information across sentences or sections, based on proximity.
        • JSON Output: Utilize Python’s built-in json library to structure and write the output.
      • Preprocessing:
        • Extract all text from the PDF, and clean the text to remove unnecessary characters or artifacts (such as extra spaces, page numbers).
        • Segment the text into paragraphs, sentences, and identify key areas (e.g., Methods, Results) to improve extraction accuracy.
      • Data Extraction:
        • Use regex to find statistical keywords, numerical values, and tests (e.g., p-values, t-tests, ANOVA).
        • Extract information from both inline text and tables (if possible).
        • Extract confidence intervals, degrees of freedom, and any reported data transformations (e.g., log transformations).
      • Error Handling:
        • Minor Errors: If extraction fails for a specific test or result, log the error and continue processing the document.
        • Major Errors: If the file cannot be opened, text cannot be extracted, or a system error occurs, log the error and terminate the application.
        • Logging Levels: Include configurable logging with multiple levels (e.g., ERROR, WARNING, INFO, DEBUG). Logs should be written to a log file.
      • Completeness and Validation:
        • The application will include a data_completeness_analysis section to highlight any missing key values (e.g., missing sample sizes or p-values).
        • The application will log and report potential issues with extracted data, including inconsistencies or omitted statistical values.
      • Logging:
        • Log both successful and unsuccessful data extractions.
        • Include detailed logs for debugging (e.g., which tests were found, which could not be extracted).
        • Use a log file to store output, with error levels and detailed messages.
Command-Line Interface:
    • Input:
      • Command: python extract_stats.py /path/to/input.pdf
      • Input file path should point to a single PDF file.
    • Output:
      • A JSON file with the same name as the input PDF, with .json extension.
Error Handling Strategy:
    • Minor Errors: These will be logged with a warning message and will not stop the execution of the script.
    • Major Errors: These will terminate the program after logging the error.
    • Examples of major errors:
      • No file found.
      • Text extraction failure.
      • Unexpected system errors.
Testing and Verification:
    • Unit Testing:
      • Use Python’s unittest framework to write tests for the following:
        • Extraction of statistical tests and results.
        • JSON formatting and structure.
        • Logging functionality.
        • Handling of missing data or incomplete tests.
    • Mock Data:
      • Unit tests should use mock data to simulate PDF content (statistical tests, sample sizes, p-values, etc.).
      • Focus on individual functions and processing logic before testing on real PDF files.
Future Considerations:

While the current version of the application focuses on basic extraction from PDFs and does not support extensibility, future versions may include:

    • OCR support for scanned PDFs.
    • Multi-file processing capabilities.
    • More advanced NLP techniques to improve data linkage and extraction accuracy.
Exit Codes:
    • 0: Successful execution.
    • 1: Errors encountered in file processing or extraction.

The application should return an appropriate exit code to indicate success or failure, useful for integration into automated workflows.

Dependencies:
    • PyMuPDF (fitz): PDF text extraction library.
    • regex: For parsing statistical data.
    • json: For formatting output as JSON.
    • unittest: For writing and running unit tests.

This document outlines the entire specification for the application, including functional requirements, design decisions, error handling strategies, and testing considerations. A developer can now begin implementing the application based on this comprehensive guide. Let me know if you need any changes or additional details!
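
Before digging into my review, it helps to visualize the output the spec is describing. The snippet below is my own illustration, not part of the model’s response: a minimal Python sketch of the kind of indented, hierarchical JSON document the application is expected to emit. The field names follow the spec’s “JSON Structure” section, and the sample values are invented purely for the example.

import json

# Hypothetical skeleton of the output described in the spec; the sample values
# are made up purely to show the shape of the document.
output = {
    "statistical_tests": [
        {"test": "t-test", "p_value": 0.03, "degrees_of_freedom": 28, "exact": True}
    ],
    "regression_analysis": [],
    "effect_sizes": [{"type": "Cohen's d", "value": 0.5}],
    "missing_data_handling": [],
    "outlier_handling": [],
    "test_assumptions": [],
    "summary": {"statistical_tests_found": 1, "effect_sizes_found": 1},
    "data_completeness_analysis": {"notes": ["no sample size reported"]},
}

# The spec calls for an indented, human-readable file.
print(json.dumps(output, indent=2))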

Specification Discussion

At this point, I reviewed the specification and suggested a number of changes and clarifications to the LLM. 

One of the risks of the back-and-forth questioning process is that it can create the illusion that the LLM truly “understands” the topic—like a human would. It does not. LLMs are powerful in some ways that surpass human capabilities, but their conceptual understanding is fundamentally different. You need to be very explicit in your instructions and ensure that all critical details are clearly spelled out. What seems obvious to you may not register with the model at all.

The Blueprint and Prompts

Once I had accepted the revised specification, I again followed Harper Reed’s approach by asking the system to generate a blueprint for implementation, as well as the specific prompts it would use to execute that blueprint.

This raises a natural question: is the blueprint step even necessary? Why not jump straight from specification to prompts?

Here’s the best rationale I can offer: LLMs treat previous outputs as part of their working context. This context functions like a short-term memory, shaping how they respond to new prompts. If key ideas from the spec aren’t explicitly restated or synthesized into a structured plan, the model may not fully incorporate them into its next steps. What seems like a logical progression to a human isn’t always obvious to an LLM.

Explicitly laying out each step and asking the model to generate a structured plan before jumping into code yields far better results. Think of it as scaffolding—each layer supports the next.

Because context builds up across your interactions, it’s also important to manage it intentionally. If you pivot from coding to something completely unrelated—say, choosing a recipe for dinner—you’re better off starting a new session. Otherwise, the model might start blending contexts, which rarely ends well. LLMs won’t automatically compartmentalize unless you tell them to.

Even with all these precautions, today’s LLMs are still imperfect. They’re far more consistent than earlier versions, but they can still forget prior points or even fabricate details entirely (“hallucinate”). The risks are lower, but they haven’t disappeared.

Blueprint and Prompt Output

I prompted the LLM as per Harper’s example:

You said:

Draft a detailed, step-by-step blueprint for building this project. Then, once you have a solid plan, break it down into small, iterative chunks that build on each other. Look at these chunks and then go another round to break it into small steps. review the results and make sure that the steps are small enough to be implemented safely, but big enough to move the project forward. Iterate until you feel that the steps are right sized for this project. From here you should have the foundation to provide a series of prompts for a code-generation LLM that will implement each step. Prioritize best practices, and incremental progress, ensuring no big jumps in complexity at any stage. Make sure that each prompt builds on the previous prompts, and ends with wiring things together. There should be no hanging or orphaned code that isn’t integrated into a previous step. Make sure and separate each prompt section. Use markdown. Each prompt should be tagged as text using code tags. The goal is to output prompts, but context, etc is important as well. <SPEC>

ChatGPT said:

Let’s begin with a detailed step-by-step blueprint for implementing this project. This blueprint will first outline the high-level steps, then break them down into smaller, iterative chunks, and finally refine those chunks into right-sized tasks for implementation. Once we have the steps finalized, we’ll generate a series of code-generation prompts for an LLM, ensuring each step builds on the previous work.

🔹 Step 1: High-Level Blueprint

This project involves extracting statistical data from research papers (PDFs) and structuring it into JSON. The high-level implementation plan is as follows:

Phase 1: Setup & Core Infrastructure

    • Set up the project structure, dependencies, and environment.
    • Implement a basic command-line interface (CLI) to accept a PDF path as input.
    • Extract raw text from a PDF using PyMuPDF (fitz).

Phase 2: Text Processing & Segmentation

    • Clean and preprocess extracted text (remove artifacts, segment sections).
    • Identify key sections (e.g., “Methods”, “Results”) to improve extraction accuracy.

Phase 3: Statistical Data Extraction

    • Implement regex-based parsing for common statistical values (p-values, confidence intervals, sample sizes, etc.).
    • Extract and structure statistical test results into JSON.
    • Implement linking logic to associate related statistical information (e.g., proximity-based matching).

Phase 4: Handling Edge Cases & Enhancements

    • Extract tables (if possible) and convert them into structured data.
    • Implement a completeness analysis to identify missing key data.
    • Add logging for debugging and monitoring.

Phase 5: Testing & Refinement

    • Implement unit tests using mock data.
    • Validate JSON output structure and correctness.
    • Refine error handling and exit codes.

Phase 6: Finalization & Optimization

    • Optimize performance (text processing and regex efficiency).
    • Final testing and cleanup.

🔹 Step 2: Breaking Down High-Level Steps Into Iterative Chunks

Now, we take these high-level tasks and break them down into small, incremental steps that build on each other in a logical sequence.

1️⃣ Setup & Core Infrastructure

    • 1.1 Initialize a new Python project with a virtual environment.
    • 1.2 Install dependencies (PyMuPDF, regex, json).
    • 1.3 Implement a command-line interface (CLI) to accept a single PDF file as input.
    • 1.4 Extract text from the PDF and print it to the console.

2️⃣ Text Processing & Segmentation

    • 2.1 Remove unwanted characters and artifacts from raw extracted text.
    • 2.2 Segment text into sections using heuristics (e.g., “Methods”, “Results”).

3️⃣ Statistical Data Extraction

    • 3.1 Identify patterns for extracting numerical statistical values (p-values, confidence intervals, etc.).
    • 3.2 Implement regex patterns to extract individual values.
    • 3.3 Extract full statistical test details, including test type, effect sizes, and parameters.
    • 3.4 Structure the extracted results into a hierarchical JSON format.

4️⃣ Handling Edge Cases & Enhancements

    • 4.1 Implement linking of statistical data that spans multiple sentences.
    • 4.2 Implement table extraction (if possible) and integrate it into JSON.
    • 4.3 Implement a completeness analysis to detect missing key information.
    • 4.4 Implement logging for debugging and error tracking.

5️⃣ Testing & Refinement

    • 5.1 Write unit tests for text extraction.
    • 5.2 Write tests for regex-based statistical value extraction.
    • 5.3 Write tests for JSON output format validation.
    • 5.4 Implement structured error handling and exit codes.

🔹 Step 3: Refining Into Code-Generation Prompts

Each step now needs to be transformed into a series of prompts that ensure progressive, modular development. These prompts will guide an LLM to generate clean, structured, and incremental Python code.

Prompt 1: Setup the Project

# Task: Initialize a Python project with a virtual environment and install dependencies.

Create a Python project with the following structure:

project_root/
│── extract_stats.py   # Main script
│── requirements.txt   # Dependencies
│── tests/             # Unit tests
└── logs/              # Log files

      1. Use Python 3.
      2. Create a virtual environment and install the necessary dependencies:
         – PyMuPDF (for PDF text extraction)
         – regex (for statistical value parsing)
         – json (for output formatting)
      3. Implement a basic CLI that accepts a PDF file path as input.

Write Python code to set up this structure and ensure the CLI accepts a file path.
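
Throughout this prompt walkthrough I’ll include small sketches of my own to show roughly what each prompt is asking for. They are illustrations under the spec’s assumptions, not the code the model eventually generated. For Prompt 1, the CLI skeleton might look something like this (argparse is my choice; the prompt does not mandate a particular argument parser):

# extract_stats.py, as a minimal CLI skeleton (my illustration, not generated output)
import argparse
import sys
from pathlib import Path


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Extract statistical data from a research paper PDF."
    )
    parser.add_argument("pdf_path", help="Path to a single PDF file")
    args = parser.parse_args()

    pdf_path = Path(args.pdf_path)
    if not pdf_path.is_file():
        print(f"Error: file not found: {pdf_path}", file=sys.stderr)
        return 1  # exit code 1 for major errors, per the spec

    # Later steps will extract text, parse statistics, and write JSON.
    print(f"Processing {pdf_path}")
    return 0


if __name__ == "__main__":
    sys.exit(main())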

Prompt 2: Extract Text from a PDF

# Task: Extract text from a given PDF file.

      1. Use PyMuPDF (`fitz`) to extract raw text from a PDF file.
      2. Print the extracted text to the console.
      3. Handle errors gracefully (e.g., file not found, invalid PDF).
      4. Ensure the function is modular, taking a file path as an argument.

Write a Python function to accomplish this.
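
A plausible answer to Prompt 2, assuming a recent PyMuPDF release where page.get_text() is available, is a small function along these lines:

import fitz  # PyMuPDF


def extract_text(pdf_path: str) -> str:
    """Return the concatenated text of every page in the PDF."""
    try:
        doc = fitz.open(pdf_path)
    except Exception as exc:  # missing file, corrupt PDF, etc.
        raise RuntimeError(f"Could not open {pdf_path}: {exc}") from exc

    pages = [page.get_text() for page in doc]
    doc.close()
    return "\n".join(pages)


if __name__ == "__main__":
    print(extract_text("example.pdf"))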

Prompt 3: Preprocess and Clean Extracted Text

# Task: Clean extracted text by removing unwanted artifacts.

      1. Remove extra spaces, line breaks, and non-text artifacts.
      2. Normalize encoding issues.
      3. Implement a function that takes raw text as input and returns cleaned text.

Write a Python function to preprocess and clean the extracted text.
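
The cleanup rules themselves are left open by the prompt, so the sketch below reflects my own guesses at sensible defaults (Unicode normalization, de-hyphenation, whitespace collapsing):

import re
import unicodedata


def clean_text(raw_text: str) -> str:
    """Tidy raw PDF text; the specific rules here are illustrative choices."""
    # Normalize Unicode so ligatures and odd encodings compare consistently.
    text = unicodedata.normalize("NFKC", raw_text)
    # Re-join words hyphenated across line breaks ("signifi-\ncant" -> "significant").
    text = re.sub(r"-\n(\w)", r"\1", text)
    # Collapse runs of whitespace, including stray newlines, into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()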

Prompt 4: Segment Text into Sections

# Task: Identify and segment key sections such as “Methods” and “Results”.

      1. Detect section headings heuristically (e.g., “METHODS”, “RESULTS”).
      2. Split the document into segments based on these headings.
      3. Return a dictionary where keys are section names and values are section text.

Write a Python function to accomplish this.
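
Here is one way the heading heuristic could be sketched. The heading list and the regex are assumptions on my part, and real papers will certainly need a richer set of rules:

import re

# Headings to look for; real papers vary widely, so this list is an assumption.
SECTION_HEADINGS = ["abstract", "introduction", "methods", "results", "discussion"]


def segment_sections(text: str) -> dict:
    """Split cleaned text into {section_name: section_text} using heading heuristics."""
    pattern = re.compile(r"\b(" + "|".join(SECTION_HEADINGS) + r")\b", re.IGNORECASE)
    sections = {}
    matches = list(pattern.finditer(text))
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[start:end].strip()
    return sections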

Prompt 5: Extract P-Values Using Regex

# Task: Extract p-values from segmented text using regular expressions.

      1. Identify common patterns for p-values (e.g., “p = 0.05”, “p < 0.01”).
      2. Use regex to find occurrences in the text.
      3. Return a structured dictionary containing extracted p-values.

Write a Python function to extract p-values using regex.
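
For Prompt 5, a first pass might look like the sketch below. I’m using the standard library’s re module here; the spec names the third-party regex package, which is largely API-compatible. The pattern covers only the most common formats and is an illustration, not an exhaustive parser:

import re

# Matches forms like "p = 0.05", "p < .01", "P=0.003" (illustrative, not exhaustive).
P_VALUE_PATTERN = re.compile(r"\bp\s*([<>=])\s*(\d?\.\d+)", re.IGNORECASE)


def extract_p_values(text: str) -> list:
    """Return each p-value found, noting whether it is exact or a threshold."""
    results = []
    for match in P_VALUE_PATTERN.finditer(text):
        operator, value = match.groups()
        results.append({
            "value": float(value),
            "exact": operator == "=",  # "p < 0.05" is a threshold, not an exact value
            "raw": match.group(0),
        })
    return results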

Prompt 6: Extract Other Statistical Values

# Task: Extract confidence intervals, sample sizes, and effect sizes.

      1. Define regex patterns for confidence intervals (e.g., “95% CI [1.2, 2.3]”).
      2. Define regex patterns for sample sizes (e.g., “N = 100”).
      3. Define regex patterns for effect sizes (e.g., “Cohen’s d = 0.5”).
      4. Return structured data for each extracted statistical component.

Write Python code to implement this.
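
Prompt 6 extends the same idea to other statistics. Again, the patterns below are my own simplified examples of the formats the prompt mentions:

import re

# Simplified patterns for the example formats given in the prompt.
CI_PATTERN = re.compile(r"(\d+)%\s*CI\s*\[\s*(-?\d*\.?\d+)\s*,\s*(-?\d*\.?\d+)\s*\]")
SAMPLE_SIZE_PATTERN = re.compile(r"\bN\s*=\s*(\d+)", re.IGNORECASE)
COHENS_D_PATTERN = re.compile(r"Cohen['’]s\s*d\s*=\s*(-?\d*\.?\d+)", re.IGNORECASE)


def extract_other_stats(text: str) -> dict:
    """Collect confidence intervals, sample sizes, and Cohen's d effect sizes."""
    return {
        "confidence_intervals": [
            {"level": int(level), "lower": float(low), "upper": float(high)}
            for level, low, high in CI_PATTERN.findall(text)
        ],
        "sample_sizes": [int(n) for n in SAMPLE_SIZE_PATTERN.findall(text)],
        "effect_sizes": [
            {"type": "Cohen's d", "value": float(d)}
            for d in COHENS_D_PATTERN.findall(text)
        ],
    }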

Prompt 7: Structure Extracted Data in JSON

# Task: Convert extracted statistical data into a structured JSON format.

      1. Define a JSON schema for extracted statistics.
      2. Populate the JSON structure with extracted values.
      3. Save the output to a file with a `.json` extension.

Write a Python function that takes extracted data and outputs structured JSON.
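
The JSON writing step is the simplest of the lot; a sketch might be little more than this (the derived-filename behavior follows the spec’s output requirement):

import json
from pathlib import Path


def write_json(extracted: dict, pdf_path: str) -> Path:
    """Write extracted data next to the input PDF, swapping the extension for .json."""
    out_path = Path(pdf_path).with_suffix(".json")
    with out_path.open("w", encoding="utf-8") as fh:
        json.dump(extracted, fh, indent=2)  # indented, human-readable per the spec
    return out_path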

Final Prompt: Wire Everything Together

# Task: Integrate all components into a functional pipeline.

      1. Accept a PDF file as input via CLI.
      2. Extract and clean text from the PDF.
      3. Segment text into sections.
      4. Extract statistical values (p-values, confidence intervals, sample sizes, etc.).
      5. Structure extracted data into JSON.
      6. Save the JSON file.
      7. Implement logging.

Write Python code that integrates all these steps into a single script.

This approach ensures a progressive build-up of functionality, making it easy to implement, test, and refine each component before integrating them into the final application. 🚀 Let me know if you’d like to refine anything further!
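
To make the final prompt concrete, here is my own rough sketch of the wiring it asks for. It reuses the hypothetical helpers from the sketches above, which I imagine collected into a module I’ve called stats_helpers; that name, like the helpers themselves, is an assumption for illustration only.

# Illustrative end-to-end wiring; stats_helpers is a hypothetical module holding
# the helper functions sketched after the individual prompts above.
import logging
import sys
from pathlib import Path

from stats_helpers import (
    extract_text, clean_text, segment_sections,
    extract_p_values, extract_other_stats, write_json,
)


def run(pdf_path: str) -> int:
    Path("logs").mkdir(exist_ok=True)
    logging.basicConfig(filename="logs/extract_stats.log", level=logging.INFO)

    try:
        raw = extract_text(pdf_path)                    # Prompt 2
    except RuntimeError as exc:
        logging.error("Fatal: %s", exc)
        return 1                                        # major error: log and terminate

    text = clean_text(raw)                              # Prompt 3
    sections = segment_sections(text)                   # Prompt 4
    results_text = sections.get("results", text)        # fall back to the full text

    extracted = {"p_values": extract_p_values(results_text)}   # Prompt 5
    extracted.update(extract_other_stats(results_text))        # Prompt 6
    out_path = write_json(extracted, pdf_path)                  # Prompt 7

    logging.info("Wrote %s", out_path)
    return 0                                            # success, per the spec's exit codes


if __name__ == "__main__":
    sys.exit(run(sys.argv[1]))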

Self Prompting Cycle

While I provided the initial synopsis and answered dozens of questions, I did not write the prompts that the LLM will use to generate the code; the model did that itself.

As discussed earlier, LLMs can incorporate their previous outputs as part of the context for new responses. This makes it possible, and surprisingly effective, to ask a model to generate instructions (i.e., prompts) for itself. While it may seem a bit circular, this self-prompting loop is a useful technique for producing better results.

This is related to a broader concept called Chain of Thought prompting, where the model is encouraged to reason through a problem in a structured, step-by-step way. Because LLMs are quite good at creating those structured sequences, it often makes sense to have the model go through multiple cycles of generation and self-consumption. One output becomes the input for the next stage. Prompt → blueprint → prompt → code.

This isn’t a gimmick—it’s a practical way to leverage the model’s strengths. Good prompting remains key to good output, but you don’t always have to craft the perfect prompt yourself. Let the model help. Ask it to create prompts, iterate on them, and refine the results until the output meets your needs.

It’s not cheating—it’s using the tool to the best of its ability.

Reviewing the Work

Whether the specifications, blueprints, and prompts were written by an LLM or by a human, reviewing them is just as time-consuming and just as essential. The review process is your chance to catch flaws before they’re baked into the code.

That said, the types of mistakes differ between humans and models.

  • Humans tend to maintain internal consistency. Once they understand a concept and record it, they rarely “forget” it later in the process and drop it from their documents.
  • LLMs, on the other hand, may drop previously included details without warning—even if they’re still within the context window.
  • Humans are also more likely to grasp the overarching goals of a project. If a solution doesn’t quite match the intended outcome, a human is more likely to flag it.
  • LLMs often follow instructions exactly—like a disengaged employee doing the minimum required, without pausing to consider whether the overall result actually makes sense.

While LLMs can sometimes surface surprisingly deep insights, this behavior isn’t reliable. I haven’t yet found a consistent way to prompt for “insightfulness.” You can make a model more creative or more random, but that’s not the same thing.

Human in The Loop

Given the value of self-prompting, you might wonder: why not have the model review its own work as well? In fact, this can help. You can even bring in a second LLM to perform the review, comparing outputs step by step.

But at the end of the day, you still need a human in the loop.

LLMs don’t reason in the way we do. They don’t truly “understand” and won’t reliably identify logical gaps, misalignments, or subtle inconsistencies. If a hallucination slips into the context early on, the model may carry it forward, treating it as fact without question. It can even “double down,” reinforcing its earlier mistake.

One of my favorite examples: ask a model to generate an image without elephants. The very act of mentioning elephants often results in—you guessed it—elephants in the image. Just by naming it, you’ve planted the seed. This issue is improving, but it’s not gone yet. 

End of Part 2

In the next installment, we’ll finally reach the code generation step. This is where we see how well ChatGPT and Claude interpret all the inputs so far, and how close they come to generating working, usable software.