For parts of this blog post, I have somewhat of a special guest author: OpenAI’s Generative Pre-trained Transformer 3, known as GPT-3. GPT-3 is a pre-trained language model that produces very good results for a number of scenarios involving the generation of human-like text. OpenAI has opened up the API for this product for testing and basic applications, so I’ve used it to generate parts of this post.
Firstly, the GPT-3 system is pre-trained on hundreds of billions of words in English, French, and many programming languages, so no additional training is required to perform a number of out-of-the-box operations. For instance, it can summarize text, answer simple questions, translate English to French, generate text and code based on descriptions, and function as a chatbot (or just a “bot”), responding to input in a life-like fashion.
However, it is currently not specialized, and the API does not provide a way to extensively train the system in a particular domain. By this, I mean that you cannot significantly extend the knowledge base, and the results generated by the system may be more general than you might desire.
More specifically, if you try to enter a large amount of data on a topic, the system will respond with the error message: “The current token limit for the model is 2048 (approximately 1000 words). Please reduce the size of your prompt and try again.” Put another way, if the knowledge built into GPT-3 is sufficient for your needs, you can get some fascinating results; but if it does not already have good knowledge of the area you are interested in, your results may be somewhat disappointing.
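As a rough illustration of that limit, here is a minimal sketch of checking a prompt before submitting it. The word-to-token ratio is an assumption: real GPT-3 tokens come from a byte-pair-encoding tokenizer, so word count only approximates token count.

```python
# Rough guard against the 2048-token prompt limit described above.
# WORDS_PER_TOKEN is a rule-of-thumb assumption, not the real tokenizer.

MAX_TOKENS = 2048
WORDS_PER_TOKEN = 0.75  # rough approximation (assumption)

def approx_token_count(text: str) -> int:
    """Estimate token usage from the word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_prompt(text: str) -> bool:
    """True if the text is likely to fit under the model's token limit."""
    return approx_token_count(text) <= MAX_TOKENS

short_text = "word " * 500    # ~500 words, roughly 667 tokens
long_text = "word " * 3000    # ~3000 words, roughly 4000 tokens
print(fits_in_prompt(short_text), fits_in_prompt(long_text))  # True False
```

A real application would need the model's own tokenizer for an exact count; this sketch only captures the order of magnitude.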
The deep learning neural network behind GPT-3 is trained to perform a sophisticated sort of mapping. It does a fantastic job of learning correlations between words and phrases, but it does not have the context and insight that humans bring. This means that in some situations a network like GPT-3 is more capable than a human, but in other cases it will make (to a human) obvious errors and generate nonsensical output.
Testing the System
In the next couple of sections, I show some actual examples of tests I performed with GPT-3. The basic usage of the system is simple: You choose an application category and provide some sample text in a fairly deterministic format. GPT-3 then responds with text fragments (not necessarily full sentences) that the system determines are matching responses to your input. However, if your input is poorly formatted or worded, GPT-3 may not pick up the cues necessary to generate a reasonable reply. I discuss this in more depth after the following examples.
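Under the hood, that interaction boils down to a single API call. Here is a sketch using the shape of the OpenAI Python client of that era; the engine name and parameter values are my assumptions, and since the real call needs an API key and network access, this only assembles the request:

```python
# Sketch of a GPT-3 completion request. With a valid API key, this dict
# would be passed to openai.Completion.create(**params); here we only
# build and inspect it.

def completion_params(prompt: str, stop: str = "\n") -> dict:
    """Assemble the parameters for a completion request (values are assumptions)."""
    return {
        "engine": "davinci",   # assumed: the largest GPT-3 engine at the time
        "prompt": prompt,
        "max_tokens": 64,      # cap the length of the generated reply
        "temperature": 0.0,    # low randomness, for repeatable Q&A answers
        "stop": [stop],        # stop generating at the end of a line
    }

params = completion_params("Q: Who was president in 1955?\nA:")
```

The `stop` sequence is what keeps the model from rambling on past the answer, inventing further Q/A pairs on its own.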
Category: Question and Answer
Note that the input text I used follows the format suggested by OpenAI for the category, and is shown in bold. The system responses are shown in red italics. Comments on the answers, which were not entered as part of the test, are shown in regular text. Note that some output has been reformatted to look better here, but no text was changed.
Also note how, for the Question and Answer functionality, I actually provide a description of the system and both an initial question and answer. I did this to prompt it to respond in a similar fashion. This is how the GPT-3 creators suggest that their system should be primed.
I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with “Unknown”.
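The priming above can be expressed programmatically. Here is a minimal sketch of assembling such a prompt; the `build_prompt` helper and the trailing “A:” cue are my own framing, while the primer text and Q/A layout follow OpenAI’s suggested format:

```python
# The primer: a description of the bot plus one example Q/A pair,
# exactly as entered in the test above.
PRIMER = (
    "I am a highly intelligent question answering bot. If you ask me a "
    "question that is rooted in truth, I will give you the answer. If you "
    "ask me a question that is nonsense, trickery, or has no clear answer, "
    "I will respond with \"Unknown\".\n\n"
    "Q: What is human life expectancy in the United States?\n"
    "A: Human life expectancy in the United States is 78 years.\n\n"
)

def build_prompt(question: str) -> str:
    """Append a new question, ending with 'A:' so the model completes the answer."""
    return f"{PRIMER}Q: {question}\nA:"

prompt = build_prompt("Who was president of the United States in 1955?")
```

The model never “knows” it is answering questions; it simply continues the pattern the primer establishes, which is why the formatting of the primer matters so much.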
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was the president of the United States in 1955.
Q: Who was president of the United States in 1655?
A: John Pierce was president of the United States in 1655.
A fascinating example of a strange answer to a silly question.
Q: Who was president of the United States in 1841?
A: Martin Van Buren was the eighth President of the United States. He served from March 4, 1837 to March 3, 1841.
Actually, William Henry Harrison took office on March 4th of 1841, but died only 31 days later. He was succeeded by his Vice President, John Tyler. So three people were president in 1841.
I admit that, while I knew Harrison did not spend much time in office, I did not remember the exact year, dates, and names of all the presidents for this question. You could argue that GPT-3 answered it as well or better than most humans would. And yet a common expectation is for AI technology like this to have full and complete information for any picayune piece of data. This example shows that this is simply not the case.
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: What is human life expectancy on Earth?
A: Human life expectancy on Earth is 78 years.
Hmmmm, Google says that the life expectancy on Earth is closer to 73. I tend to believe Google in this case.
Q: What is the human life expectancy on Mars?
A: Human life expectancy on Mars is unknown.
Q: What is human life expectancy on Mars?
error: “Completion may contain sensitive content”
The removal of a single word, “the”, in this last example led to the system determining that the answer might be controversial. The folks at OpenAI are trying to avoid a repeat of the scenario where biased inputs to a bot might result in inappropriate outputs, as found with the Microsoft “Tay” bot experience. While this is a good idea in general, I’m not sure how the determination was made in this situation.
Category: TL;DR Summarization
Now let me showcase another capability: GPT-3’s ability to summarize paragraphs. Like the rest of its functionality, this is based on a kind of sophisticated pattern-matching algorithm implemented by the neural network and one hell of a lot of training. For these tests, I borrowed some paragraphs from some of my previous posts. The results are good—see for yourself!
From “Hiring Great Engineers”, paragraph 2:
Every hiring manager has certain areas they look at when appraising engineering candidates. For software engineers, I have, over time, separated my evaluation criteria into three general categories: technical skills, cultural fit, and a passion for engineering. Technical skills is the most obvious category, and often the easiest to assess, as these tend to be concrete (hard skills) that you can verify. Cultural fit is harder to define and can be much more difficult to gauge. The goal should be to find great teammates, not necessarily clones of the existing folks. My third criterion, the one I find to be the differentiator between good and great engineers at any level, is a real passion for engineering.
tl;dr: Passion for engineering is the differentiator between good and great engineers.
From “Faster Teams”, paragraph 5:
To effectively leverage additional staffing requires clear goals, strong communication, and more effort by management. You need to know what the additional folks are expected to accomplish (their goals), and how the work can effectively be shared among them. More people always means more communication, both spoken and written, to ensure that everyone stays in sync. All engineers, whether permanent staff or project-specific contractors, require clear guidance, support, timely feedback, and a high level of organization.
tl;dr: You can’t just throw more people at a project and expect it to be done faster.
The Entire “Faster Teams” Post:
The current token limit for the model is 2048 (approximately 1000 words). Please reduce the size of your prompt and try again.
This demonstrates that the current system has limits, at least for external users such as myself, as to how much input can be realistically handled.
The Final Section of “Faster Teams” Post:
Despite the fact that I have repeatedly been asked about making software organizations “faster,” I believe that the deeper question is how to deliver the right functionality in a timely manner. There are many possible ways to answer this question. Resist the pressure to spread your efforts (and your team’s efforts) across too many areas at once. As Steve Jobs and others have said: “Deciding what not to do is as important as deciding what to do.”
Do not attempt to hire your way to better delivery until you have assessed and improved the current situation, or you may further compound existing problems. Start by working with your engineering team to improve the relevant people, processes, tools, and techniques. But don’t stop there. Work closely with your PM team so that the guidance they offer is clear, accurate, and well targeted. Ensure that your company (or organization) has a workable process for goal setting, alignment with company goals, and prioritization of engineering efforts.
Of all the possible areas to which you can devote effort, you need to figure out which are most important for your situation, not which appear to be the most urgent. (This is known as the Eisenhower Principle.)
Many executives ask about making software organizations “faster,” but this question often masks the real problem, which is reliably delivering the right functionality. Even if you investigate the engineering organization’s challenges and then spend the time to address the most important of these, you still may not make the engineers any “faster,” but you will solve some of the real problems.
tl;dr: The question of how to make software organizations “faster” often masks the real problem, which is reliably delivering the right functionality.
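All of these summarization runs used the same trivial prompt shape: the source text followed by a “tl;dr:” cue, which the model then completes with a summary. A minimal sketch (the helper name is mine):

```python
def build_tldr_prompt(text: str) -> str:
    """Append the 'tl;dr:' cue so the model completes with a summary."""
    return f"{text.strip()}\n\ntl;dr:"

# Usage: pass any paragraph; the model's completion after "tl;dr:"
# becomes the summary shown in the examples above.
paragraph = "Every hiring manager has certain areas they look at ..."
tldr_prompt = build_tldr_prompt(paragraph)
```

It is striking that no task-specific wiring is involved: summarization, Q&A, and translation all reduce to choosing the right textual cue for the model to continue.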
A Powerful System
I’ve provided real input/output examples, including some successes and failures, from two application categories. Currently, OpenAI provides a total of forty categories, including, but not limited to: grammar correction, Tweet classification (sentiment analysis), automatic spreadsheet and code generation from text descriptions, story creation, ESRB rating generator, restaurant reviews, interview question generator, and first-to-third person conversion of text. The technology is powerful, flexible, and has a wide range of applications. Search the internet for GPT-3 to see far more examples than the ones I created for my explorations.
Is it Conscious?
Does this mean that GPT-3 is close to becoming a thinking, conscious, intelligent system? Popular science articles describing technologies like GPT-3 tend to sensationalize them to provoke a reaction and attract more readers. This leaves folks with the impression that GPT-3 could develop into an omnipotent genius, able to comprehend any text and converse intelligently about it, and eventually develop agency and take over the world, as conceived of in a technological singularity. For a movie analogy, consider (spoiler alert) Skynet, the computer that becomes self-aware, grows in intelligence, and attempts to destroy the world in the Terminator series of movies.
But GPT-3 does not and cannot think the way a human thinks. It is not quietly contemplating or plotting anything. It does not have motivations and goals. It has no agency. It does not train itself and seek to independently gain intelligence. All of the context and intelligence it possesses have been provided by the vast amount of training it has undergone, and the small samples you have provided it with.
It does not recognize silly, or even outright wrong, inputs for what they are. And it can be inconsistent.
I’ve included some examples to show that the technology behind GPT-3 can produce excellent results in some scenarios, and others to demonstrate that it really lacks context or insight, and does not actually comprehend the text in the same way a human might.
The Turing Test
In some scenarios GPT-3 can produce text that is indistinguishable from how a human would respond. So even if it is not truly conscious or thinking, could GPT-3 simulate humanness? Indeed, this is the criterion for the Turing Test, devised by Alan Turing in 1950: Can a human determine whether they are conversing with another human or with a computer? The original Blade Runner movie alluded to this test (though not by name) in a number of scenes, including the replicant testing; this scene on YouTube may be familiar to you.
GPT-3 would not pass a full Turing Test at this time, as demonstrated by some of the errors and nonsensical answers that are generated in various scenarios. But the quality of some of its replies shows that it is getting closer.
Quite a few folks have considered this same topic.
So What Does This All Mean?
Undoubtedly, future versions of GPT-3 (GPT-4?) and other similar technology products will continue to get smarter, faster, and generate better results. Many of the limitations I demonstrated could be overcome by specific training. For instance, the algorithm could be tweaked to recognize nonsense, or given more context in particular areas, such as life expectancy or US presidents. But fundamentally, this is a highly sophisticated pattern-mapping algorithm, likely to retain some of the same sorts of flaws that I demonstrated, if not identical ones.
This technology is a powerful tool, no doubt about it, but it is not magic and not alive. Based upon the wide range of application categories that GPT-3 is currently being tested with, it is clear that it and similar technologies will find many practical and creative uses in the near future. You should be impressed by the technology, but not worried by it. If your job consists of creating tl;dr sections for paragraphs, you will most likely need to look for a different line of work. In fact, if that is your job, you should already be looking.