A jargon-free explanation of how AI large language models work
Want to really understand large language models? Here's a gentle primer.
Timothy B. Lee and Sean Trott – Jul 31, 2023 11:00 am UTC
When ChatGPT was introduced last fall, it sent shockwaves through the technology industry and the larger world. Machine learning researchers had been experimenting with large language models (LLMs) for a few years by that point, but the general public had not been paying close attention and didn't realize how powerful they had become.
Today, almost everyone has heard about LLMs, and tens of millions of people have tried them out. But not very many people understand how they work.
If you know anything about this subject, you've probably heard that LLMs are trained to predict the next word and that they require huge amounts of text to do this. But that tends to be where the explanation stops. The details of how they predict the next word are often treated as a deep mystery.
One reason for this is the unusual way these systems were developed. Conventional software is created by human programmers, who give computers explicit, step-by-step instructions. By contrast, ChatGPT is built on a neural network that was trained using billions of words of ordinary language.
As a result, no one on Earth fully understands the inner workings of LLMs. Researchers are working to gain a better understanding, but this is a slow process that will take years, perhaps decades, to complete.
Still, there's a lot that experts do understand about how these systems work. The goal of this article is to make a lot of this knowledge accessible to a broad audience. We'll aim to explain what's known about the inner workings of these models without resorting to technical jargon or advanced math.
We'll start by explaining word vectors, the surprising way language models represent and reason about language. Then we'll dive deep into the transformer, the basic building block for systems like ChatGPT. Finally, we'll explain how these models are trained and explore why good performance requires such phenomenally large quantities of data.
Word vectors
To understand how language models work, you first need to understand how they represent words. Humans represent English words with a sequence of letters, like C-A-T for “cat.” Language models use a long list of numbers called a “word vector.” For example, here's one way to represent “cat” as a vector:
[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, …, 0.0002]
(The full vector is 300 numbers long. To see it all, click here and then click “show the raw vector.”)
Why use such a baroque notation? Here's an analogy. Washington, DC, is located at 38.9 degrees north and 77 degrees west. We can represent this using a vector notation:
Washington, DC, is at [38.9, 77]
New York is at [40.7, 74]
London is at [51.5, 0.1]
Paris is at [48.9, -2.4]
This is useful for reasoning about spatial relationships. You can tell New York is close to Washington, DC, because 38.9 is close to 40.7 and 77 is close to 74. By the same token, Paris is close to London. But Paris is far from Washington, DC.
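For readers who like to see the arithmetic, here is a minimal Python sketch of the same idea. It uses the four coordinate pairs given above and treats each one as a point on a flat grid, which is a simplification (it ignores the curvature of the Earth) but is enough to show how "closeness" falls out of the numbers. The helper names distance and nearest are our own, chosen for illustration.

```python
import math

# Coordinates from the text, written as [degrees north, degrees west].
cities = {
    "Washington, DC": [38.9, 77.0],
    "New York": [40.7, 74.0],
    "London": [51.5, 0.1],
    "Paris": [48.9, -2.4],
}


def distance(a, b):
    """Straight-line (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(name):
    """Return the other city whose vector is closest to the named city's."""
    others = (city for city in cities if city != name)
    return min(others, key=lambda city: distance(cities[name], cities[city]))


for name in cities:
    print(f"{name} is closest to {nearest(name)}")
```

Running this pairs Washington, DC, with New York and London with Paris, matching the reasoning above. The same distance function works unchanged on vectors with more than two numbers, which is roughly why a list-of-numbers representation is convenient for comparing things.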