Chapter 11

Large Language Models (LLMs)

Large language models (LLMs) are machine learning models specialized for natural language processing tasks such as language generation. Consider the autocomplete feature on your mobile device’s keyboard (figure 11.1). When you start typing “Hey, what are…”, the keyboard likely predicts that the next word is “you”, “we”, or “the”, because these are the most common words to follow that phrase. It makes this choice by scanning a table of probabilities built from commonly available text; this simple table is a language model.

Suppose we have the following short sentences from the book Alice’s Adventures in Wonderland.

  1. Alice was beginning to get very tired
  2. Alice was getting very sleepy
  3. The white rabbit was late
  4. The white rabbit was very late
  5. Alice and the rabbit were friends

From simply reading these sentences, you might already recognize some patterns: “Alice” and “The white rabbit” are always the nouns, “was” usually follows “Alice” and “The white rabbit”, and the word “very” appears often (figure 11.3). Spotting these patterns comes naturally to us because our brains are pattern-matching machines.

A simple concept in language models is the bigram. Bigrams are two-word sequences, for example “Alice was” or “white rabbit”. If we treat the first word as the context and the second word as a possible prediction, we can build a table of probabilities from the sentences we have available to train on. Probabilities are calculated using the formula in figure 11.4: the number of times a word (Next) occurs after a word (Previous), divided by the total occurrences of the word (Previous). This Previous and Next pairing is a bigram. Think of this like your phone’s autocomplete. If you type “Alice”, the model looks at its training data to see what word usually comes next. If “was” appears 90 times after “Alice” and “ran” appears only 10 times, the model assigns a 90% probability to “was” and suggests it as the next word.
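
To make the formula in figure 11.4 concrete, here is a minimal Python sketch (not the book’s code; the sentence list and names such as bigram_counts and bigram_probs are illustrative) that counts each Next word following each Previous word and divides by the total occurrences of the Previous word:

from collections import defaultdict

# The five training sentences, written in lowercase so "Alice" and "alice" count as the same token.
sentences = [
    "alice was beginning to get very tired",
    "alice was getting very sleepy",
    "the white rabbit was late",
    "the white rabbit was very late",
    "alice and the rabbit were friends",
]

# Count how often each Next word follows each Previous word.
bigram_counts = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    words = sentence.split()
    for previous, nxt in zip(words, words[1:]):
        bigram_counts[previous][nxt] += 1

# P(Next | Previous) = count(Previous, Next) / total occurrences of Previous as a first word.
bigram_probs = {
    previous: {nxt: count / sum(nexts.values()) for nxt, count in nexts.items()}
    for previous, nexts in bigram_counts.items()
}

print(bigram_probs["alice"])   # {'was': 0.666..., 'and': 0.333...}
print(bigram_probs["the"])     # {'white': 0.666..., 'rabbit': 0.333...}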

Word         Next Word    Occurrence Count    Probability
alice        was          2                   0.67
alice        and          1                   0.33
was          beginning    1                   0.25
was          getting      1                   0.25
was          late         1                   0.25
was          very         1                   0.25
beginning    to           1                   1.00
to           get          1                   1.00
get          very         1                   1.00
getting      very         1                   1.00
very         tired        1                   0.33
very         sleepy       1                   0.33
very         late         1                   0.33
the          white        2                   0.67
the          rabbit       1                   0.33
white        rabbit       2                   1.00
rabbit       was          2                   0.67
rabbit       were         1                   0.33
and          the          1                   1.00
were         friends      1                   1.00

Now, using this table of probabilities, we can make predictions. Given the word “alice”, the model has two options for the next word: “was” with a probability of 0.67, and “and” with a probability of 0.33. The next word predicted is therefore “was”, with a probability of 67%, giving the result “alice was” (figure 11.5). Given the word “the”, the options are “white” with a probability of 0.67 and “rabbit” with a probability of 0.33, so the completion is “the white” with a probability of 67%. For the next word, “white” has only one option, “rabbit”, with a probability of 100%, progressing to the sentence “the white rabbit”.
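
Continuing the earlier sketch, a single prediction step is just a dictionary lookup followed by picking the option with the highest probability. The hypothetical predict_next helper below assumes the bigram_probs dictionary built above:

def predict_next(previous, bigram_probs):
    """Return the most probable next word after `previous`, or None if the word was never seen."""
    options = bigram_probs.get(previous)
    if not options:
        return None
    return max(options, key=options.get)

print(predict_next("alice", bigram_probs))   # 'was'    (probability 0.67)
print(predict_next("the", bigram_probs))     # 'white'  (probability 0.67)
print(predict_next("white", bigram_probs))   # 'rabbit' (probability 1.0)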

You can see how this tiny example of just 5 sentences and 29 “tokens” can start producing a semblance of the intelligent behavior expected from a language model.

Play with the toy language model below to see how it learns the most likely next word based on the previous word.
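
A minimal sketch of such a toy model, reusing the hypothetical bigram_probs table and predict_next helper from the earlier snippets, simply chains predictions word by word:

def generate(start_word, bigram_probs, max_words=10):
    """Greedily extend a phrase by repeatedly predicting the most likely next word."""
    words = [start_word]
    while len(words) < max_words:
        next_word = predict_next(words[-1], bigram_probs)
        if next_word is None:   # no bigram starts with this word, so stop
            break
        words.append(next_word)
    return " ".join(words)

print(generate("the", bigram_probs, max_words=3))     # 'the white rabbit'
print(generate("alice", bigram_probs, max_words=4))   # e.g. 'alice was beginning to' (ties broken by insertion order)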
