Purpose of these notes
I routinely come across applications of large language models (LLMs) and want to stay grounded in what they are, how they should be used, and their limitations. Many kinds of artificial intelligence (AI) systems can appear similar on the surface but are built on very different underlying technologies. LLMs can seem almost magical in what they can do, yet that ability comes from generating statistically likely responses to prompts based on patterns learned from large corpora of human-produced text. In this sense, the process is largely mechanical: it can create an illusion of insight or creativity by producing new combinations and variations of what it learned during training, rather than demonstrating understanding or intent in the human sense. The following notes capture high-level concepts about LLMs for frequent review, and as a reminder that the process is fundamentally impersonal, even when its outputs can appear, at first glance, to reflect extraordinary insight or imagination.
Disclosure and attribution
Portions of this post were generated with the help of AI tools, including ChatGPT, as part of an initial draft. I then reviewed, edited, and reorganized that material into notes that are useful for my own reference. This post is shared publicly for convenience, and for anyone who may find the notes helpful. Any quoted or source-specific material remains the property of its respective owners, and this post is not intended to claim authorship or intellectual property ownership over AI-generated text or third-party content.
Large Language Models
How does a Large Language Model (LLM) work?
An LLM is a statistical model trained to predict and generate text. An LLM does not “understand” language in a human sense. Instead, it learns patterns in how words and symbols tend to appear together, and it uses those patterns to produce the next most likely piece of text given what came before.
At the foundation, text is converted into tokens. Tokens are small units such as words, word fragments, or punctuation marks. Each token is mapped to a numerical vector so it can be processed mathematically. The model never sees raw text directly, only these numbers.
Most modern LLMs are based on the transformer architecture. The core idea of a transformer is self-attention. For every token in a sequence, the model computes how strongly it should attend to every other token. This allows the model to weigh context, for example recognizing that the word “it” refers to a specific noun earlier in the sentence, even if many words intervene.
An LLM is trained on very large collections of text. During training, the model repeatedly tries to predict the next token in a sequence and compares its prediction to the actual token that followed in the training data. The difference between the prediction and the correct answer is converted into a numerical error, and the model’s internal parameters are adjusted to reduce that error. After many iterations, the model becomes very good at predicting what text is likely to come next in many contexts.
When you interact with an LLM, it generates responses one token at a time. Given your prompt, it calculates a probability distribution over possible next tokens and selects one based on that distribution, often with some randomness to avoid repetitive or overly rigid output. That chosen token is then added to the context, and the process repeats until the response is complete.
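This token-by-token loop can be sketched in a few lines of Python. This is a toy illustration, not a real model: the `model` argument is a stand-in for the trained network, assumed to return one raw score (logit) per vocabulary entry given the context so far.

```python
import math
import random

def softmax(logits):
    # Convert raw scores into a probability distribution that sums to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    # Draw one token index according to the distribution; this is the
    # randomness that keeps output from being rigidly deterministic.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

def generate(model, prompt_tokens, steps, rng):
    # model(tokens) stands in for the real network: it returns one logit
    # per vocabulary entry, given the whole context so far.
    tokens = list(prompt_tokens)
    for _ in range(steps):
        probs = softmax(model(tokens))
        tokens.append(sample(probs, rng))  # the chosen token joins the context
    return tokens
```

Note that each new token is appended to the context before the next prediction, which is why earlier output influences later output within a single response.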
It is important to note what LLMs do not do. They do not retrieve facts from a database in the way a search engine does unless explicitly connected to external tools. They do not reason symbolically like a traditional program, and they do not have beliefs, goals, or awareness. Their apparent reasoning emerges from learned statistical structure rather than explicit logic.
In practical terms, LLMs are powerful because language itself encodes a great deal of human knowledge and reasoning patterns. By learning those patterns at scale, the models can summarize, translate, explain, and generate text that often appears coherent and purposeful, even though it is produced entirely through probabilistic prediction.
The prompt, its tokens, and each token’s vector
Each token is mapped to a numerical vector. How does a token become a vector? What is the relationship between the original prompt, its tokens, and each token’s vector?
A token becomes a vector through a learned lookup and transformation process. The key idea is that the original prompt, its tokens, and the vectors are three different representations of the same information, moving from human-readable text to mathematical form.
The process starts with the original prompt as plain text. This text is passed through a tokenizer, which splits it into tokens according to fixed rules learned during model training. A single word may be one token, multiple tokens, or part of a token, depending on how common it is. For example, “engineering” might be one token, while a rare technical term might be split into several pieces. At this stage, the prompt is a sequence of token IDs, each ID being an integer that uniquely identifies a token in the model’s vocabulary.
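This splitting step can be illustrated with a toy tokenizer. The vocabulary below is made up for the example, and the greedy longest-match rule is a simplification of the byte-pair-encoding-style merges real tokenizers learn:

```python
# Made-up six-entry vocabulary mapping token strings to integer IDs;
# real vocabularies hold tens of thousands of learned entries.
VOCAB = {"engineer": 0, "ing": 1, " ": 2, "the": 3, "eng": 4, "ine": 5}

def tokenize(text, vocab):
    # Greedy longest-match: repeatedly take the longest vocabulary entry
    # that prefixes the remaining text, and record its integer ID.
    ids = []
    i = 0
    while i < len(text):
        match = None
        for tok in sorted(vocab, key=len, reverse=True):
            if text.startswith(tok, i):
                match = tok
                break
        if match is None:
            raise ValueError(f"no token covers text at position {i}")
        ids.append(vocab[match])
        i += len(match)
    return ids
```

With this vocabulary, "engineering" becomes two tokens ("engineer" + "ing"), showing how one word can map to multiple token IDs.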
Each token ID is then mapped to a vector using an embedding table. Conceptually, this table is just a large matrix learned during training. Each row corresponds to one token in the vocabulary, and each row contains a vector of fixed length, such as 768 or 4,096 numbers. A token’s vector is not computed from its characters at runtime; it is retrieved by a direct lookup. If token 12345 corresponds to “engine,” the model retrieves row 12345 from the embedding matrix and uses that vector.
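The lookup itself is trivial to sketch. Here the table is filled with random numbers standing in for trained parameters, and both the vocabulary size and vector length are far smaller than in a real model:

```python
import random

rng = random.Random(0)
VOCAB_SIZE, DIM = 6, 4  # tiny illustrative sizes; real models use e.g. 50,000 x 4,096

# The embedding table: one row (vector) per vocabulary entry. The random
# values here stand in for parameters learned during training.
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids, table):
    # Converting IDs to vectors is a direct row lookup, not a computation
    # over the token's characters.
    return [table[i] for i in token_ids]
```

The prompt, after tokenization, is just a list of integers, and `embed` turns that list into a list of vectors, one per token.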
These vectors are called embeddings. Their values have no direct human-interpretable meaning. Instead, their meaning comes from geometry. During training, the model adjusts the embedding vectors so that tokens used in similar contexts end up closer together in the vector space. For example, vectors for “aircraft” and “airplane” will tend to be closer to each other than to “banana,” because they appear in similar linguistic environments.
Position also matters. Transformers do not inherently know token order, so positional information is added to each token’s embedding. This is done by adding a positional encoding vector to the token’s embedding vector. The result is a combined vector that represents both what the token is and where it appears in the sequence.
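One common scheme, used in the original transformer paper, is sinusoidal positional encoding; many newer models use learned or rotary variants instead. A minimal sketch of the sinusoidal version:

```python
import math

def positional_encoding(position, dim):
    # Sinusoidal scheme: alternating sine and cosine values at wavelengths
    # that vary across the vector's dimensions, so each position gets a
    # distinct pattern. (Learned and rotary encodings are common alternatives.)
    vec = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_position(embedding, position):
    # The combined vector encodes both what the token is (embedding)
    # and where it appears (positional encoding).
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]
```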
At this point, the original prompt has become a sequence of vectors, one per token, each encoding token identity plus position. This sequence of vectors is what flows through the transformer layers. Self-attention and other operations transform these vectors repeatedly, mixing information across tokens and refining their representations based on context.
In summary, the relationship is hierarchical and deterministic. The prompt is split into tokens. Each token is converted to an integer ID. Each ID indexes a learned vector in an embedding table. That vector, augmented with positional information, is the mathematical representation the model uses to reason about and generate language.
Transformer model “attention”
For every token in a sequence, the model computes how strongly it should attend to every other token. What does the word “attend” mean in this context?
In this context, “attend” means “how much information from one token should influence another token’s representation.” It is not attention in the human sense of focus or awareness. It is a precise mathematical weighting.
Concretely, each token’s vector is linearly transformed into three different vectors called a query, a key, and a value. These are learned projections, not separate data sources. For a given token, the query represents “what this token is looking for,” while the keys represent “what each token offers,” and the values represent the information that can be taken from each token.
To compute how much one token attends to another, the model takes the dot product of the query vector of the first token with the key vector of the second token. This produces a score. A higher score means the two tokens are more relevant to each other for the current computation. All such scores are scaled and passed through a softmax function, which turns them into weights that sum to one.
These weights are the attention values. They answer the question: when updating the representation of this token, how much should I incorporate information from each other token? The model then computes a weighted sum of the value vectors using these attention weights. That weighted sum becomes the new representation of the token for the next layer.
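The whole computation, scaled dot-product attention for a single head, fits in a short sketch. The query, key, and value vectors are assumed to have already been produced by the learned projections described above:

```python
import math

def softmax(xs):
    # Turn raw scores into weights that sum to one.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    # For each token's query: score it against every key, normalize the
    # scores with softmax, then take the weighted sum of the value vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]  # scaled dot products
        weights = softmax(scores)                          # attention weights
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)                                  # updated representation
    return out
```

Because softmax never produces an exact zero, every token contributes at least a little to every other token's updated vector, which matches the point above that nothing is selected or ignored in a binary way.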
An important point is that this process is repeated independently for every token in the sequence. Each token has its own query, so each token produces its own pattern of attention over all other tokens. For example, a pronoun token may attend strongly to an earlier noun, while a verb token may attend more to its subject and object.
“Attend” therefore means assigning influence through learned weights. It is a mechanism for routing information across the sequence. Nothing is selected or ignored in a binary way; every token contributes something, but some tokens contribute far more than others.
In short, to “attend” is to mathematically weight how much one token’s vector contributes to another token’s updated vector during processing.
Training an LLM
How is an LLM trained?
At a high level, training an LLM is a process of teaching a neural network to predict the next token in text, then refining it until those predictions become consistently accurate across many contexts.
The process begins with a large corpus of text drawn from many sources. This text is cleaned and standardized, then converted into tokens using a fixed tokenizer. Each training example is a sequence of tokens, where the model is shown the first part of the sequence and asked to predict the next token.
The model makes its prediction by passing token embeddings through multiple transformer layers. At the output, it produces a probability distribution over the entire vocabulary, representing how likely each possible next token is. The correct next token from the training data is known, so the model’s prediction can be compared directly against it.
This comparison produces a numerical error, often called a loss. If the model assigns low probability to the correct token, the loss is high; if it assigns high probability, the loss is low. Training aims to minimize this loss across all examples.
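For next-token prediction this loss is cross-entropy, which for a single example reduces to the negative log of the probability the model assigned to the correct token:

```python
import math

def cross_entropy(probs, correct_id):
    # Negative log probability of the true next token: assigning high
    # probability to the correct token gives low loss, and vice versa.
    return -math.log(probs[correct_id])
```

For instance, assigning probability 0.9 to the correct token yields a much lower loss than assigning it 0.1, which is exactly the asymmetry training exploits.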
To reduce the loss, the model uses backpropagation. This is a mathematical procedure that computes how much each internal parameter contributed to the error. The parameters, including token embeddings and attention weights, are then adjusted slightly in the direction that would have made the correct prediction more likely. This adjustment is done using gradient-based optimization and is repeated millions or billions of times.
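The update step itself is simple to sketch; the hard part, computing each parameter's gradient via backpropagation, is assumed as an input here. The toy loop below applies the update to a one-parameter loss so the convergence is visible:

```python
def sgd_step(params, grads, lr=0.1):
    # Nudge each parameter a small step opposite its gradient, i.e. in the
    # direction that reduces the loss. Real training repeats this for
    # billions of parameters, with gradients supplied by backpropagation.
    return [p - lr * g for p, g in zip(params, grads)]

# Toy illustration: minimize the loss (p - 3)^2, whose gradient is 2 * (p - 3).
p = [0.0]
for _ in range(100):
    p = sgd_step(p, [2 * (p[0] - 3.0)], lr=0.1)
# After repeated small nudges, p approaches the loss-minimizing value 3.
```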
Over time, the model internalizes statistical regularities of language. It learns grammar, common phrasing, factual associations, and even patterns that resemble reasoning, not because these were explicitly programmed, but because they help reduce prediction error across diverse text.
After this base training, models are often further refined. Additional stages may include fine-tuning on curated datasets, instruction-following data, or feedback from human evaluators. These steps do not change the core mechanism but shape the model’s behavior to be more useful, safer, and better aligned with user expectations.
In summary, training works by repeatedly asking the model to guess what comes next in text, measuring how wrong it is, and nudging its parameters so that future guesses are better. Everything the model appears to “know” emerges from this iterative prediction-and-correction process applied at very large scale.
