Chapter 12

Generative Image Models

Generative image models are a family of algorithms and artificial neural network architectures that specialize in generating images from natural-language input.

Imagine a sculptor who has spent his life observing people deep in thought. For years, he walks around the town and studies every person he sees thinking deeply, carefully noting their posture, expressions, and subtle details. Over time, he internalizes what it means to look like someone thinking. He might have seen hundreds of thousands of people pass through the town over the years. We blindfold the sculptor, give him a random block of marble, and ask, “Make me a sculpture of a person thinking.” The sculptor can’t add new material to the marble; instead, he feels the block and chips away a small piece that he’s confident does not look like a person thinking. He repeats this thousands of times, each time feeling the edges of the marble and chipping away a little more “noise”. Slowly and methodically, a coherent image of a person thinking emerges from within the random block of stone.

In the same way, a generative image model, trained on millions of images, begins with random noise and iteratively transforms it into a clear image, not by adding to a blank canvas, but by learning how to remove what doesn’t belong.

The most popular approach to modern image generation is diffusion.

Let’s assume that we want to generate an image of a tree, and we start with a random, unclear image.

What do you see? This is the starting point for the model: a canvas of pure, random noise. To our eyes, it’s a meaningless, blurry blob. To the model, it is a field of potential, containing every possible image in a faint, ghostly form. At this stage, the model’s job is to take its first look and decide which parts of this noise are the least likely to belong in an image of a tree.

After several steps of “denoising”, a faint structure begins to emerge (figure 12.4). What do you see now? Perhaps a head of broccoli? An explosion? A tree? It’s still very ambiguous, but a general shape is taking form. The model has peeled away the most obvious layers of noise, revealing a low-resolution silhouette. It’s beginning to commit to a general structure, but the details are still fluid. Can you identify which parts of the image need to be denoised to see the tree better?

After many more refinement steps, the image becomes much clearer (figure 12.5). It’s almost certainly a tree! The main trunk and the leafy canopy are well defined. The model has now locked in the high-level concept. Its task is no longer to figure out the overall structure, but to refine the details: carving out the smaller branches and adding texture to the leaves. Can you identify the areas that need to be focused on to reveal more details about the tree?

Finally, after hundreds of steps, we arrive at the final, crisp image (figure 12.6). We can clearly see the tree, complete with detailed bark, individual leaf clusters, and a coherent structure. The model has successfully removed all the noise that was inconsistent with its goal. In a sense, the tree was hidden within the initial random noise. The journey from a noisy blob to a sharp final image is the essence of the diffusion process. This concept likely seems counterintuitive. When we think of image generation, we might think that it’s about drawing and painting, because this is how humans create images. Diffusion is the inverse: it starts with noise and iteratively removes the noise that shouldn’t be there.
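The walk from noise to tree can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions, not the chapter's implementation: the names `make_alpha_bars`, `oracle_noise_predictor`, and `denoise` are hypothetical, the linear beta schedule is an assumed choice, and the noise predictor is an "oracle" that is handed the target image. A real diffusion model would replace the oracle with a trained network that has never seen the final image. The update used here is a deterministic (DDIM-style) step: estimate the noise, estimate the clean image, then re-noise that estimate to the slightly lower noise level of the previous step.

```python
import numpy as np


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal that survives after each noising step.

    Assumes a linear beta (noise) schedule; other schedules are possible.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def oracle_noise_predictor(x_t, alpha_bar, target):
    """Stand-in for a trained network (assumption: it is given the target).

    Solves x_t = sqrt(alpha_bar) * target + sqrt(1 - alpha_bar) * noise for noise.
    """
    return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)


def denoise(x_start, alpha_bars, target):
    """Walk from the noisiest step back to step 0, removing noise each time."""
    x = x_start
    errors = []
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0  # no noise left after step 0
        eps_hat = oracle_noise_predictor(x, a_t, target)
        # Estimate of the clean image implied by the predicted noise.
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        # Re-noise the estimate to the (lower) noise level of the previous step.
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
        errors.append(float(np.mean((x - target) ** 2)))
    return x, errors


# A crude 8x8 "tree": a bright canopy on top of a dimmer trunk.
target = np.zeros((8, 8))
target[1:5, 2:6] = 1.0  # canopy
target[5:8, 3:5] = 0.5  # trunk

rng = np.random.default_rng(0)
noise = rng.standard_normal(target.shape)  # the starting "block of marble"
alpha_bars = make_alpha_bars(num_steps=200)
image, errors = denoise(noise, alpha_bars, target)
```

Here `errors` records the mean squared distance to the target after each step, so inspecting it shows the image getting closer to the tree as noise is removed. Because the oracle is exact, the loop ends on the target itself; with a learned predictor the result would only approximate an image consistent with the training data.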

The U-Net is the architecture behind many modern diffusion models, and it’s the one that we will build in this chapter. The U-Net is a special type of convolutional neural network (CNN) that is well suited to image-to-image tasks. Its key innovation is a “U-shaped” design with four core components:

Encoder (Down-sample path): Think of this path as a summarizer. It progressively shrinks the image using convolutional layers, forcing the network to move beyond individual pixels and capture the high-level context and generalize better. As the image gets smaller, the network gets better at understanding what is in the image, like “this is a face”, but loses the precise information about where the fine details are located. This process creates a rich but low-resolution summary of the image’s abstract meaning.
Decoder (Up-sample path): Think of this path as an artist tasked with reconstructing the full-resolution image from the abstract summary created by the down-sample path. It progressively up-samples the feature maps using transposed convolutions, taking the high-level concepts, like “a face”, and attempting to add the fine details back in. By itself, it would struggle to perfectly position every edge and texture because much of the precise spatial information was lost during down-sampling.
Skip Connections: This is the U-Net’s superpower and the solution to the up-sample path’s problem. A skip connection is a shortcut. It takes the high-resolution feature map from an early stage of the down-sample path, which is rich in fine spatial details, and feeds it directly to the corresponding layer in the up-sample path. This allows the decoder to combine the abstract “what” information from the deep layers with the precise “where” information from the early layers, making it exceptionally good at reconstructing images with high fidelity.
Bridge (also known as the Bottleneck): This is the lowest point in the U-shaped path, connecting the end of the down-sample path to the start of the up-sample path. It processes the most compressed, abstract representation of the image. In a diffusion model, this is the critical stage where the timestep and text embeddings are injected, combining the model’s understanding of the image with the guidance of the text label before reconstruction begins.

The unique structure of the U-Net makes it the perfect tool for our diffusion model. It can look at a noisy image, understand the high-level context, and use the fine-grained skip connections to precisely predict the noise in every pixel.
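To make the U shape concrete, the following NumPy sketch traces how feature-map shapes change through an encoder, a bridge, and a decoder with skip connections. It is a shape-level illustration under several assumptions, not a trainable U-Net: the function names are hypothetical, random 1x1 channel mixing stands in for learned convolutions, average pooling stands in for strided down-sampling, nearest-neighbor repetition stands in for transposed convolutions, and the timestep and text embeddings are omitted entirely.

```python
import numpy as np


def downsample(x):
    """Halve height and width with 2x2 average pooling; x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample(x):
    """Double height and width by repeating each pixel (nearest neighbor)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def mix_channels(x, out_channels, rng, activate=True):
    """Stand-in for a learned convolution: a random 1x1 channel mix, then ReLU."""
    weights = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    y = np.einsum("oc,chw->ohw", weights, x)
    return np.maximum(y, 0.0) if activate else y


def tiny_unet_forward(image, rng):
    """Run one forward pass and record the shape at each stage of the U."""
    shapes = {}

    # Encoder: more channels ("what"), less resolution ("where").
    e1 = mix_channels(image, 8, rng)
    e2 = mix_channels(downsample(e1), 16, rng)
    shapes["encoder_1"], shapes["encoder_2"] = e1.shape, e2.shape

    # Bridge: the most compressed representation, at the bottom of the U.
    bridge = mix_channels(downsample(e2), 32, rng)
    shapes["bridge"] = bridge.shape

    # Decoder: up-sample, then concatenate the matching encoder feature map
    # along the channel axis (the skip connection) before mixing.
    d2 = mix_channels(np.concatenate([upsample(bridge), e2], axis=0), 16, rng)
    d1 = mix_channels(np.concatenate([upsample(d2), e1], axis=0), 8, rng)
    shapes["decoder_2"], shapes["decoder_1"] = d2.shape, d1.shape

    # Project back to the input's channel count: one predicted value per pixel.
    out = mix_channels(d1, image.shape[0], rng, activate=False)
    shapes["output"] = out.shape
    return out, shapes


rng = np.random.default_rng(0)
noisy_image = rng.standard_normal((3, 16, 16))  # (channels, height, width)
predicted_noise, shapes = tiny_unet_forward(noisy_image, rng)
```

The output has the same shape as the input, which is what lets the network predict a noise value for every pixel. The concatenation in the decoder is the skip connection: each decoder stage sees both the up-sampled abstract features and the higher-resolution encoder features from the same level of the U.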

Learn the details behind U-Nets and Diffusion models in Grokking AI Algorithms, 2nd Edition.

Large Language Models (LLMs)