Hello and welcome to this series of tutorials on autoregressive models! I decided to put this together to help me (and hopefully you) learn about these types of model, which have cutting-edge applications in image generation.

I want to go into depth here because there are some core ideas I think are worth exploring and understanding, particularly as I find the papers are very thin on detail.

I have split this up into chunks:

Part 1 : General introduction to generative autoregressive models and particularly the pixelCNN family of models.

Part 2 : Theory (Probability); an exploration/explanation of the key ideas that make these models work.

Part 3 : Theory (Causality); an exploration/explanation of the key ideas that make these models work.

Part 4 : Model architecture; I will show how these models are built by piecing together some simpler blocks.

Part 5 : Implementation and Results; I will have a go at implementing the pixelSNAIL architecture and producing some results.

What is an autoregressive, generative model?

The purpose of generative models is to create new images in line with a set of training examples. The generated samples should look like the training set but also be unique: for example, new faces or new designs for objects…

Some example 32×32 images generated for the original pixelSNAIL paper

The samples above were generated after training on the cifar-10 dataset. I will use this dataset as well, along with the gemstone data from Kaggle.

To generate a new sample image, the image is built pixel by pixel, with each new pixel's value determined probabilistically based upon the values of all past pixels. Here we see this process in action as one of my networks generates a “new” gemstone.
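As an illustration, here is a minimal numpy sketch of that sampling loop. The `dummy_model` is a hypothetical stand-in for a trained network; a real pixelCNN would condition its logits on the pixels generated so far, whereas this one just returns fixed random logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def dummy_model(image):
    # Stand-in for a trained pixelCNN: returns logits over the 256
    # possible intensity values for every pixel position. A real
    # network would compute these from the pixels filled in so far.
    h, w = image.shape
    return rng.normal(size=(h, w, 256))

def generate(h=4, w=4):
    """Build an image pixel by pixel, sampling each value from the
    distribution predicted given all previously generated pixels."""
    image = np.zeros((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            logits = dummy_model(image)[i, j]       # logits for pixel (i, j)
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                    # softmax over 256 values
            image[i, j] = rng.choice(256, p=probs)  # sample the pixel value
    return image

new_image = generate()
print(new_image.shape)  # (4, 4)
```

Note the cost of this scheme: the network is re-evaluated once per pixel, which is why sampling from autoregressive models is slow compared to GANs.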

We can also take a partially obscured image and let the model infer what should be in the blank area.

We can take a partial image (middle) and let the model generate the rest of it. Comparing it to the real image (left) the model has performed very well on this task.

This kind of sequential output, built element by element, is the territory of Recurrent Neural Networks (RNNs), which have seen incredible results on 1D sequential data (language or audio processing) among other things. They can effectively use ideas such as long-term memory and attention to generate realistic prose and music.

However, Convolutional Neural Networks (CNNs) are the natural choice for many computer vision problems, as they are built on an effective use of convolutions, which preserve translational invariance and allow scaling to very deep networks that can learn abstract features of objects in images at many scales.

It seems both of these fields are legitimate starting points for image generation, and indeed many generative models lie at some sort of intersection, using ideas from both to create leading results.

In this lengthy tutorial, I will explore the pixelCNN family of algorithms (probabilistic autoregressive generative models based on CNNs, with ideas borrowed from successful RNNs) from the ground up. I will focus mostly on the pixelCNN++ and pixelSNAIL architectures, which have largely superseded the older ones.

These models are:

  • generative : This is a type of unsupervised learning where the aim is to learn about a training set of data, and then generate new examples.
  • probabilistic : They generate images by explicitly modelling a probability distribution and then sampling from it.
  • autoregressive : The model learns the probability distribution of the training data explicitly (other models, such as variational autoencoders, may only indirectly involve the distribution). The model outputs logits (values spanning the full real number range), hence the term regressive.
A sketch of how an autoregressive generative model works when used to generate new images based on a set of training data.
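Formally, this sketch corresponds to the chain rule of probability: the joint distribution over all n pixels (in raster order) factorises into a product of per-pixel conditionals,

p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})

and each conditional factor is exactly what the network is asked to model. This factorisation will be explored properly in post 2.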

The major competing family of generative models is Generative Adversarial Networks (GANs), which approach the problem in a fundamentally different way.

For many years, GANs led the way in terms of results, but recently probabilistic generative models have come back to the fore (for example, VQ-VAE 2.0). There are advantages and disadvantages to both; I aim to cover these in future work.

Brief History

Based on some moderate-to-light internet research, I have detailed below the timeline of the pixelCNN family. The first use came in the pixelRNN paper in 2016, in which pixelCNN was one of several models tested.

This was followed up five months later with a paper dedicated to pixelCNN with various improvements (called Gated PixelCNN). In early 2017, pixelCNN++ was produced, which essentially replaced the previous iterations with substantial improvements.

pixelSNAIL, in late 2017, incorporated attention mechanisms into the architecture and made several other changes to the structure, as we will discuss. Some of its authors also created pixelCNN++, so there are many similarities, and the two effectively share the same codebase.

We are discussing here the use of these architectures for image generation, where they are out-competed in performance by GANs and VAEs, so it may seem they are condemned to history. However, although they may not be used in practice for image generation, they have other applications; for example, they offer a good route to modelling priors in models like VQ-VAE. More on this another time.

There are other models, neglected in this tutorial, which achieve competitive results, such as Transformer and Block Sparse PixelCNN++. Below are links to all the papers.

How to Compare the Quality of the Models?

The metric of primary interest in image generation is bits per dimension (BPD), which is the average negative log-likelihood (NLL) over the dimensions of an image, also divided by log(2). You will find an implementation of this within the code for this tutorial. To understand this quantity in depth, skip ahead to post 2 in the series.

The lower this number, the better; on the standard comparison dataset (cifar-10), pixelSNAIL achieves the leading score for this family of models, 2.85 BPD. Because the model is based directly on the probability distribution, the NLL (equivalently the BPD) measures the quality of generated samples, which is a very nice feature of these autoregressive models.

We will now discuss aspects of the pixelCNN family of models in detail. We need to cover the architecture (how the CNN is built to achieve its goals), including the idea of causality; the loss function that is optimised (this is where the probabilistic nature of the model enters); and the methodology for sampling new images.

Causality

One of the fundamental properties of a generative model is causality: the requirement that the prediction for a single element of a sequence (for example, a pixel of an image, a word of a sentence or a note of a piece of music) depends only on previously generated elements, never on future elements.

An example of causality being broken by a 1D convolution applied to a 1D sequence.

The key to making these networks work is that causality must be preserved throughout the network. This can be enforced in a number of ways, but each will have to modify the kernels and/or the input spatial feature maps of each layer. I have dedicated a section of these tutorials (post 3) to a discussion of this topic.

The easiest way (in my opinion) is to modify the padding of the layers so that the filters cannot see into the future; this also requires resizing the kernels, at least in the first layer, and maintaining two streams through the network. Alternatively, the kernel can be masked; even in this case the shifted padding is still necessary, but a single stream can be used.
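To make the masking idea concrete, here is a small numpy sketch of the kind of spatial kernel mask used in masked convolutions. The 'A'/'B' naming follows the pixelCNN papers; note this is a simplified spatial-only version (the papers additionally mask across colour channels), so treat it as an illustration rather than a drop-in implementation.

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Mask for a k x k convolution kernel so that the filter can only
    see pixels above, and to the left of, the current position (raster
    order). Type 'A' (first layer) also hides the centre pixel itself;
    type 'B' (later layers) allows it, since by then the centre of the
    feature map no longer contains the raw target pixel."""
    mask = np.zeros((k, k))
    mask[: k // 2, :] = 1.0          # all rows above the centre row
    mask[k // 2, : k // 2] = 1.0     # pixels to the left of the centre
    if mask_type == "B":
        mask[k // 2, k // 2] = 1.0   # centre pixel visible in type B
    return mask

print(causal_mask(3, "A"))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

Multiplying a kernel elementwise by this mask before every convolution guarantees that no output position ever depends on the pixel it is predicting or on any future pixel.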

Probabilistic Loss Function

The key to using pixelCNN-type models to generate new samples lies in the loss: the loss contains all the ingredients that define the probabilistic nature of the model. The network itself just outputs a tensor of a given shape; the loss function then chops that tensor up and defines the interpretation of each piece.

The interpretation of the logit outputs of the pixelSNAIL network; to be explained in detail in post 2.

Every probability distribution has an associated set of parameters; a Gaussian, for example, has a mean and a variance. These parameters are what we ask the network to learn.

Probabilistic models always have some random element. The network has no randomness in its outputs (beyond the initialisation of its weights), so it can never be used alone to produce new samples: it would be deterministic, predicting the same image every time for a given input, and choosing a random input would almost never yield a good image.

So, accompanying the loss is another function, the “sampling function”, which takes the network outputs and uses them as the parameters of a probability density function from which we sample.

Making sure the loss function and sampling function are consistent can be a headache and easily leads to bugs where good samples are never generated. I will provide some working code that can be used in any TensorFlow project.
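As a sketch of what a consistent loss/sampler pair looks like, here is a simplified numpy version using a single continuous logistic distribution per value (pixelCNN++ and pixelSNAIL actually use a discretised mixture of logistics, which post 2 covers). The point is that the loss scores data under exactly the same distribution that the sampling function draws from; the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def nll_loss(x, mu, log_scale):
    """Negative log-likelihood of x under a logistic distribution
    whose parameters (mu, log_scale) are the network outputs.
    log pdf = -z - log(s) - 2*log(1 + exp(-z)),  z = (x - mu)/s."""
    z = (x - mu) / np.exp(log_scale)
    return z + log_scale + 2.0 * np.log1p(np.exp(-z))

def sampling_function(mu, log_scale):
    """Draw from the SAME logistic distribution the loss assumes,
    via inverse-CDF sampling: x = mu + s * logit(u), u ~ Uniform."""
    u = rng.uniform(1e-5, 1.0 - 1e-5, size=np.shape(mu))
    return mu + np.exp(log_scale) * (np.log(u) - np.log1p(-u))

# Train against nll_loss, then generate with sampling_function:
xs = sampling_function(mu=0.0, log_scale=np.log(0.5))
```

If the two functions interpret the output tensor differently (for example, disagreeing on which slice holds the scale), training can still converge while sampling produces garbage, which is exactly the class of bug mentioned above.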

The loss function is in units of bits per dimension, described above; the conversion from NLL to BPD is:

\begin{aligned} {\rm Negative\ log\text{-}likelihood} \ &: \ - \mathcal{L}(p(x)) \\ {\rm Bits\ per\ dimension} \ &: \ \frac{- \mathcal{L}(p(x))}{\log(2) \cdot B \cdot H \cdot W \cdot C} \end{aligned} \tag{1}

The log(2) factor converts from nats (natural logarithm) to bits, and the rest is the total size of one batch of training data (Batch, Height, Width and Channels). Good values vary between datasets, but are around 2.0-3.0 on, for example, the cifar-10 set.
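A minimal sketch of this conversion, matching equation (1) term by term (the function name and the example batch numbers are just illustrative):

```python
import numpy as np

def bits_per_dim(total_nll_nats, batch, height, width, channels):
    """Convert a summed negative log-likelihood (in nats) over one
    batch into bits per dimension: divide by log(2) to go from nats
    to bits, then by the number of dimensions B*H*W*C to average."""
    n_dims = batch * height * width * channels
    return total_nll_nats / (np.log(2.0) * n_dims)

# e.g. a batch of 16 cifar-10 images (32 x 32 x 3):
bpd = bits_per_dim(total_nll_nats=100000.0,
                   batch=16, height=32, width=32, channels=3)
print(round(bpd, 2))  # 2.94, in the typical cifar-10 range
```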

The pixelSNAIL Architecture

We will cover the architecture of the pixelSNAIL model once we have gone through the theory behind causality and the loss function. The architecture itself varies considerably between implementations in code, so we approach it by considering the fundamental building blocks and then how they can be combined.

The pixelSNAIL architecture for two variants that can be found on GitHub, this will be explained in detail in post 4.

Other architectures such as pixelCNN++ apply the same principles of causality and probability, but differ in the specific blocks used in the network.

Conclusions

This is the first post in a series covering the pixelCNN family of models; hopefully it has served as a good introduction to what they are, what they do and why we should care.

Of course, it may all make a bit more sense if you come back to it after reading through the whole series carefully.