Aaaaand… finally: after Parts 1-4 we are ready to implement and test the pixelSNAIL model. In the previous posts we went in detail through the theory and concepts behind pixelCNN-style models, which are designed to generate unique new images after being trained on a dataset.

In this post I will go through some runs of the pixelSNAIL model, using a Keras implementation I built to port the original code from the largely deprecated TensorFlow 1 to TensorFlow 2.

You can find this repository here.


Initial Exploration with pixelSNAIL

The first task was to get pixelSNAIL working; the original code is very difficult to read and is written in TensorFlow 1 style (I think TensorFlow 1 tends towards difficult-to-read code). I got it working with version 1.15.2 of TensorFlow and also developed a streamlined version of train.py in Jupyter notebook format which works on a single GPU (it's all I have, and I found the multi-GPU code was a distraction).

You can find that notebook here (and I humbly recommend using it rather than the original train.py, due to several difficulties in acquiring the required dependencies).

I got a few results by training for a few hundred epochs with some changes to the overall network size: the number of GRBs per pixel block and the number of pixel blocks (so that the total number of GRBs is $N_{\rm grb}\cdot N_{\rm pb}$), the number of feature maps or “filters” for each of the internal layers, and whether or not the attention block is used. A hypothetical config sketch of these settings follows the table below.

|        | $N_{\rm grb}$ | $N_{\rm pb}$ | Attn. | Filters | Params | Batch | BPD  | Time / epoch   |
|--------|---------------|--------------|-------|---------|--------|-------|------|----------------|
| Small  | 1             | 1            | No    | 64      | 137k   | 128   | 3.44 | $\sim 0.3$ min |
| Medium | 2             | 1            | No    | 128     | 1M     | 64    | 3.26 | $\sim 1$ min   |
| Large  | 2             | 2            | Yes   | 128     | 2.5M   | 32    | 3.26 | $\sim 5$ min   |
| Paper  | 4             | 12           | Yes   | 256     | ?      | ?     | 2.85 | N/A            |

A couple of test runs of the original pixelSNAIL implementation with smaller models, compared with the paper. This was for CIFAR-10 (60k images).
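
As promised, here is a sketch of those settings as plain Python dictionaries. The names (`n_grb`, `n_pb` and so on) are my own shorthand for this post, not identifiers from the original repository:

```python
# Hypothetical config dictionaries mirroring the runs in the table above.
configs = {
    "small":  dict(n_grb=1, n_pb=1,  attention=False, filters=64,  batch=128),
    "medium": dict(n_grb=2, n_pb=1,  attention=False, filters=128, batch=64),
    "large":  dict(n_grb=2, n_pb=2,  attention=True,  filters=128, batch=32),
    "paper":  dict(n_grb=4, n_pb=12, attention=True,  filters=256, batch=None),  # batch unknown
}

def total_grbs(cfg):
    # Total gated residual blocks: N_grb per pixel block times N_pb pixel blocks.
    return cfg["n_grb"] * cfg["n_pb"]
```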

This code will plot samples from the model every so often during training (sampling is costly, so you don't want to do this too frequently) and I've shown the progress through the model's training below. Each image contains 121 samples, taken after every 10 epochs of training, so a total of 130 epochs by the end.
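
To see why sampling is so costly: every sub-pixel of every image requires its own forward pass through the network. Below is a minimal sketch of that loop, assuming a simplified (hypothetical) model whose head outputs softmax logits over 256 intensity levels; the real pixelSNAIL head is a discretized logistic mixture, but the loop structure is the same.

```python
import numpy as np
import tensorflow as tf

def sample_images(model, n=121, h=32, w=32, c=3):
    """Draw n samples pixel by pixel: one model call per sub-pixel,
    i.e. h*w*c = 3072 forward passes for CIFAR-sized images, which is
    why sampling is much slower than a training step."""
    imgs = np.zeros((n, h, w, c), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            for k in range(c):
                # Assumed output shape: (n, h, w, c, 256) logits.
                logits = model(imgs, training=False)
                pix = tf.random.categorical(logits[:, i, j, k, :], 1)
                # Write back, rescaled to the [0, 1] input range.
                imgs[:, i, j, k] = tf.squeeze(pix, -1).numpy() / 255.0
    return imgs
```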

Not bad for such a tiny model (137k parameters); it clearly captures colours and shapes. What about the larger model?

Clearly a lot better; the images after around 70 epochs (by which point the model has converged well) are looking pretty good.

The training curves below look a little odd, with the test loss significantly lower than the train loss (this is possibly due to the Polyak averaging picking up the very first epoch, which tends to be junk).
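
For reference, Polyak averaging keeps an exponential moving average (EMA) of the model weights and evaluates with the averaged copy, which is why a badly seeded average can throw the test curve off. A minimal sketch is below; the decay value is illustrative, not taken from the original code.

```python
import tensorflow as tf

class PolyakAverage:
    """Minimal EMA of a Keras model's weights, sketching the Polyak
    averaging the original code applies at evaluation time."""
    def __init__(self, model, decay=0.9995):
        self.decay = decay
        # Shadow copies seeded from the current weights; if these are the
        # junk first-epoch weights, the average lags for a long time.
        self.shadow = [tf.Variable(w, trainable=False) for w in model.weights]

    def update(self, model):
        for s, w in zip(self.shadow, model.weights):
            s.assign(self.decay * s + (1.0 - self.decay) * w)

    def copy_to(self, model):
        for s, w in zip(self.shadow, model.weights):
            w.assign(s)
```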

I trained the larger (2.5M parameter) model on the gemstone dataset as well, as this may provide an easier task (there is more commonality between the images in the gemstone set). Indeed, the BPD settles to a lower value and the samples from the trained model look fairly impressive, although none of the straight edges seen in cut gemstones are replicated; most are just lumps. But the lighting and shading added by the model give the images a realistic feel.

(All these images are generated samples.) The training curves look very odd, but eventually converge.

I tried to increase the model size once again but ran into several problems with reliability; the code produces NaN values quite often. So it's time to move on from the original code; it's just too tied to TensorFlow 1.
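
If you hit similar NaNs in a TF2 port, two cheap, generic safeguards are gradient-norm clipping and stopping the moment the loss goes NaN. A sketch under those assumptions is below; this is not a fix present in the original repository, and `nll_loss` is a placeholder for whatever loss function you use.

```python
import tensorflow as tf

def compile_and_fit(model, train_ds, nll_loss, epochs=100):
    """Generic TF2 sketch: clip gradient norms and terminate training
    as soon as the loss becomes NaN, rather than wasting epochs."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),
        loss=nll_loss,
    )
    model.fit(train_ds, epochs=epochs,
              callbacks=[tf.keras.callbacks.TerminateOnNaN()])
```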

Keras Version

Converting from TF1 to TF2 (Keras) requires rewriting all the pieces of the network as Keras layers (blocks such as the GRB can also be implemented as regular functions).
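
As an illustration, a GRB written as a regular function over Keras layers might look like the sketch below. This is deliberately simplified: the causal masking, auxiliary conditioning and weight normalization of the real pixelSNAIL block are all omitted, so treat it as the shape of the idea rather than the block from my repository.

```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_residual_block(x, filters):
    """Sketch of a gated residual block; x is assumed to already
    have `filters` channels so the residual add is valid."""
    h = layers.ELU()(x)
    h = layers.Conv2D(filters, 3, padding="same")(h)
    h = layers.ELU()(h)
    h = layers.Conv2D(2 * filters, 3, padding="same")(h)
    a, b = tf.split(h, 2, axis=-1)          # split into signal and gate
    return layers.Add()([x, a * tf.sigmoid(b)])
```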

You can find that code here.

I tried to keep the architecture identical (not counting any mistakes), but switched from a manual implementation of weight normalization to the TensorFlow Addons version. In the end the only piece that's missing is the Polyak averaging. I ran another small/medium/large set, shown in the table below (I trained the large model for longer, which is why it does better than the original TF1 version in the table above).
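
As an aside, the weight-normalization swap is essentially a one-liner: `tfa.layers.WeightNormalization` is a drop-in wrapper around any Keras layer. For example:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# The Addons wrapper replaces the hand-rolled weight normalization of
# the TF1 code; the filter count here is just the "large" model's 128.
conv = tfa.layers.WeightNormalization(
    tf.keras.layers.Conv2D(128, 3, padding="same"),
    data_init=True,  # data-dependent initialisation, as in the WN paper
)
```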

|        | $N_{\rm grb}$ | $N_{\rm pb}$ | Attn. | Filters | Params | Batch | BPD  | Time / epoch   |
|--------|---------------|--------------|-------|---------|--------|-------|------|----------------|
| Small  | 1             | 1            | No    | 64      | 149k   | 128   | 3.40 | $\sim 0.3$ min |
| Medium | 2             | 1            | No    | 128     | 1M     | 64    | 3.28 | $\sim 1.5$ min |
| Large  | 2             | 2            | Yes   | 128     | 4.5M   | 32    | 3.18 | $\sim 5$ min   |
| Paper  | 4             | 12           | Yes   | 256     | ?      | ?     | 2.85 | N/A            |

A couple of test runs of my Keras pixelSNAIL implementation with smaller models, compared with the paper. This was for CIFAR-10 (60k images).

The training curves look much more normal, with training and test curves close together. This, and the extra noise in the curves, are likely due to the lack of Polyak averaging.

On the gemstone dataset the results of a model with 2 pixel blocks, 4 GRBs per pixel block, 128 filters and the attention mechanism look really good after around 500 epochs, when the BPD dips below 3.
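
For reference, the BPD figures quoted here and in the tables are just the negative log-likelihood converted from nats per image into bits per dimension. A tiny helper makes the conversion explicit:

```python
import numpy as np

def bits_per_dim(nll_nats_per_image, h=32, w=32, c=3):
    """Convert an NLL in nats per image into bits per dimension:
    divide by the number of sub-pixels (h*w*c) and by ln(2)."""
    return nll_nats_per_image / (h * w * c * np.log(2))
```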

This is a good, if hand-wavy, way to validate the code; nonetheless it is very difficult to match up like-for-like results exactly.

I am going to leave it here; with the code available, you can run your own tests. I will probably post more on my investigations into these architectures at a later date, but I am keen to move on to variational autoencoders for the next in-depth series.

Conclusions

This is the end of the five-part series on autoregressive generative models, focusing on the pixelSNAIL architecture. I hope you have found at least some of it useful, and I will hopefully be adding extra topics soon.

I used this topic to bridge my understanding from detection/segmentation algorithms into generative models, so I will be developing another series of articles on Variational Autoencoders (VAEs) next, eventually working my way towards Generative Adversarial Networks (GANs).