xNeurals · Aug 30, 2021

Introduction to GANs part 2 — an example

In my last post, I gave a brief theoretical introduction to GANs in their original form [1]. Before moving on to more advanced topics, it is useful to create a simple program to run our first GAN. For all my programming, I use Python with PyTorch. The architecture I am using for this example is a 4-layer MLP for both the generator and the discriminator. Both networks use rectified linear unit activations [3] in all layers except the output layers, which use sigmoid activations to map the outputs back to the interval [0, 1]. The generator also uses batch normalization [2] in its hidden layers. The code for this implementation is included at the bottom of this post.
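
A minimal PyTorch sketch of this architecture is shown below. The class names, the z_dim noise dimension, and the n_dim width multiplier mirror the hyperparameters discussed later in this post, but the exact layer sizes and defaults are assumptions rather than the original implementation.

```python
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=8, n_dim=16, img_dim=28 * 28):
        super().__init__()
        h = n_dim * 64  # hidden width scales with n_dim
        self.net = nn.Sequential(
            nn.Linear(z_dim, h), nn.BatchNorm1d(h), nn.ReLU(),
            nn.Linear(h, h), nn.BatchNorm1d(h), nn.ReLU(),
            nn.Linear(h, h), nn.BatchNorm1d(h), nn.ReLU(),
            nn.Linear(h, img_dim), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, z):
        return self.net(z)


class Discriminator(nn.Module):
    def __init__(self, n_dim=16, img_dim=28 * 28):
        super().__init__()
        h = n_dim * 64
        self.net = nn.Sequential(
            nn.Linear(img_dim, h), nn.ReLU(),
            nn.Linear(h, h), nn.ReLU(),
            nn.Linear(h, h), nn.ReLU(),
            nn.Linear(h, 1), nn.Sigmoid(),  # probability the input is real
        )

    def forward(self, x):
        return self.net(x)
```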

While the network I am using is not very sophisticated, there is still much we can learn from it. For instance, we can quickly observe what happens when we change the hyperparameters of this network. With the network as given in the implementation below, we get the following outputs after 1, 10, and 25 epochs of training.

We see that after only 10 training epochs, the generator is able to produce decent outputs. Moreover, these samples are not hand-picked; they are 100 random samples generated by the network. At 10 epochs of training there are definitely some generated samples which are not recognizable by a human as a number, but by 25 epochs almost all digits produced by the generator are recognizable.

With this as a baseline, we can conduct experiments by altering the hyperparameters of the system. The hyperparameters I will discuss are the learning rates of the generator and the discriminator, as well as the capacity of each network. The capacity can be manipulated through the n_dim parameter, and the learning rate of each optimizer can be changed directly. For reference, the n_dim parameter only sets the scale of the generator and discriminator capacities: each network's hidden layers contain roughly (n_dim*64)² weights apiece. If we are not careful, the network size can grow very quickly, and the network can become prone to memorizing the dataset, as discussed in greater depth in [4]. We are especially prone to memorization here because the MNIST dataset only contains 60k images.
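
To make the quadratic growth concrete, here is a quick back-of-the-envelope calculation. The n_dim*64 hidden width follows the description above; the counts are approximate since they ignore the input layer, the output layer, and biases.

```python
# Approximate weights in one hidden-to-hidden layer as n_dim grows
# (ignores the input/output layers and bias terms).
for n_dim in (4, 8, 16, 32, 64):
    h = n_dim * 64
    print(f"n_dim={n_dim:3d}  hidden width={h:5d}  weights per hidden layer={h * h:,}")
```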

When we vary these hyperparameters, we can observe the effects each of them has at different numbers of training epochs.

Varying the Dimension of the Prior Noise Vector, z

For instance, when we vary the dimension of the noise vector we feed into the generator, we observe some initial improvement when going from 4 to 8 noise dimensions, but further doubling does not seem to have a strong visual impact on the quality of the samples. This makes some sense, because with only 4 noise dimensions we are limited in the amount of information we can encode. For instance, if each latent dimension encoded a binary value, then the most we could represent with 4 latent dimensions would be 2⁴ = 16 distinct codes (assuming some independence between dimensions). Under this assumption, we could hypothesize that 4 latent dimensions is not enough to encode the qualities of the MNIST data set: we need at least 10 distinct categories, one per digit, which leaves almost no room to capture the variance that occurs in natural handwriting. We can almost observe this in the example images. The generator struggles to generate distinct characters from an extremely low dimensional latent space, probably because there are not enough free dimensions to encode the required minimum amount of information. As a result, we observe blurring between characters that look similar, such as 7’s and 9’s.

However, when we double the size of the latent space, the output of the generator is much clearer. I believe this is because the latent space is now large enough to encode the semantic qualities of the MNIST data set. This is also why I believe we do not see much improvement from increasing the latent dimension further. It is generally believed that image data sets lie on low dimensional manifolds within the full image space, and under this view we only need a low dimensional prior to capture the semantics of the MNIST data set. This seems very reasonable since INFOGAN was able to get state of the art generator performance from a latent space with only 12 dimensions (10 discrete dimensions and two continuous dimensions, plus some incompressible noise) [4].
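
A minimal sketch of how this noise-dimension sweep can be run, reusing the Generator class sketched earlier; the particular z_dim and n_dim values are assumptions matching the figures rather than the exact experimental settings.

```python
import torch

# Sweep the noise dimension while holding capacity fixed. Each generator
# would be trained before drawing the 100 samples shown in the figures.
for z_dim in (4, 8, 16, 32):
    G = Generator(z_dim=z_dim, n_dim=16)
    # ... train G against a fresh discriminator here ...
    z = torch.randn(100, z_dim)         # 100 random draws from the prior
    samples = G(z).view(-1, 1, 28, 28)  # reshape to MNIST-sized images
```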

Varying Capacity of the Generator and the Discriminator

As we increase the capacity of the generator and discriminator, we can see that the quality of the samples drastically improves. On one hand, it is possible that the network memorizes the data set when the capacity grows extremely large (n_dim > 32), but with larger networks we generate decent quality samples with very little training (10 epochs) and good sample diversity. In addition to the effects of network capacity, we can see the effect of imbalances between the generator and the discriminator.

In the figure below, as we traverse down the diagonal, we see a high amount of sample diversity with increasing quality. There is even good sample diversity when the capacity is low, such as when n_dim = 4. However, when we move to the lower diagonal, where the generator has about four times the capacity of the discriminator (a 2² difference in weights per layer), we observe that the generator suffers from mode collapse. The generator quickly learns to produce a few specific samples that reflect only part of the data set. This must be a result of the imbalance in capacities, since the latent dimension is held at 128 in this experiment, so the latent space is not the source of the bottleneck in feature representation. Furthermore, low capacity networks do not produce the same level of mode dropping when the two networks are balanced. Finally, if we traverse the upper diagonal, the discriminator is likely able to better separate the data manifold from the generated sample manifold, so the generator does not receive enough useful gradient information to produce higher quality samples.
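
A sketch of how this capacity grid can be set up, reusing the classes from earlier; the specific n_dim values are assumptions, but the latent dimension is held at 128 as described above.

```python
# Grid over generator/discriminator capacities; off-diagonal cells create
# the imbalance discussed above. Latent dimension fixed at 128.
for g_dim in (4, 8, 16, 32):
    for d_dim in (4, 8, 16, 32):
        G = Generator(z_dim=128, n_dim=g_dim)
        D = Discriminator(n_dim=d_dim)
        # ... train the pair for 10 epochs and inspect 100 samples
        #     for diversity and signs of mode collapse ...
```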

Varying the Learning Rates of the Adam Optimizers

In this experiment, I trained the network for 10 epochs with different learning rates for the Adam optimizers, and I only report results which produced decent samples. We can observe that the network’s stability and convergence properties are highly sensitive to small changes in the learning rate. In particular, learning rates between 0.00005 and 0.0001 produce stable results quickly. Perhaps if we carried out training for longer, we would see the higher learning rates become unstable, but in experiments not shown here I have trained this network using learning rates of 0.0001 and 0.00001 for the generator and discriminator respectively, and it remained stable for hundreds of epochs. We see that training is unstable when the learning rates are high. Moreover, even when the generator’s learning rate is low, a high discriminator learning rate also makes training unstable. We can infer that lower learning rates generally lead to stable training, and that we generally want an equal or slightly higher learning rate for the generator network.
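
For concreteness, here is a sketch of the two optimizers with the stable pair of learning rates mentioned above, where G and D are the networks sketched earlier; the beta values are common GAN defaults and are an assumption, not a setting reported in this post.

```python
import torch.optim as optim

# Separate Adam optimizers, with the generator learning slightly faster
# than the discriminator (the stable pair reported above).
opt_G = optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_D = optim.Adam(D.parameters(), lr=1e-5, betas=(0.5, 0.999))
```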

Plots of the Loss vs. Iteration:

Finally, I want to show plots of the loss function over the training periods to highlight that convergence of the loss function does not necessarily guarantee that the generator sample quality has substantially improved.

We see that even though the generator loss is very low, sample images may still be extremely poor. We also observe that sample quality keeps improving even after the loss has essentially stopped decreasing. Finally, when we compare the above plots, which show the loss for two separate networks, we can see that even though they converge to the same loss, the quality of samples from each network is extremely different. Clearly, the samples from the network in the left plot are much higher quality, even though the loss converges to about 0.4 in both cases. For these reasons, the log-likelihood loss associated with the original GAN is not a reliable measure of sample quality. In general, this argument extends to any f-divergence between distributions with disjoint supports.
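
To compare loss curves against samples, it helps to record both during training. A minimal plotting sketch, assuming g_losses and d_losses were appended once per iteration (as in the training-loop sketch at the end of the post):

```python
import matplotlib.pyplot as plt

# Plot the recorded per-iteration losses alongside saved sample grids;
# the curves alone are not enough to judge sample quality.
plt.plot(g_losses, label="generator loss")
plt.plot(d_losses, label="discriminator loss")
plt.xlabel("iteration")
plt.ylabel("loss")
plt.legend()
plt.show()
```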

We can also compare the performance of different loss functions. Below, we train the network using an MSE loss in the style of LSGANs [5]. Since minimizing the MSE loss is shown to be equivalent to minimizing the Pearson χ² divergence, and both it and the JSD (minimized by the log-likelihood loss) are f-divergences, we would expect similar behaviour during training, which is what we observe. Moreover, minimizing the MSE loss is shown to have the extra benefit of pulling generated samples closer to the decision boundary, thus preventing the generator from getting stuck producing poor samples that nonetheless fool the discriminator. Both losses also appear to converge in about the same number of iterations.
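
A sketch of how the objective can be swapped, using the networks from the earlier sketches. Note that the LSGAN paper uses a discriminator with a linear output; keeping the sigmoid output described above is an assumption here, and real_images stands for a flattened batch of real MNIST images.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()    # LSGAN-style least-squares objective [5]
# criterion = nn.BCELoss()  # original log-likelihood objective, for comparison

batch_size, z_dim = 128, 8
z = torch.randn(batch_size, z_dim)
real_target = torch.ones(batch_size, 1)
fake_target = torch.zeros(batch_size, 1)

# real_images: a flattened (batch_size, 784) batch of MNIST images.
# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
d_loss = criterion(D(real_images), real_target) + criterion(D(G(z).detach()), fake_target)

# Generator step: push D(G(z)) toward the "real" target.
g_loss = criterion(D(G(z)), real_target)
```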

Final Thoughts

For anyone aspiring to learn about GANs, creating a simple network such as the one presented here is a very productive first step. In this simple framework, we can see the typical problems we face when creating and training GANs, as well as some ways to alleviate them. In my next posts, I plan to introduce Conditional GANs, DCGANs, and InfoGANs.

Code for Implementation
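
Below is a condensed training-loop sketch consistent with the description above, reusing the Generator and Discriminator classes sketched earlier. It is not the original implementation; the batch size, epoch count, and logging details are assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

z_dim, n_dim, batch_size, epochs = 8, 16, 128, 25

# MNIST pixels land in [0, 1] after ToTensor, matching the sigmoid output of G.
loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=batch_size, shuffle=True, drop_last=True,
)

G, D = Generator(z_dim=z_dim, n_dim=n_dim), Discriminator(n_dim=n_dim)
opt_G = optim.Adam(G.parameters(), lr=1e-4)
opt_D = optim.Adam(D.parameters(), lr=1e-5)
criterion = nn.BCELoss()

real_target = torch.ones(batch_size, 1)
fake_target = torch.zeros(batch_size, 1)
g_losses, d_losses = [], []  # per-iteration losses for the plots above

for epoch in range(epochs):
    for x, _ in loader:
        x = x.view(batch_size, -1)  # flatten 28x28 images to 784-d vectors

        # Discriminator step: real images toward 1, generated images toward 0.
        z = torch.randn(batch_size, z_dim)
        d_loss = criterion(D(x), real_target) + criterion(D(G(z).detach()), fake_target)
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # Generator step: try to make D label generated images as real.
        z = torch.randn(batch_size, z_dim)
        g_loss = criterion(D(G(z)), real_target)
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

        d_losses.append(d_loss.item())
        g_losses.append(g_loss.item())

    print(f"epoch {epoch + 1}: d_loss={d_loss.item():.3f}  g_loss={g_loss.item():.3f}")
```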

Resources:

[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.

[2] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.

[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets, 2016.

[5] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks, 2016. arXiv:1611.04076, http://arxiv.org/abs/1611.04076.

- Brian Loos
