Introduction to GANs (Generative Adversarial Networks)
GANs (generative adversarial networks) are a relatively recent development in the deep learning landscape, yet they have already made a significant impact on the field. They were first introduced by Goodfellow et al. [1], whose adversarial training framework allows a generator network, G, and a discriminator network, D, to be trained end-to-end and simultaneously. Before GANs, training deep generative models generally ran into intractable likelihood computations, a difficulty the adversarial framework sidesteps. Since their inception, many modifications of the basic GAN architecture have been proposed to ease training of the generator and to prevent mode collapse. Among the most influential are Conditional GANs [2], DCGAN [3], InfoGAN [4], LSGAN [5], WGAN [6], and the more recently proposed StyleGAN [7]. To understand any of these architectures, we must first understand the original GAN as proposed by Goodfellow et al. [1].
Generative Adversarial Networks (Goodfellow et al.)
In a GAN, the generator network and the discriminator network are pitted against each other in a two-player game. The generator network attempts to produce fake data (images, time series, etc.) while the discriminator’s goal is to distinguish fake data from real data. The competition between the two networks “drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles” [1].
We can describe the training objective of this network as a two-player minimax game. Given an input, either x or G(z), we want the discriminator to output 1 when it determines the input comes from the real data set, and 0 when the input is a fake, G(z). The generator, however, wants to create samples good enough that the discriminator thinks they are real; that is, it wants to produce outputs G(z) which the discriminator labels as real (1). This gives the two-player minimax game described in [1]:
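$$\min_G \max_D V(D, G) = \underbrace{\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)]}_{(1)} + \underbrace{\mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]}_{(2)}$$

where the expectation over real data is referred to as term (1) and the expectation over generated data as term (2) in what follows.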
If we break down this minimax objective, we can see that the discriminator aims to maximize both terms. Since the outputs of the network D lie in the interval [0,1], each log term takes values in [−∞, 0]. When we maximize the first term, (1), with respect to D, we attain the maximum when 𝐷(𝑥) = 1 ⇒ log(𝐷(𝑥)) = 0, i.e. when the discriminator labels real data as real. Likewise, when we maximize the second term, (2), with respect to D, the maximum occurs when 𝐷(𝐺(𝑧)) = 0 ⇒ log(1 − 𝐷(𝐺(𝑧))) = 0, i.e. when the discriminator labels fake data as fake. On the other hand, when we minimize (1) and (2) with respect to the generator, G, we first observe that G has no effect on (1): the generator cannot affect the discriminator’s ability to label real data. However, when G minimizes (2), the term moves towards its minimum as 𝐷(𝐺(𝑧)) tends to 1, i.e. as the discriminator is fooled into labelling fake data as real. Thus, as we carry out this two-player minimax game, the discriminator improves at recognizing real data and spotting fake data while the generator improves at fooling the discriminator by creating better fake data.
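To make these optima concrete, here is a small NumPy sketch (purely illustrative; the discriminator outputs are made-up constants rather than a trained network) that evaluates the value function for a near-perfect discriminator and for the constant discriminator D(·) = 1/2:

```python
import numpy as np

def value_fn(d_real, d_fake):
    """Estimate V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] from samples."""
    eps = 1e-12  # guard against log(0)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# Near-perfect discriminator: real data scored ~1, fake data scored ~0.
print(value_fn(d_real=np.full(1000, 0.999), d_fake=np.full(1000, 0.001)))  # ~0 (the maximum)

# Constant discriminator D(.) = 1/2, which is optimal when p_g = p_data.
print(value_fn(d_real=np.full(1000, 0.5), d_fake=np.full(1000, 0.5)))      # ~ -log 4 ≈ -1.386
```

The second value, − log 4, is exactly the global minimum identified in the theorem below.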
Goodfellow et al.’s original work [1] included three key results which provide the theoretical foundation for GANs. The following results are taken directly from [1]; for the full proofs, please refer to [1].
Proposition 1: For a fixed generator, G, the optimal discriminator D is
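$$D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}$$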
Proposition 2: If G and D have enough capacity, and at each step of Algorithm 1 (given below) the discriminator is allowed to reach its optimum given G, and 𝑝𝑔 is updated so as to improve the criterion written out below, then 𝑝𝑔 converges to 𝑝𝑑𝑎𝑡𝑎.
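The criterion referenced in Proposition 2 is the objective evaluated at the optimal discriminator from Proposition 1:

$$\mathbb{E}_{x \sim p_{data}}[\log D^{*}_{G}(x)] + \mathbb{E}_{x \sim p_{g}}[\log(1 - D^{*}_{G}(x))]$$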
Algorithm 1 proposed in [1]
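Since the algorithm itself is only referenced above, here is a minimal PyTorch-flavored sketch of the alternating minibatch updates Algorithm 1 describes. The toy data distribution, network sizes, optimizer settings, and iteration counts below are my own illustrative assumptions, not choices taken from [1]:

```python
import torch
import torch.nn as nn

# Toy setup: 1-D "real" data drawn from N(3, 1); small MLPs for G and D.
latent_dim, data_dim = 8, 1
k = 1  # discriminator updates per generator update (the "k" in Algorithm 1)

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_D = torch.optim.SGD(D.parameters(), lr=1e-3, momentum=0.9)
opt_G = torch.optim.SGD(G.parameters(), lr=1e-3, momentum=0.9)

def sample_real(m):
    return 3.0 + torch.randn(m, data_dim)  # stand-in for minibatches from p_data

def sample_noise(m):
    return torch.randn(m, latent_dim)      # minibatches from the noise prior p_z

m, num_iterations, eps = 128, 1000, 1e-8   # eps guards against log(0)
for _ in range(num_iterations):
    # k steps of gradient ASCENT for D on log D(x) + log(1 - D(G(z))),
    # implemented as descent on the negated objective.
    for _ in range(k):
        x, z = sample_real(m), sample_noise(m)
        d_loss = -(torch.log(D(x) + eps).mean()
                   + torch.log(1 - D(G(z).detach()) + eps).mean())
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # One step of gradient DESCENT for G on log(1 - D(G(z))).
    z = sample_noise(m)
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```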
Theorem: The global minimum of the virtual training criterion C(G) is achieved if and only if 𝑝𝑔 = 𝑝𝑑𝑎𝑡𝑎. At that point, C(G) achieves the value − log 4.
I will give a sketch of the proof of this theorem. By direct inspection, we see that when 𝑝𝑔 = 𝑝𝑑𝑎𝑡𝑎, the optimal discriminator outputs 1/2 everywhere and the value is − log 4. To show this is the global minimum, we first assume we have an optimal discriminator; from Proposition 1 we know its exact form. When this optimal discriminator is substituted into the inner maximization of the two-player minimax game, we attain the following virtual training criterion:
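$$C(G) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}\right] + \mathbb{E}_{x \sim p_{g}}\left[\log \frac{p_{g}(x)}{p_{data}(x) + p_{g}(x)}\right]$$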
By adding and subtracting log 4, we can recognize this as the sum of two Kullback–Leibler divergences plus a constant:
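$$C(G) = -\log 4 + \mathrm{KL}\left(p_{data} \,\middle\|\, \frac{p_{data} + p_{g}}{2}\right) + \mathrm{KL}\left(p_{g} \,\middle\|\, \frac{p_{data} + p_{g}}{2}\right)$$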
Furthermore, we recognize this pair of KL divergences as (twice) the Jensen–Shannon divergence between the two distributions, which attains its minimum of zero only when the two distributions are equal. This occurs if and only if 𝑝𝑔 = 𝑝𝑑𝑎𝑡𝑎.
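Written out, with JSD denoting the Jensen–Shannon divergence:

$$C(G) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{data} \,\|\, p_{g})$$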
The key realization Goodfellow et al. made in [1] was that training the adversarial network is equivalent to determining an ideal discriminator and minimizing the Jensen–Shannon divergence between the distributions 𝑝𝑑𝑎𝑡𝑎 and 𝑝𝑔. If you are not familiar with entropy, cross-entropy, or the Kullback–Leibler divergence, I highly encourage you to watch the short video here.
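If a quick numeric refresher helps, here is a small NumPy sketch (my own helper functions, not code from [1]) of the discrete Kullback–Leibler and Jensen–Shannon divergences; note that the JSD is zero exactly when the two distributions match:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions (no zero entries)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, non-negative, zero iff p == q."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
print(jsd(p, q))  # > 0 for distinct distributions
print(jsd(p, p))  # 0.0 when the distributions are identical
```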
Common Shortcomings of GANs
At a glance, the theoretical results of Goodfellow et al. [1] seem to indicate that the GAN training framework solves all the problems plaguing generative deep learning. In practice, however, adversarial networks are known to be extremely difficult to train due to vanishing gradients, and even when they do converge they tend to suffer from a phenomenon known as mode collapse. The authors of the original paper encountered these issues when training GANs built from MLPs: they found that early in training the discriminator converged much more quickly than the generator, causing the gradient to vanish during backpropagation. Moreover, they also found that if the generator is left to converge too quickly, it leads to a “‘Helvetica scenario’ in which G collapses too many values of z to the same value of x to have enough diversity to model 𝑝𝑑𝑎𝑡𝑎” [1].
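To see the vanishing-gradient issue numerically, here is a tiny PyTorch check (illustrative only) showing that the gradient of the generator objective log(1 − D(G(z))) with respect to the discriminator’s logit shrinks toward zero as the discriminator becomes confident a sample is fake:

```python
import torch

# d/dlogit log(1 - sigmoid(logit)) = -sigmoid(logit), which vanishes as the
# logit (the discriminator's raw score for a fake sample) becomes very negative.
for val in [-6.0, -2.0, 0.0]:
    logit = torch.tensor(val, requires_grad=True)
    loss = torch.log(1 - torch.sigmoid(logit))  # the generator's objective to minimize
    loss.backward()
    print(f"D(G(z)) = {torch.sigmoid(logit).item():.4f}, dLoss/dlogit = {logit.grad.item():.4f}")
```

With a confident discriminator (the first case), almost no gradient signal reaches the generator, which matches the early-training behaviour described above.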
In the years since the original GANs paper [1], many network architectures based on GANs have attempted to overcome these shortcomings [5,6]. A paper by Arjovsky and Bottou took a theoretical dive into the issues of the original GAN [8]; it deserves a separate post of its own for its contributions to the theoretical understanding of GAN training.
Resources:
[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio: “Generative Adversarial Networks”, 2014; arXiv:1406.2661, http://arxiv.org/abs/1406.2661.
[2] Mehdi Mirza, Simon Osindero: “Conditional Generative Adversarial Nets”, 2014; arXiv:1411.1784, http://arxiv.org/abs/1411.1784.
[3] Alec Radford, Luke Metz, Soumith Chintala: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, 2015; arXiv:1511.06434, http://arxiv.org/abs/1511.06434.
[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel: “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets”, 2016; arXiv:1606.03657, http://arxiv.org/abs/1606.03657.
[5] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, Stephen Paul Smolley: “Least Squares Generative Adversarial Networks”, 2016; arXiv:1611.04076, http://arxiv.org/abs/1611.04076.
[6] Martin Arjovsky, Soumith Chintala, Léon Bottou: “Wasserstein GAN”, 2017; arXiv:1701.07875, http://arxiv.org/abs/1701.07875.
[7] Tero Karras, Samuli Laine, Timo Aila: “A Style-Based Generator Architecture for Generative Adversarial Networks”, 2018; arXiv:1812.04948, http://arxiv.org/abs/1812.04948.
[8] Martin Arjovsky, Léon Bottou: “Towards Principled Methods for Training Generative Adversarial Networks”, 2017; arXiv:1701.04862, http://arxiv.org/abs/1701.04862.
- Brian Loos