Cycle-GAN

xNeurals
Sep 10, 2021


In my two most recent posts, I discussed conditional GANs as well as ResNets and VGG-16/19 in detail. I used a conditional GAN to learn to create images of digits from the MNIST handwritten data set. The simple GAN I built worked well and learned how to write the different digits with variation. This week I worked on a related project at the intersection of all of these concepts: image-to-image translation. Image-to-image translation is the task of transferring the semantic content of an image in one set, X, to another set, Y: for example, translating satellite images to street maps, converting zebras to horses, or transforming photographs into paintings.

Fig 1: Example outputs from cycle-GAN [1]

Generative networks such as autoencoders and variational autoencoders are naturally suited to image-to-image translation tasks, and an adversarial framework provides an effective way of ‘learning’ the correct loss function. One of the simplest neural networks we could design for image-to-image translation would have the shape of an autoencoder or VAE [3]. Autoencoders roughly work by stacking a decoder on top of an encoder: the encoder converts the image to a latent code, and the decoder converts that code into the final output image. We can improve this architecture by introducing skip connections, as done in [4]. However, in both cases we have two major shortcomings: (1) we must hand-define a loss function on images, and (2) the generator learns a mapping between image pairs, not between distributions. To address the first shortcoming, we can introduce a new loss: the perceptual loss [2]. It has been shown repeatedly that training a generator using L1 or Euclidean (L2) loss on raw pixels alone results in blurry, unclear images. [2] shows that comparing the L2 distance between the activations produced by feeding the two images through a pre-trained deep convnet (the perceptual loss) allows the generator to learn to produce much better images.
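To make the perceptual loss concrete, here is a minimal PyTorch sketch, assuming torchvision's pre-trained VGG-16 is used as the fixed feature extractor. The class name, the layer cut-off, and the use of a single feature layer are my illustrative choices, not necessarily what [2] or the author's code does.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """L2 distance between deep VGG-16 features instead of raw pixels."""
    def __init__(self, layer_index=16):
        super().__init__()
        # Keep the first `layer_index` layers of VGG-16 (roughly up to relu3_3)
        # as a frozen feature extractor.
        vgg_features = models.vgg16(pretrained=True).features
        self.features = nn.Sequential(*list(vgg_features.children())[:layer_index]).eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, generated, target):
        # Compare activations rather than pixels: blurry outputs that match
        # pixel statistics but not structure are penalized.
        return nn.functional.mse_loss(self.features(generated), self.features(target))
```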

Even with a perceptual loss, if we opt to work in a non-adversarial framework, we can still only hope to learn a mapping between image pairs. For instance, the network proposed in [2] learns a mapping from input images to images stylized after a single reference photo, as shown in figure 2.

Fig 2: Example outputs from “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” [2]

We can attempt to overcome both of these shortcomings by introducing an adversarial training framework. GANs overcome (1) easily, since they learn an underlying loss function appropriate to the application. One of the first GANs to gain popularity for image-to-image translation was pix2pix [5]. It introduced a conditional GAN whose generator was a U-Net-style autoencoder trained on image pairs. The authors showed that pix2pix was capable of learning how to infill, segment, and colorize images, among other applications. Pix2pix also introduced a novel discriminator they called the PatchGAN discriminator. Rather than trying to label an entire image as real or fake, a PatchGAN classifies fixed-size patches, and because it is fully convolutional it scales to arbitrarily sized images. They found that this discriminator performs better while having far fewer parameters than a full-image discriminator.
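Below is a rough sketch of a 70x70 PatchGAN discriminator in PyTorch, following the layer pattern described in [5]. The channel counts, instance normalization, and helper function are common choices I am assuming here, not necessarily the author's exact configuration.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride, norm=True):
    # 4x4 convolution -> (optional) instance norm -> LeakyReLU
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            *conv_block(in_channels, 64, stride=2, norm=False),
            *conv_block(64, 128, stride=2),
            *conv_block(128, 256, stride=2),
            *conv_block(256, 512, stride=1),
            # One score per receptive-field patch; no sigmoid, since the
            # least-squares GAN loss works on raw scores.
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x):
        # Output shape (N, 1, H', W'): a grid of real/fake scores, one per patch.
        return self.model(x)
```

Because the network is fully convolutional, the same weights can be applied to images of any size; only the size of the output score grid changes.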

Despite its success, pix2pix still requires labelled training pairs, such as input images and their ground-truth segmentation maps. The underlying network therefore learns a mapping from individual images to their corresponding target images. This is fine for image-to-image translation problems that are one-to-one, such as segmentation. For other problems, such as infilling or colorization, we want to learn a one-to-many mapping. Since pix2pix is trained on discrete pairs, it learns a deterministic mapping from one set of images to the other, when it would be desirable to learn a mapping between the image sets at the distribution level. Cycle-GAN overcomes this problem by introducing a novel adversarial training framework for image-to-image translation.

Cycle-GAN architecture

Fig 4: Cycle-GAN architecture

Rather than constraining the image-to-image translation problem with a conditional argument passed to the generator and discriminator, cycle-GANs constrain the network with a ‘cycle consistency’ constraint [1], which tries to minimize a reconstruction error.

“Our goal is to learn mapping functions between two domains X and Y given training samples {x_i}, i = 1…N, where x_i ∈ X, and {y_j}, j = 1…M, where y_j ∈ Y. We denote the data distribution as x ∼ p_data(x) and y ∼ p_data(y). As illustrated in Figure 3 (a), our model includes two mappings G : X → Y and F : Y → X. In addition, we introduce two adversarial discriminators D_X and D_Y, where D_X aims to distinguish between images {x} and translated images {F(y)}; in the same way, D_Y aims to discriminate between {y} and {G(x)}. Our objective contains two types of terms: adversarial losses [16] for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.” [1]

This framework takes inspiration from the idea of “back translation and reconciliation” in natural language processing [1], where we can check a translation model’s consistency by how close the back translation is to the original phrase. In the same sense, when we translate an image through network G and back through network F, we want the cycle to be consistent: we would like to recover the original image. With this intuition, we can build a cycle-GAN in the style of Zhu et al. [1]. Their original code is available here.
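As a tiny sketch of this back-translation idea, assume G: X → Y and F: Y → X are any two generators that accept and return image tensors of the same shape; the function name and L1 penalty below are my illustrative choices.

```python
import torch.nn.functional as nnf

def forward_cycle_loss(G, F, x):
    y_fake = G(x)                 # translate into the target domain
    x_reconstructed = F(y_fake)   # translate back into the source domain
    # The cycle is consistent if we (roughly) recover the original image.
    return nnf.l1_loss(x_reconstructed, x)
```

The same constraint is applied in the opposite direction, starting from a real target-domain image y and passing it through F and then G.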

Cycle-GAN differs from a standard GAN or conditional GAN in that the network has two separate generator-discriminator pairs. There is the generator G, which maps source images x into target-context images y*, and its corresponding discriminator D_Y. Likewise, we have the generator F, which maps context images y into source images x*, and its discriminator D_X. We connect the networks as shown in figure 4. Following each colored arrow in figure 4, we can see the six losses computed for a cycle-GAN. Each discriminator is trained in the usual fashion but with a least-squares (L2) loss, as in LSGAN [6], while the cycle and identity losses can use either L1 or L2 loss. In my implementation I used L1 loss, as Zhu et al. do. My implementation is available on my github: here.
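Here is a minimal sketch of one discriminator update with the least-squares loss from LSGAN [6]. The function name and the assumption that D_Y, G, and the optimizer are already defined are mine; the author's training loop may be organized differently.

```python
import torch
import torch.nn.functional as nnf

def discriminator_step(D_Y, G, x, y, opt_DY):
    """One least-squares update for D_Y: real target images vs. G's translations."""
    opt_DY.zero_grad()
    pred_real = D_Y(y)
    pred_fake = D_Y(G(x).detach())  # detach so gradients do not flow into G here
    # Real patches should score 1, fake patches should score 0.
    loss = nnf.mse_loss(pred_real, torch.ones_like(pred_real)) + \
           nnf.mse_loss(pred_fake, torch.zeros_like(pred_fake))
    loss.backward()
    opt_DY.step()
    return loss.item()
```

The update for D_X is symmetric, using real source images x and translations F(y).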

Cycle-GAN introduces three novel losses on top of a standard GAN architecture: the forward cycle loss, the backward cycle loss, and the identity loss. These losses constrain the generators G and F so that they more closely model the desired distributions Y and X. The cycle losses impose a cycle consistency constraint on the entire network, and the two identity losses help the generators learn their target distributions.
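Putting these together, a sketch of the combined generator objective looks like the following: adversarial terms plus forward cycle, backward cycle, and identity losses. The weights lambda_cyc = 10 and lambda_id = 5 follow the values commonly used with the original code; the author's exact settings are an assumption on my part.

```python
import torch
import torch.nn.functional as nnf

def generator_loss(G, F, D_X, D_Y, x, y, lambda_cyc=10.0, lambda_id=5.0):
    fake_y, fake_x = G(x), F(y)

    # Adversarial terms (least squares): each generator tries to make its
    # discriminator score the translation as real.
    pred_fake_y, pred_fake_x = D_Y(fake_y), D_X(fake_x)
    adv = nnf.mse_loss(pred_fake_y, torch.ones_like(pred_fake_y)) + \
          nnf.mse_loss(pred_fake_x, torch.ones_like(pred_fake_x))

    # Forward and backward cycle losses: translating there and back should
    # reconstruct the original image (L1).
    cyc = nnf.l1_loss(F(fake_y), x) + nnf.l1_loss(G(fake_x), y)

    # Identity losses: feeding a generator an image already in its target
    # domain should change it as little as possible.
    idt = nnf.l1_loss(G(y), y) + nnf.l1_loss(F(x), x)

    return adv + lambda_cyc * cyc + lambda_id * idt
```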

Since the network makes no application-specific design choices, it can be applied to any image-to-image translation problem, such as style transfer, segmentation, or colorization. The generator networks are fully convolutional with residual connections, in the style of Johnson et al. [2], and the discriminators are the 70x70 PatchGAN discriminators from [5]. This makes the network suitable for arbitrarily sized images and gives it good scalability.
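For reference, here is a condensed sketch of that ResNet-style generator: two downsampling convolutions, a stack of residual blocks, and two upsampling layers back to the input resolution. The exact channel counts, padding scheme, and class names are illustrative assumptions, not a copy of the author's code.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection around the two convolutions

class ResNetGenerator(nn.Module):
    def __init__(self, channels=3, n_res_blocks=7):
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(channels, 64, 7),
                  nn.InstanceNorm2d(64), nn.ReLU(True)]
        # Two stride-2 downsampling convolutions.
        layers += [nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(True),
                   nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(True)]
        # Residual blocks at the bottleneck resolution.
        layers += [ResidualBlock(256) for _ in range(n_res_blocks)]
        # Two upsampling layers back to the input resolution.
        layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
                   nn.InstanceNorm2d(128), nn.ReLU(True),
                   nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
                   nn.InstanceNorm2d(64), nn.ReLU(True),
                   nn.ReflectionPad2d(3), nn.Conv2d(64, channels, 7), nn.Tanh()]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```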

Implementation:

In my implementation of Cycle-GAN, I implemented the network as described in the original work [1], except for their use of Shrivastava et al.’s discriminator training technique (a buffer of previously generated images) [7]. In training, I used 7 residual blocks on 512x512 RGB images. I tested the network by training it on a data set consisting of cat photos (X) and Monet paintings (Y). The datasets I trained on were quite small (26 cat images and 73 Monet paintings); I found I got better results in a reasonable time on my image set when I trained on smaller datasets. I also found that these networks use more memory than I have readily available, since four deep networks have to be loaded onto the GPU along with whatever data we want to train on. There are probably parts of my code that could be made more efficient, but even the original implementation notes that the network consumes a large amount of VRAM [1]. I followed the original implementation by training all networks with Adam optimizers using a learning rate of 0.0002, decayed after the first 100 epochs, for a total of 200 epochs.
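A sketch of that optimizer setup is below: Adam at learning rate 2e-4, held constant for 100 epochs and then decayed linearly to zero by epoch 200. The Adam betas of (0.5, 0.999), the grouping of G with F and D_X with D_Y into shared optimizers, and the helper name are assumptions on my part.

```python
import itertools
import torch

def make_optimizers(G, F, D_X, D_Y, lr=2e-4, decay_start=100, total_epochs=200):
    opt_G = torch.optim.Adam(itertools.chain(G.parameters(), F.parameters()),
                             lr=lr, betas=(0.5, 0.999))
    opt_D = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()),
                             lr=lr, betas=(0.5, 0.999))

    def linear_decay(epoch):
        # Full learning rate for the first `decay_start` epochs,
        # then fade linearly to zero by `total_epochs`.
        return 1.0 - max(0, epoch - decay_start) / (total_epochs - decay_start)

    sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda=linear_decay)
    sched_D = torch.optim.lr_scheduler.LambdaLR(opt_D, lr_lambda=linear_decay)
    return opt_G, opt_D, sched_G, sched_D
```

The schedulers are stepped once per epoch so the decay tracks the epoch count rather than the iteration count.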

Below I include my best results as well as a training sample from the 200 training epochs. There is a large quality improvement part way through because of a small change I made to my Monet painting data loader: before the improvement, I was training on center crops of the Monet paintings, and when I swapped the center crops for random crops the quality improved drastically. Ideally, I would have used random crops the entire time, but unfortunately I did not.
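For clarity, the data-loader change amounts to swapping one torchvision transform for another, roughly as sketched here; the resize size and normalization values are illustrative, not my exact settings.

```python
from torchvision import transforms

# Before: always the same central region of each painting.
center_crop = transforms.Compose([
    transforms.Resize(572),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# After: a different region each time, so a small set of paintings
# yields far more varied training crops.
random_crop = transforms.Compose([
    transforms.Resize(572),
    transforms.RandomCrop(512),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
```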

Resources:

[1] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros: “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”, 2017; arXiv:1703.10593, http://arxiv.org/abs/1703.10593.

[2] Justin Johnson, Alexandre Alahi, Li Fei-Fei: “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, 2016; arXiv:1603.08155, http://arxiv.org/abs/1603.08155.

[3] Diederik P Kingma, Max Welling: “Auto-Encoding Variational Bayes”, 2013; arXiv:1312.6114, http://arxiv.org/abs/1312.6114.

[4] Olaf Ronneberger, Philipp Fischer, Thomas Brox: “U-Net: Convolutional Networks for Biomedical Image Segmentation”, 2015; arXiv:1505.04597, http://arxiv.org/abs/1505.04597.

[5] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros: “Image-to-Image Translation with Conditional Adversarial Networks”, 2016; arXiv:1611.07004, http://arxiv.org/abs/1611.07004.

[6] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, Stephen Paul Smolley: “Least Squares Generative Adversarial Networks”, 2016; arXiv:1611.04076, http://arxiv.org/abs/1611.04076.

[7] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, Russ Webb: “Learning from Simulated and Unsupervised Images through Adversarial Training”, 2016; arXiv:1612.07828, http://arxiv.org/abs/1612.07828.
