VGG nets and ResNet

xNeurals
8 min read · Sep 2, 2021

Background:

While I want to keep my posts focused on GANs with the goal of explaining full video generation, it is almost impossible to understand modern deep learning without understanding VGG-19 and ResNet. The widespread success of these networks in classification and localization tasks, and their ability to generalize their ‘knowledge’ to new data sets and categories once trained on a sufficiently large data set, is extremely impressive. Both papers demonstrate this generalization in their experiments, and both rely on ‘deep’ architectures, with some of the ResNet experiments using over 100 layers.

Krizhevsky et al. [4] were the first to show that convolutional networks with a relatively large number of convolutional layers (five) could achieve state-of-the-art performance on modern classification tasks with fewer parameters than their competitors. This finding sparked research into deep convolutional neural networks for tasks in the image domain. At the time, the benefits of depth were not well understood, and the design of these early deep convnets was an open question. The primary design question is how to balance network depth against convolution kernel size: increasing depth allows a network to learn higher-level features, while increasing kernel size gives each layer a larger receptive field. Since memory and computation are both finite resources, finding the right balance between these two elements is key to designing a successful convnet.

3x3 Convolution Filter

Among these early networks, VGG-19, published by the Oxford Visual Geometry Group, was very successful, and the authors attributed this to the depth of their network [2]. They found that by using only 3x3 convolution kernels, they could build a much deeper network while maintaining a manageable number of model weights. To my knowledge, theirs is the first paper to explicitly recognize the two main advantages of using only 3x3 convolution kernels: reduced memory usage and increased model nonlinearity.

In a convolutional neural network, the number of weights in a single layer is roughly proportional to the number of input and output channels and the square of the filter size. For instance, a single neuron with a 3x3 receptive field has 9 weights associated with it, while a neuron with a 7x7 receptive field has 49. Simonyan and Zisserman realized that they could achieve large receptive fields using stacks of 3x3 convolutions. If we stack two 3x3 convolutional layers on top of each other, a neuron at the output of the stack has an effective 5x5 receptive field on the stack’s input. The same logic continues for larger stacks: a stack of three 3x3 convolutional layers has an effective receptive field of 7x7, a stack of four has an effective receptive field of 9x9, and so on. While the receptive fields may be the same, the number of weights used to achieve them is not. Recall that a neuron with a 7x7 receptive field has 49 weights associated with it; a stack of three 3x3 convolutional layers covers the same receptive field but only requires 3 × (3²) = 27 weights to do so. Therefore, stacking layers is a more memory-efficient way to widen receptive fields and can lead to better network performance for a given number of parameters, as shown by VGG-net.
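As a quick sanity check of this arithmetic, the sketch below counts the weights of a single 7x7 convolution against a stack of three 3x3 convolutions in PyTorch. The choice of 64 channels is arbitrary and only for illustration:

```python
import torch.nn as nn

# Compare weight counts: one 7x7 conv vs. a stack of three 3x3 convs,
# both mapping C channels to C channels (C = 64 is an arbitrary choice).
C = 64

single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
stack_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(single_7x7))  # 7*7*C*C     = 200,704 weights
print(count(stack_3x3))   # 3 * 3*3*C*C = 110,592 weights, same 7x7 receptive field
```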

The second key observation they made about 3x3 convolutional layers is that a stack of them has increased nonlinearity over its larger-kernel counterpart. In other words, a stack of three 3x3 convolutional layers can inject nonlinearity, in the form of normalization and activation functions, between each layer, whereas an ‘equivalent’ 7x7 layer can only apply a single normalization and activation to its output. This helps the network approximate more nonlinear functions.
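To make this concrete, here is a minimal, purely illustrative comparison: the stacked version applies a ReLU after every 3x3 convolution, while the single 7x7 layer can only apply one at the end.

```python
import torch.nn as nn

# Three nonlinearities interleaved with the 3x3 convolutions...
stacked = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
)

# ...versus a single nonlinearity after the 'equivalent' 7x7 convolution.
single = nn.Sequential(
    nn.Conv2d(64, 64, 7, padding=3), nn.ReLU(inplace=True),
)
```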

Overcoming depth: residual connections

The performance of Simonyan and Zisserman’s VGG-19 network makes a very convincing case for the effectiveness of deep neural networks over other architectures. With this information, the naive approach would be to stack 3x3 convolutional layers with some form of downsampling and call it a day. However, training becomes extremely difficult with these deeper networks. As noted in [3], many groups were finding that after about 20 convolutional layers, performance deteriorated. He et al. [3] conjectured, and experimentally showed, that this deterioration was caused by instability during training, not by the depth of the network itself. Taking inspiration from preconditioning in fields such as numerical linear algebra and numerical partial differential equations, He et al. propose reframing convnet training as a residual problem, which is exactly what ResNet does.

The motivating intuition behind ResNet is simple: given two networks with the same architecture, if one network has more layers than the other, it should perform no worse. To see this, note that every extra layer could simply take on an identity mapping (basically doing nothing), which would recover the performance of the shallower network. He et al. concluded that the observed poor performance of very deep networks was due to numerical instability in training; in particular, stacks of layers were not able to accurately learn near-identity mappings. To overcome this, He et al. reframe deep network training as a residual problem, since it is numerically ‘easier’ for a composition of nonlinear functions to converge to a zero mapping than to an identity mapping [3].

To achieve this, He et al. introduce residual functions in the form of residual blocks. In a typical neural network, we can assume that a block of layers effectively approximates some continuous function, G(x), by the universal approximation theorem [5]. However, it may be difficult for a series of nonlinear layers to stably converge to an identity mapping or a corresponding projection mapping. Therefore, He et al. introduce the shortcut connection shown in Figure 2 of [3]. The entire block still approximates some function, G(x), but the network layers now only need to approximate a residual function, F(x), so that a forward pass through the block computes G(x) = x + F(x). If the desired underlying map is an identity or projection map, then G(x) ≈ x and F(x) ≈ 0, and learning a zero mapping turns out to be far easier than learning an identity/projection mapping. Thus, if their original proposition is correct (that deep nets should perform no worse than their shallow counterparts), these residual connections should make it attainable.
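As a rough sketch of this idea, a residual block in PyTorch might look like the following. The channel counts and the placement of BatchNorm follow common convention rather than every detail of [3]:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block sketch: the stacked layers learn the residual
    F(x) and the shortcut adds x back, so the block outputs G(x) = x + F(x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # F(x)
        return self.relu(x + residual)                                       # G(x) = x + F(x)
```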

In their experiments, He et al. found that when compared against plain networks of identical depth and parameter count (nets in the style of VGG), ResNets performed better as the layer count increased. Comparisons of these results are plotted in Figure 4 of the original paper [3]. Whereas the plain nets suffered from degradation as layers were added, ResNets saw continual improvements with increasing layer count. These improvements carried over to validation as well, showing that the deeper nets also generalize better. They even experimented with a net of over 1000 layers, and at that depth they began to see overfitting to the dataset before any model degradation. Not only did they find that residual connections help prevent model deterioration, they also found that the layer activations in ResNets were generally smaller than in their plain-net counterparts. This supports their original hypothesis, since the small activations in the residual network can be attributed to the skip connections [3].

A final detail of ResNet [3] compared to VGG nets [2] is that ResNets largely use strided convolutions rather than explicit pooling layers to downsample. Many modern convnets use this approach, and to my knowledge the first to do so was the all-convolutional net of Springenberg et al. [1]. Since then, it has become commonplace to replace pooling operations with strided convolutional layers.
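The snippet below is a small illustration of this swap, with arbitrary channel counts: the strided convolution halves the spatial resolution just like the pooling layer, but with learned filters.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Downsampling with an explicit pooling layer...
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)                     # -> (1, 64, 16, 16)

# ...versus a strided convolution, which halves the resolution while
# also learning its own filter weights (the approach favored in [1], [3]).
strided = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)(x)   # -> (1, 128, 16, 16)

print(pooled.shape, strided.shape)
```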

Implementation and results:

To implement these networks in PyTorch myself, I tried to follow the papers as closely as I could. To clean up the code and make it more pythonic, I took influence from the official torchvision implementation of ResNet located here. In particular, I borrowed the idea of defining separate classes for the residual blocks and then assembling the final networks in their own class.
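A loose sketch of that two-class structure is shown below. It reuses the ResidualBlock from the earlier snippet, and the stage sizes and downsampling layout are placeholders rather than the paper’s exact architecture:

```python
import torch.nn as nn

class SimpleResNet(nn.Module):
    """Sketch of the block-class + network-class structure: the network class
    stitches together stages of ResidualBlock instances (defined earlier)."""

    def __init__(self, num_classes=10, blocks_per_stage=(2, 2, 2)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
        )
        self.stage1 = self._make_stage(16, blocks_per_stage[0])
        self.down1 = nn.Conv2d(16, 32, 3, stride=2, padding=1, bias=False)   # strided downsampling
        self.stage2 = self._make_stage(32, blocks_per_stage[1])
        self.down2 = nn.Conv2d(32, 64, 3, stride=2, padding=1, bias=False)
        self.stage3 = self._make_stage(64, blocks_per_stage[2])
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

    @staticmethod
    def _make_stage(channels, num_blocks):
        return nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

    def forward(self, x):
        x = self.stage1(self.stem(x))
        x = self.stage2(self.down1(x))
        x = self.stage3(self.down2(x))
        return self.head(x)
```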

I want to make it very clear that I am not a computer scientist and I have never studied computer science in great depth on my own or in college. Almost all of my experience has been in writing high level scripts to run numerical programs and manipulate data, so I do not have much experience writing object oriented code. For lack of a better way of putting it, taking strong influence from and rewriting existing code has been an extremely useful exercise for me to learn good practices.

In addition, starting this week, I will be posting my code to my GitHub repo here instead of embedding it into posts. I feel that as posts get longer and projects become more complicated, moving to GitHub repos will be the better choice for future posts. My previous posts’ code should all be included as well. Prior to this week, I worked out of a mixture of Python scripts and Jupyter notebooks, and my repo reflects that. Over time, I plan to migrate everything to one format, but I have not yet chosen which (Python scripts or notebooks).

Below is a plot of the mean training loss from a simple ResNet. The original paper [3] discusses variants of ResNet designed for ImageNet classification on 224x224 crops, as well as simplified variants designed for the 32x32 images of the CIFAR10 dataset. Using the simplified variant, I reproduce the results from the original paper. It is difficult to see due to the scale of the plot, but the training loss on CIFAR10 is very low after only 80k iterations. This took about 30 minutes on my computer, so the performance for such a short training run is very good.
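For reference, a bare-bones version of the training loop looks roughly like the sketch below. It uses the SimpleResNet sketched earlier; the optimizer settings loosely follow [3] (SGD with momentum 0.9 and weight decay 1e-4), but the batch size, epoch count, and lack of a learning-rate schedule are placeholders, not a faithful record of my actual run.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = SimpleResNet(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

for epoch in range(10):
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch}: mean training loss {running_loss / len(loader):.4f}")
```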

When validating the top-1 accuracy of this network, I get 84.7%. This is not as good as the results in the original paper; however, I do no augmentation of the dataset. In the paper, they train on multiple crops taken over the training set. I did not do this with my data, which is a likely reason why my trained ResNet does not generalize as well as those presented in the original paper.
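The validation pass itself is straightforward; a sketch of the top-1 accuracy computation is below, assuming the trained `model` and `device` from the training sketch above.

```python
import torch
import torchvision
import torchvision.transforms as T

# Top-1 accuracy on the CIFAR10 test split (uses `model` and `device`
# from the training sketch above).
test_set = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=T.ToTensor()
)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256)

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
print(f"top-1 accuracy: {correct / total:.1%}")
```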

I also wanted to present VGG-16/19 training results, but I could not fit the model and training batches onto my GPU. I leave the code on my GitHub, but I do not have a trained model or results to discuss.

Resources:

[1] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net, 2015.

[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.

[5] George Cybenko. Approximation by superpositions of a sigmoidal function, 1989.

— Brian Loos

