Convolutional Neural Networks
Background on Convolutional Neural Networks
By no means are convolutional neural networks a new technology. They have been our best digital model of a living organism’s visual perception since at least 1980, when Fukushima proposed the Neocognitron as “A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position” [1]. Even this early model was heavily inspired by observations of the natural world. Research by Hubel and Wiesel [2,3] analyzed the striate cortex of cats and monkeys, revealing two key findings that would come to heavily influence Fukushima’s work [1]. First, they found that cells in the first layer possess a receptive field, meaning each such cell responds only to changes in lighting or color within a small part of the creature’s field of view. Second, they found that the striate cortex is structured as alternating layers of cell types, which they labeled simple cells, complex cells, lower-order hypercomplex cells, and higher-order hypercomplex cells [2,3]. An important part of these findings was that modifiable synapses appear only between every other cell layer; in terms of modern machine learning, only every other cell layer has modifiable weights for learning. With this as his inspiration, Fukushima’s Neocognitron [1] implemented what is now recognizable as a three-layer convolutional neural network, and he was the first to observe a convolutional network’s invariance to small distortions as well as its ability to localize information.
The next significant implementation of a convolutional neural network was LeNet-5, proposed in 1999 by LeCun et al. in their work “Object Recognition with Gradient-Based Learning” [4]. They were able to demonstrate that for the task of “recognizing simple objects with high shape variability such as handwritten characters … Convolutional neural networks [were] shown to be particularly well suited to this task” [4]. Their proposed network, LeNet-5, performed well on the MNIST data set and was shown to outperform the state-of-the-art (at the time) SVM- and k-nearest-neighbor-based approaches.
Despite its success, research into convolutional neural networks slowed until Krizhevsky et al. introduced a deep convolutional neural network for ImageNet classification [5]. Their architecture consisted of five convolutional layers and three fully connected layers containing over 60 million parameters. They employed now-standard techniques such as overlapping max pooling and dropout, which aided training and network performance. For more details on their implementation, I encourage you to read their original paper [5]. Their final implementation outperformed the other state-of-the-art image classification algorithms of the time, with an error rate roughly ten percentage points lower than its closest competitor on the ImageNet dataset. Moreover, they showed potential for further improvement from deeper variants containing up to seven convolutional layers. This deep convolutional neural network inspired many highly successful deep CNN architectures such as VGG nets, ResNet, R-CNN, and U-Net [6,7,8,9]. Furthermore, with the advent of GANs (generative adversarial networks), deep convolutional architectures can be combined with an adversarial training framework to obtain very good image generation, as demonstrated by DCGAN, InfoGAN, LSGAN, and many others [10,11,12].
What is a Convolution?
Before discussing convolutional neural networks, it is important to discuss the fundamental building block of these networks: the convolution. A convolution is a mathematical operation for combining two functions. In the case of continuous functions, the mathematical definition of a convolution is

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$
We can visualize the continuous-time convolution as ‘how much’ the two functions overlap at a given point in time. The two animations below illustrate this: one function slides over the other, tracing out an area that corresponds to the value of the convolution at each instant.
Convolution_of_box_signal_with_itself.gif: Brian Amberg (derivative work: Tinos), CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons
Convolution_of_spiky_function_with_box.gif: Brian Amberg (derivative work: Tinos), CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons
However, in machine learning we rarely work with continuous time, so continuous-time convolutions are not our main interest; rather, we are interested in discrete convolutions. We can formally define the discrete convolution as

$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$
The support of a function g is the set of points in its domain where it is nonzero. When the support of g (or f) is finite, we can write the above infinite sum as a finite sum.
Example of a Discrete Convolution:
For example, if we let g be the function

$g[n] = \begin{cases} 1 & n \in \{-1, 0, 1\} \\ 0 & \text{otherwise} \end{cases}$

then, for any discrete function f, we can compute the discrete convolution of f and g:

$(f * g)(y) = f(y - 1) + f(y) + f(y + 1)$
We can observe that the convolution of f and g sums the values of f over the elements neighboring y. This application of a discrete convolution precisely mirrors the local receptive fields observed by Hubel and Wiesel [2,3] and implemented in early CNNs by Fukushima and LeCun [1,4].
If we allow ourselves a little more freedom in the choice of the weighting function g, we can also place a weight on each of the outputs f(y). If we let g be

$g[n] = \begin{cases} w_n & n \in \{-1, 0, 1\} \\ 0 & \text{otherwise} \end{cases}$
then the resulting output has the same form as before, but each term now carries a weight determined by its location relative to the center of the neuron’s receptive field:

$(f * g)(y) = \sum_{n=-1}^{1} w_n\, f(y - n) = w_{1}\, f(y - 1) + w_{0}\, f(y) + w_{-1}\, f(y + 1)$
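As a quick sanity check, here is a small NumPy sketch of this weighted discrete convolution. The signal f and the weights in g are illustrative values I chose, not anything from the discussion above.

```python
import numpy as np

# A toy signal f and a length-3 kernel g = (w_{-1}, w_0, w_1); both are illustrative.
f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g = np.array([0.25, 0.5, 0.25])

# np.convolve implements the definition above (it flips g), so each interior output is
# w_1*f(y-1) + w_0*f(y) + w_{-1}*f(y+1); with symmetric weights the flip makes no difference.
# mode='same' keeps the output the same length as f.
out = np.convolve(f, g, mode='same')
print(out)

# For example, the middle entry is 0.25*2 + 0.5*3 + 0.25*4 = 3.0
```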
In this form, we see that convolutions allow us to take an input vector Y and have each neuron observe only a local part of that vector, corresponding to the neuron’s local receptive field. Furthermore, we can extend this idea to two dimensions, as is commonly done for image processing. I illustrate an example using a 3×3 receptive field:

$\mathrm{out}(y) = \mathrm{sum}\big( W_{3 \times 3} \circledast N_{3 \times 3}(y) \big)$
where $\circledast$ denotes element-wise multiplication between the 3×3 weight matrix $W_{3 \times 3}$ and $N_{3 \times 3}(y)$, the 3×3 neighborhood of the pixel $y$ in the input $Y$, and the sum runs over the resulting entries. We will use convolutions of this form as the basis of our convolutional networks.
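To make the 2-d case concrete, here is a small PyTorch sketch with an arbitrary 5×5 input and 3×3 weights of my choosing. It shows that multiplying W element-wise with a pixel’s 3×3 neighborhood and summing gives the same value that conv2d produces at that location (note that PyTorch’s conv2d is technically a cross-correlation, i.e. it does not flip the kernel).

```python
import torch
import torch.nn.functional as F

# A toy 5x5 single-channel image Y and a 3x3 weight matrix W (illustrative values).
Y = torch.arange(25, dtype=torch.float32).reshape(5, 5)
W = torch.tensor([[0., 1., 0.],
                  [1., -4., 1.],
                  [0., 1., 0.]])

# Element-wise multiply W with the 3x3 neighborhood of the pixel at (2, 2), then sum.
neighborhood = Y[1:4, 1:4]
manual = (W * neighborhood).sum()

# conv2d does the same multiply-and-sum at every valid pixel location.
out = F.conv2d(Y.view(1, 1, 5, 5), W.view(1, 1, 3, 3))
print(manual.item(), out[0, 0, 1, 1].item())  # identical values
```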
Convolutional Layers
First, let us recall the most general form of a multilayer feedforward network. This consists of an input layer, an output layer, and some number of hidden layers. In this setting, a layer is said to be fully connected when each of its neurons has a connection, with its own weight, to every neuron in the layer before it and the layer after it.
Moreover, each neuron has a bias and an activation function, which maps the neuron’s output into some interval before passing it along to the next layer of neurons. Typical activation functions are sigmoid functions, tanh functions, and ReLUs (rectified linear units); each maps the neuron’s output into a different interval.
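As a quick illustration of those intervals, the following snippet (using PyTorch, which we will also use for the implementation later) shows the ranges each activation maps into.

```python
import torch

x = torch.linspace(-5, 5, steps=11)

# sigmoid squashes values into (0, 1), tanh into (-1, 1), ReLU into [0, inf).
print(torch.sigmoid(x))
print(torch.tanh(x))
print(torch.relu(x))
```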
In convolutional neural networks, we simply replace the fully connected layers with convolutional layers. In the simplest case, we can connect the layers of a neural network with 1-d convolutions. Each convolutional layer has a few basic hyperparameters which we are free to choose:
- Filter/kernel size: the size of the support of the function we are convolving with; in the above examples, this would be 3. It is also called the filter size because we can think of the convolution as sliding a filter over the input, and the size of that filter is exactly the size of the kernel.
- Stride: the distance between the centers of consecutively applied kernels.
- Padding: specifies how the edges of the input are handled. With a padding of 0, the kernel is applied only where it fits entirely inside the input; otherwise, we can pad the edges with zeros or by reflecting the input symmetrically (see the short sketch after this list).
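Here is a short PyTorch sketch of how these hyperparameters appear in practice; the layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 10)  # (batch, channels, length)

# No padding: the kernel is only applied where it fully fits inside the input.
conv_valid = nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=0)

# Zero padding and reflection (symmetric) padding keep the output length at 10.
conv_zero = nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1, padding_mode='zeros')
conv_reflect = nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1, padding_mode='reflect')

print(conv_valid(x).shape, conv_zero(x).shape, conv_reflect(x).shape)
# torch.Size([1, 1, 8]) torch.Size([1, 1, 10]) torch.Size([1, 1, 10])
```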
Example of a 1-d Convolutional Layer:
Suppose we have a neural network that takes a 10-dimensional vector as input, and we want to apply a 1-dimensional convolutional layer with a filter size of 4, a stride of 2, and no padding. If we write out this first layer, we get the following:
We note that this layer has only 4 trainable parameters. There is a distance of two between the centers of adjacent neurons’ receptive fields, and each neuron sees a neighborhood of only 4 input elements. If we also allow the convolutional layer a bias term, the number of trainable parameters increases to 5.
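We can verify these numbers with a quick PyTorch sketch (the layer below is illustrative, not part of the network we build later):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 10)  # a 10-dimensional input as (batch, channels, length)

# Filter size 4, stride 2, no padding, no bias: 4 trainable parameters,
# and each of the 4 output neurons sees a window of 4 input elements.
conv = nn.Conv1d(1, 1, kernel_size=4, stride=2, padding=0, bias=False)
print(conv(x).shape)                               # torch.Size([1, 1, 4])
print(sum(p.numel() for p in conv.parameters()))   # 4

# Adding a bias term raises the count to 5.
conv_b = nn.Conv1d(1, 1, kernel_size=4, stride=2, padding=0, bias=True)
print(sum(p.numel() for p in conv_b.parameters()))  # 5
```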
Adding More Channels
In addition to kernel size, stride, and padding, we can also specify the number of output channels we want a given convolutional layer to produce. Adding more output channels allows us to extract more information from a given input tensor by performing several convolutions in parallel within the layer. For example, repeating the previous example with 3 output channels gives the following:
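A minimal sketch of the same layer with 3 output channels; three independent size-4 filters now run in parallel over the same input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 10)

# Three output channels: three independent size-4 filters applied to the same input,
# giving 3 * 4 = 12 weights (plus 3 biases if bias=True).
conv = nn.Conv1d(in_channels=1, out_channels=3, kernel_size=4, stride=2, bias=False)
print(conv(x).shape)                              # torch.Size([1, 3, 4])
print(sum(p.numel() for p in conv.parameters()))  # 12
```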
When we begin stacking convolutional layers, we must decide how to connect the layers so that we can compute the convolution. The analogue of a fully connected layer is to let the next convolutional layer compute its convolution over all the input channels of the preceding layer. If we wanted to add another convolutional layer with M output channels to this network, we could simply let each neuron ‘see’ all the channels within its receptive field, compute a convolution on each channel separately, and add the results together; we repeat this process for each neuron and each output channel. However, we can impose more structure if we desire by forming groups. By grouping the input channels, instead of computing the full convolution between layers we effectively implement two smaller convolutional layers side by side, each operating on half of the channels, and concatenate their results. This technique was used in the ImageNet deep CNN of [5]. A sketch of the difference is shown below.
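The sketch below contrasts the two options using PyTorch’s groups argument; the channel counts are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 10)  # 4 input channels

# groups=1 (default): every output channel convolves and sums over all 4 input channels.
full = nn.Conv1d(4, 8, kernel_size=3, groups=1, bias=False)

# groups=2: two smaller convolutions side by side, each seeing only 2 of the input
# channels and producing 4 of the output channels; their results are concatenated.
grouped = nn.Conv1d(4, 8, kernel_size=3, groups=2, bias=False)

print(sum(p.numel() for p in full.parameters()))     # 8 * 4 * 3 = 96
print(sum(p.numel() for p in grouped.parameters()))  # 2 * (4 * 2 * 3) = 48
print(full(x).shape, grouped(x).shape)               # both torch.Size([1, 8, 8])
```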
Creating a 2-d CNN
Suppose we want to create a 2-d CNN for image classification of 32 x 32 pixel images. We want to use 2-d convolutions at each layer with batch normalization and rectified linear units as our activation functions. Below is a figure with my proposed network topology.
Each convolutional layer doubles the number of channels and halves the output pixel dimension. I accomplish this by using a filter size of 4 with a stride of 2 and 1 pixel of zero padding. The final convolutional layer maps the 256 input channels into 10 output channels for classification. After each convolution, batch normalization [13] and a rectified linear unit [14] activation are applied to the result. We apply the above architecture to the MNIST data set in the following implementation.
Implementation:
First we need to import the necessary PyTorch and numpy libraries.
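The imports might look something like the following (the exact list in the original code may differ slightly):

```python
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
```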
Then I set the device to use a CUDA GPU when available, and disable cuDNN’s heuristic algorithms for reproducibility. If speed is the primary concern, you should remove these lines.
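A sketch of that setup; the specific seed value here is my own assumption:

```python
# Use the GPU when one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Disable cuDNN's heuristic/benchmarking algorithms and fix the seeds so runs are
# reproducible; remove these lines if speed matters more than reproducibility.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.manual_seed(0)   # illustrative seed, not necessarily the one used originally
np.random.seed(0)
```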
Next, we define the CNN class according to the aforementioned architecture, along with a weight-initialization function in the style of DCGAN [10].
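Below is a sketch consistent with the architecture described above. The exact channel widths are an assumption on my part (here the first convolution maps the single input channel to 64 feature maps, so the widths double to 128 and then 256 before the final layer), and the initialization follows the DCGAN convention of drawing convolutional weights from N(0, 0.02) [10]:

```python
class CNN(nn.Module):
    """Three conv layers mapping 1 -> 64 -> 128 -> 256 channels (kernel 4, stride 2,
    padding 1), each followed by batch norm and ReLU, then a final conv mapping the
    256 channels to the 10 class scores."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),     # 32x32 -> 16x16
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),   # 16x16 -> 8x8
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),  # 8x8 -> 4x4
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=4, stride=1, padding=0),  # 4x4 -> 1x1
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 10)


def weights_init(m):
    """DCGAN-style initialization [10]: conv weights ~ N(0, 0.02),
    batch-norm weights ~ N(1, 0.02) with zero bias."""
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0)
```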
Then, we define global parameters.
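A sketch of the globals; a batch size of 64 is consistent with the 938 steps per epoch seen in the example output below, while the learning rate is an assumed value:

```python
batch_size = 64       # 60,000 training images / 64 ≈ 938 steps per epoch
num_epochs = 1
learning_rate = 2e-4  # assumed value; the original run may have used something else
```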
We need to import the MNIST dataset. This will also download it if we do not already have it installed.
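A sketch of the data loading; MNIST images are 28×28, so I resize them to the 32×32 input the network expects:

```python
# Resize the 28x28 MNIST images to 32x32 and convert them to tensors.
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.MNIST(root='./data', train=True,
                                       download=True, transform=transform)
test_set = torchvision.datasets.MNIST(root='./data', train=False,
                                      download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False)
```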
We can define the training loop.
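A sketch of the training loop: standard cross-entropy loss with an Adam optimizer (the optimizer choice is an assumption), reporting the test-set loss twice per epoch as in the example output below:

```python
def evaluate_loss(model, loader, criterion):
    """Average loss over a data loader (used for the periodic test-loss printout)."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            total += criterion(model(images), labels).item() * labels.size(0)
            count += labels.size(0)
    model.train()
    return total / count


def train(model, train_loader, test_loader, num_epochs, learning_rate):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    losses = []
    total_steps = len(train_loader)
    model.train()
    for epoch in range(num_epochs):
        for step, (images, labels) in enumerate(train_loader, start=1):
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            loss = criterion(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            losses.append(loss.item())

            # Report the test-set loss twice per epoch, as in the example output below.
            if step % (total_steps // 2) == 0:
                test_loss = evaluate_loss(model, test_loader, criterion)
                print(f'>>Epoch [{epoch + 1}/{num_epochs}], '
                      f'Step [{step}/{total_steps}], test Loss: {test_loss:.4f}')
    return losses
```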
We can also create some helper functions to create plots of our loss and validate our model after training.
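Sketches of the helpers, one to plot the recorded losses and one to compute the test-set accuracy:

```python
def plot_losses(losses):
    """Plot the training loss recorded at each step."""
    plt.plot(losses)
    plt.xlabel('step')
    plt.ylabel('loss')
    plt.title('Training loss')
    plt.show()


def validate(model, test_loader):
    """Return the classification accuracy of the model on the test set, in percent."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            predictions = model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total
```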
Finally, we can create our model and run it.
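Putting it all together:

```python
model = CNN().to(device)
model.apply(weights_init)

# Loss of the untrained network on the test set, for reference.
print(f'>>Initial Loss : {evaluate_loss(model, test_loader, nn.CrossEntropyLoss()):.4f}')

losses = train(model, train_loader, test_loader, num_epochs, learning_rate)
plot_losses(losses)

print(f'>>pass rate for trained network = {validate(model, test_loader):.4f}%')
```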
An example output we get is:
>>Initial Loss : 2.3455
>>Epoch [1/1], Step [469/938], test Loss: 0.2047
>>Epoch [1/1], Step [938/938], test Loss: 0.1303
>>pass rate for trained network = 98.5400%
References:
[1] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. https://doi.org/10.1007/BF00344251
[2] Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. The Journal of Physiology, 148(3), 574–591. https://doi.org/10.1113/jphysiol.1959.sp006308
[3] Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1), 215–243. https://doi.org/10.1113/jphysiol.1968.sp008455
[4] LeCun, Y., Haffner, P., Bottou, L., & Bengio, Y. (1999). Object recognition with gradient-based learning. In D. A. Forsyth, J. L. Mundy, V. di Gesu, & R. Cipolla (Eds.), Shape, Contour and Grouping in Computer Vision (pp. 319–345). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1681). Springer Verlag. https://doi.org/10.1007/3-540-46805-6_19
[5] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25. https://doi.org/10.1145/3065386
[6] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. http://arxiv.org/abs/1409.1556
[7] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv:1512.03385. http://arxiv.org/abs/1512.03385
[8] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524. http://arxiv.org/abs/1311.2524
[9] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. arXiv:1505.04597. http://arxiv.org/abs/1505.04597
[10] Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434. http://arxiv.org/abs/1511.06434
[11] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv:1606.03657. http://arxiv.org/abs/1606.03657
[12] Mao, X., Li, Q., Xie, H., Lau, R. Y. K., Wang, Z., & Smolley, S. P. (2016). Least squares generative adversarial networks. arXiv:1611.04076. http://arxiv.org/abs/1611.04076
[13] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167. http://arxiv.org/abs/1502.03167
[14] Nair, V., & Hinton, G.E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML.
- Brian Loos