Comparing Feature Pooling Methods and Fisher Vector in Convolutional Neural Networks


A great deal of research has been done in the field of image classification, the area of machine learning in which a computer determines what object or objects an image contains. A wide variety of machine learning methods has been designed for this task; they can loosely be divided into ‘old’- and ‘new’-school methods. However, few papers compare the developed methods with one another. The developers of a method usually compare their method with others, but are these comparisons unbiased?

In this work, we compare different configurations of the AlexNet convolutional neural network (CNN), and we compare one AlexNet configuration with a Fisher Vector neural network. The CNN configurations combine three feature pooling methods (average, maximum and stochastic pooling) with four data augmentation settings (no augmentation, crop, flip, and crop-and-flip).
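The three pooling methods differ only in how a window of activations is reduced to a single value. A minimal NumPy sketch makes the difference concrete (this is an illustration, not the Caffe implementation used in the experiments; the 2x2 window, stride and function name are chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def pool2x2(feature_map, mode):
    """Pool a (H, W) feature map with 2x2 windows and stride 2."""
    h, w = feature_map.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            window = feature_map[i:i + 2, j:j + 2]
            if mode == "max":        # keep the strongest activation
                out[i // 2, j // 2] = window.max()
            elif mode == "average":  # smooth over the window
                out[i // 2, j // 2] = window.mean()
            elif mode == "stochastic":
                # Sample one activation with probability proportional to
                # its value (assumes non-negative activations, e.g. after
                # a ReLU); fall back to uniform for an all-zero window.
                p = window.flatten()
                total = p.sum()
                p = p / total if total > 0 else np.full(4, 0.25)
                out[i // 2, j // 2] = rng.choice(window.flatten(), p=p)
    return out

fmap = np.array([[1., 2., 0., 0.],
                 [3., 4., 0., 8.],
                 [0., 0., 5., 5.],
                 [0., 0., 5., 5.]])
print(pool2x2(fmap, "max"))      # [[4., 8.], [0., 5.]]
print(pool2x2(fmap, "average"))  # [[2.5, 2.], [0., 5.]]
```

At test time, stochastic pooling is typically replaced by a probability-weighted average of the window rather than a sample, so that the output is deterministic.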

We used the Caffe deep learning framework to set up and implement the various network configurations, giving twelve different setups for our experiments. We ran them on the ImageNet LSVRC-2012 dataset, which consists of 1.2 million training images and 50,000 validation images spread over 1,000 categories. For the Fisher Vector neural network experiments, we used a sample of the ILSVRC-2012 dataset containing 10,000 training and 1,000 validation images.
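In Caffe, switching between the pooling methods is a one-line change in the network definition. A sketch of a pooling layer in Caffe's prototxt format (the layer and blob names here are illustrative, not taken from our configuration files):

```
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX        # or AVE / STOCHASTIC
    kernel_size: 3
    stride: 2
  }
}
```

Keeping the rest of the network definition fixed and varying only the `pool` field (and the data augmentation options of the input layer) is what makes the twelve setups directly comparable.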

Unfortunately, we could not get the Fisher Vector neural network to train, so we have no results for those experiments. We do have results for the other experiments, and they show a clear difference between the pooling methods: maximum pooling performed best, with an error rate of 21.8%, in the configuration using crop and flip augmentation. No single augmentation setting was best across all feature pooling methods.
