ICML'13
Ian J. Goodfellow - David Warde-Farley - Mehdi Mirza - Aaron Courville - Yoshua Bengio
May 3, 2018 - Antoine Toubhans
Abstract
We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout’s fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
A regularization technique for NN
The paper's thesis
Benchmarks
Dropout training is similar to bagging. [2]
[2] Breiman, Leo. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.
BAGGING = Bootstrap AGGregatING
Train different models on different subsets of the data
e.g., a random forest is bagging applied to decision trees (combined with random feature selection)
Error of the i-th model on the test set:
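The equation itself is missing from the extracted slide; what follows is a sketch of the standard bagging error analysis (as given in the Deep Learning book), where epsilon_i is the error of the i-th of k models on a test example, v its variance, and c the covariance between two different models' errors:

```latex
% Assumptions (sketch): zero-mean errors with
%   E[\epsilon_i] = 0,  E[\epsilon_i^2] = v,  E[\epsilon_i \epsilon_j] = c  (i \neq j).
% The bagged ensemble averages the k predictions, so its error is
% (1/k) \sum_i \epsilon_i, and its expected squared error is
\[
  \mathbb{E}\!\left[\left(\frac{1}{k}\sum_{i=1}^{k}\epsilon_i\right)^{\!2}\right]
  = \frac{v}{k} + \frac{k-1}{k}\,c .
\]
% If the errors are perfectly correlated (c = v), nothing is gained;
% if they are uncorrelated (c = 0), the expected squared error shrinks as v / k.
```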
Approximation of bagging an exponentially large number of neural networks.
For each training example: sample a random binary mask over the units and train the corresponding sub-network (all sub-networks share weights).
Arithmetic mean:
Intractable (there are exponentially many sub-networks)!
Geometric mean:
The trick: use geometric mean instead
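As a sketch (the notation below is assumed, not copied from the slides): write p(y | x; mu) for the prediction of the sub-network selected by dropout mask mu, with 2^d possible masks over d units.

```latex
% Arithmetic mean of the sub-networks' predictions -- a sum over 2^d masks,
% hence intractable to compute exactly:
\[
  p_{\mathrm{arith}}(y \mid x) = \frac{1}{2^{d}} \sum_{\mu} p(y \mid x; \mu).
\]
% Renormalized geometric mean -- the quantity the weight scaling rule targets:
\[
  p_{\mathrm{geo}}(y \mid x) \propto \Big( \prod_{\mu} p(y \mid x; \mu) \Big)^{1/2^{d}}.
\]
```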
If the model is a single softmax layer, $p(y \mid x; \mu) = \mathrm{softmax}\big(W^\top (x \odot \mu) + b\big)_y$,
Then the renormalized geometric mean over all $2^d$ masks is exactly the full model with its weights divided by 2: $\mathrm{softmax}\big(W^\top x / 2 + b\big)_y$.
Approximation: for deeper networks, the weight scaling rule (run the full network with the weights divided by 2) no longer computes the geometric mean exactly, but it works well empirically [6].
[6] Goodfellow, Bengio, and Courville. Deep Learning. MIT Press, 2016, p. 188.
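A minimal NumPy sketch of the two prediction rules for a single softmax layer (variable names, sizes, and the 0.5 drop probability are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n_classes = 8, 3
W = rng.normal(size=(d, n_classes))
b = rng.normal(size=n_classes)
x = rng.normal(size=d)

# Training time: dropout samples one sub-network per example by masking
# the input units with mu ~ Bernoulli(0.5).
mu = rng.integers(0, 2, size=d)
p_subnetwork = softmax((x * mu) @ W + b)

# Test time (weight scaling rule): a single forward pass with W / 2, which
# for this one-layer softmax equals the renormalized geometric mean over
# all 2^d sub-networks.
p_weight_scaled = softmax(x @ (W / 2) + b)
```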
"The maxout model is simply a feed-forward architecture[...] that uses a new type of activation function: the maxout unit." [1]
[1] Goodfellow et al. Maxout Networks. ICML 2013.
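For reference, the definition from [1]: given an input $x \in \mathbb{R}^d$, a maxout hidden layer computes

```latex
\[
  h_i(x) = \max_{j \in [1, k]} z_{ij},
  \qquad
  z_{ij} = x^\top W_{\cdot i j} + b_{ij},
\]
% where W \in R^{d x m x k} and b \in R^{m x k} are learned parameters:
% each of the m maxout units takes the max of its k affine pieces.
```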
How to compare activation functions:
[3] Glorot, Bordes, and Bengio. Deep Sparse Rectifier Neural Networks. AISTATS 2011.
In a convolutional network, a maxout feature map can be constructed by taking the maximum across k affine feature maps (i.e., pooling across channels, in addition to spatial locations).
Maxout ≠ max-pooling!
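A minimal NumPy sketch of the two maxout variants described above, a dense maxout layer and cross-channel maxout on convolutional feature maps (shapes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_dense(x, W, b):
    """Dense maxout layer: x (d,), W (d, m, k), b (m, k) -> (m,).
    Each of the m output units is the max of its k affine pieces."""
    z = np.einsum("d,dmk->mk", x, W) + b   # (m, k) affine pieces
    return z.max(axis=-1)                  # (m,)

def maxout_channels(feature_maps, k):
    """Convolutional maxout: pool across channels.
    feature_maps (c, h, w), with c divisible by k -> (c // k, h, w)."""
    c, h, w = feature_maps.shape
    return feature_maps.reshape(c // k, k, h, w).max(axis=1)

# Example usage
x = rng.normal(size=16)
W = rng.normal(size=(16, 4, 3))       # m = 4 maxout units, k = 3 pieces each
b = rng.normal(size=(4, 3))
h = maxout_dense(x, W, b)             # shape (4,)

fmaps = rng.normal(size=(6, 8, 8))    # 6 affine feature maps
pooled = maxout_channels(fmaps, k=2)  # shape (3, 8, 8)
```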
Any feedforward NN with a single hidden layer, any "squashing" activation function, and a linear output layer is a universal approximator* [5]
[5] Hornik, Stinchcombe, and White. Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 1989.
A one-layer perceptron cannot represent the XOR function [4]
[4] Minsky and Papert. Perceptrons. MIT Press, 1969.
(*) provided sufficiently many hidden units are available
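To connect this back to maxout (a sketch of the paper's argument, with illustrative parameter choices): a maxout unit is the max of k affine pieces, so with enough pieces it can approximate any convex function, and with only k = 2 pieces it already contains familiar activations as special cases:

```latex
\[
  \max\big(w_1^\top x + b_1,\; w_2^\top x + b_2\big)
\]
% With w_2 = 0 and b_1 = b_2 = 0:  max(w_1^T x, 0), the rectifier.
% With w_2 = -w_1 and b_1 = b_2 = 0:  |w_1^T x|, the absolute value.
```

Building on this, the paper shows that a maxout network with just two maxout hidden units (each with sufficiently many affine pieces) is itself a universal approximator.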
4 datasets, comparison to the state of the art (2013):
MNIST: 60,000 x 28 x 28 x 1, 10 classes
CIFAR-10: 60,000 x 32 x 32 x 3, 10 classes
CIFAR-100: 60,000 x 32 x 32 x 3, 100 classes
SVHN: >600K x 32 x 32 x 3, identify the digits in the images
The permutation-invariant version (the model cannot exploit the 2D structure of the images):
The CNN version:
Why are the results better?
4 models
Cross-validation for choosing learning rate and momentum
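A sketch of what such a search could look like (the grid values and the stand-in scoring function below are hypothetical, not the paper's actual search space):

```python
from itertools import product

# Hypothetical search grid -- the actual ranges are not given on the slide.
learning_rates = [0.05, 0.1, 0.2]
momenta = [0.5, 0.9, 0.95]

def validation_error(lr, momentum):
    """Stand-in for: train the network with these hyperparameters and
    return its error on a held-out validation set."""
    return (lr - 0.1) ** 2 + (momentum - 0.9) ** 2  # dummy score for the demo

best_lr, best_momentum = min(product(learning_rates, momenta),
                             key=lambda hp: validation_error(*hp))
print(best_lr, best_momentum)
```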
Having demonstrated that maxout networks are effective models, we now analyze the reasons for their success. We first identify reasons that maxout is highly compatible with dropout’s approximate model averaging technique. [1]
Averaging dropout's sub-models ≈ dividing the weights by 2 (the weight scaling rule)
The authors' hypothesis: dropout training encourages maxout units to have large linear regions around inputs that appear in the training data.
Empirical verification:
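The verification figures themselves are not reproduced here; the following is a sketch of the kind of check involved, comparing the weight-scaled prediction of a single softmax layer against a Monte Carlo estimate of the geometric mean over sampled dropout masks (all names and the sample count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n_classes, n_samples = 8, 3, 20000
W = rng.normal(size=(d, n_classes))
b = rng.normal(size=n_classes)
x = rng.normal(size=d)

# Monte Carlo estimate of the renormalized geometric mean: average the
# sub-networks' log-probabilities over sampled masks, then renormalize.
log_ps = np.zeros(n_classes)
for _ in range(n_samples):
    mu = rng.integers(0, 2, size=d)              # Bernoulli(0.5) mask
    log_ps += np.log(softmax((x * mu) @ W + b))
geo_mean = softmax(log_ps / n_samples)

# Weight scaling rule: a single forward pass with W / 2.
scaled = softmax(x @ (W / 2) + b)

# The gap shrinks toward zero as n_samples grows (exact in the limit for a
# single softmax layer; only approximate for deeper networks).
print(np.abs(geo_mean - scaled).max())
```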
The second key reason maxout performs well is that it improves the bagging-style training phase of dropout.
The authors' hypothesis: when trained with dropout, maxout is easier to optimize than rectified linear units with cross-channel pooling.
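A one-line way to see the difference (a sketch of the paper's argument): cross-channel max pooling over k rectified linear pieces computes

```latex
\[
  \max\big(0,\; z_{i1},\; \dots,\; z_{ik}\big),
  \qquad\text{whereas a maxout unit computes}\qquad
  \max\big(z_{i1},\; \dots,\; z_{ik}\big).
\]
% The hard-coded 0 piece in the rectifier version cannot be adjusted by
% learning; when every z_{ij} is negative the unit saturates at 0 and blocks
% the gradient, which the authors argue makes optimization under dropout harder.
```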
Empirical verification I:
Empirical verification II:
Who's next? What's next?
You decide.