This is an attempt to replicate the following paper, since the hyperparameter link given in the paper is no longer working.

Maxout Networks, arXiv:1302.4389 [stat.ML]
The following diagram shows the maxout module with multilayer perceptrons.
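
In code, the maxout operation from the paper is simply a maximum over k affine pieces. Below is a minimal sketch of such a layer, assuming PyTorch purely for illustration; the layer sizes are hypothetical and not necessarily the settings used in `mnist.py`.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """One maxout layer: k affine maps followed by an element-wise max over the pieces."""
    def __init__(self, in_features, out_features, num_pieces):
        super().__init__()
        self.out_features = out_features
        self.num_pieces = num_pieces
        self.linear = nn.Linear(in_features, out_features * num_pieces)

    def forward(self, x):
        z = self.linear(x)                                   # (batch, out_features * k)
        z = z.view(-1, self.out_features, self.num_pieces)   # (batch, out_features, k)
        return z.max(dim=2).values                           # max over the k pieces

# Hypothetical example: a 784 -> 2048 maxout block with 4 pieces on flattened MNIST images.
layer = Maxout(784, 2048, num_pieces=4)
out = layer(torch.randn(64, 784))   # out.shape == (64, 2048)
```
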
- Train (first 50,000 training samples): `python mnist.py --mlp 1 --train true`
- Validation (remaining 10,000 training samples): `python mnist.py --mlp 1 --valid true`
- Train continuation (whole training set, continued from the previous run): `python mnist.py --mlp 1 --train_cont true`
- Testing: `python mnist.py --mlp 1 --test true`
For the complete hyperparameter tuning, see the hyper-tuning.rst file.
Validation

| Epochs | Batch size | Layer1 layers | Layer1 neurons | Layer2 layers | Layer2 neurons | Accuracy (%) | Loss |
|---|---|---|---|---|---|---|---|
| 5 | 64 | 4 | 2048 | 2 | 10 | 97.79 | 1.5060 |
| 5 | 64 | 4 | 1024 | 2 | 10 | 97.44 | 1.5107 |
Training

| Epochs | Batch size | Layer1 layers | Layer1 neurons | Layer2 layers | Layer2 neurons | Accuracy (%) | Loss |
|---|---|---|---|---|---|---|---|
| 5 | 64 | 4 | 2048 | 2 | 10 | 96.94 | 1.5097 |
| 5 | 64 | 4 | 1024 | 2 | 10 | 96.83 | 1.5108 |
The model was then trained further on the whole training dataset, giving the following accuracy and loss.
Training with pretrained weights

| Epochs | Batch size | Layer1 layers | Layer1 neurons | Layer2 layers | Layer2 neurons | Accuracy (%) | Loss |
|---|---|---|---|---|---|---|---|
| 5 | 64 | 4 | 2048 | 2 | 10 | 99.02 | 1.4827 |
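
As a rough illustration of this continuation step (not the repository's actual code), the weights saved after validation can be reloaded and training resumed on the full training set. The checkpoint filename and learning rate below are assumptions, and PyTorch is used only for the sketch.

```python
import torch
import torch.nn as nn

# Stand-in for the maxout MLP sketched earlier; the real model lives in mnist.py.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

# After the validation run, the best weights would have been saved, e.g.:
#   torch.save(model.state_dict(), "mlp_best_valid.pt")   # hypothetical filename
# Train-continuation then reloads them and keeps optimising on the full
# 60,000-image training set (learning rate here is an assumption):
model.load_state_dict(torch.load("mlp_best_valid.pt"))
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# ... usual training loop over the whole training set ...
```
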
Testing

| Batch size | Layer1 layers | Layer1 neurons | Layer2 layers | Layer2 neurons | Accuracy (%) | Loss |
|---|---|---|---|---|---|---|
| 64 | 4 | 2048 | 2 | 10 | 97.17 | 1.5007 |
- Train (50,000 shuffled training samples): `python mnist.py --conv 1 --train true`
- Validation (remaining 10,000 training samples): `python mnist.py --conv 1 --valid true`
- Train continuation (whole training set, continued from the previous run): `python mnist.py --conv 1 --train_cont true`
- Testing: `python mnist.py --conv 1 --test true`
The learning rate is initially set to 0.01 and halved at epoch 5 when training on the 50,000 shuffled samples. The configuration with the lowest validation error is then retrained from the pretrained weights, this time starting from a learning rate of 0.001, again halved at epoch 5.
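
A compact way to express that schedule, assuming PyTorch and plain SGD (the optimizer actually used by `mnist.py` is an assumption here):

```python
import torch

def make_optimizer_and_scheduler(model, pretrained=False):
    # 0.01 for the first run on the 50,000 shuffled samples,
    # 0.001 when retraining from the pretrained weights.
    base_lr = 0.001 if pretrained else 0.01
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
    # Halve the learning rate once, at epoch 5.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5], gamma=0.5)
    return optimizer, scheduler

# Example usage with a throwaway model; call scheduler.step() once per epoch.
optimizer, scheduler = make_optimizer_and_scheduler(torch.nn.Linear(784, 10))
```
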
The architecture presented in the paper is as follows:

conv -> maxpool -> conv -> maxpool -> conv -> maxpool -> MLP -> softmax

The output of the MLP is 10 (one unit per class), and its input is whatever size comes out of the third maxpool. The only things I had to adjust were the kernel sizes and paddings of the convolutional layers, since those are the only adjustable parameters of this architecture.
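
As a sanity check on the "MLP in" column in the tables below: with stride-1 convolutions and 2 x 2 max-pooling at stride 1 (the values listed in the tables), each convolution maps a spatial size n to n - k + 2p + 1 and each pooling step shrinks it by 1. Starting from 28 x 28 MNIST images and flattening a single feature map, this reproduces the tabulated input sizes:

```python
def mlp_input_size(kernels, pads, image_size=28):
    """Spatial arithmetic for three (conv -> 2x2 maxpool) stages, both at stride 1."""
    size = image_size
    for k, p in zip(kernels, pads):
        size = size - k + 2 * p + 1   # convolution, stride 1
        size = size - 2 + 1           # 2 x 2 max-pooling, stride 1
    return size * size                # flattened input to the MLP

print(mlp_input_size([7, 5, 5], [3, 2, 2]))  # 625
print(mlp_input_size([5, 5, 5], [3, 2, 2]))  # 729
print(mlp_input_size([5, 3, 3], [3, 2, 2]))  # 961
print(mlp_input_size([5, 3, 3], [2, 2, 2]))  # 841
```
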

| Epochs | Batch | Conv1 (kernel, pad) | Maxpool1 (pool, stride) | Conv2 (kernel, pad) | Maxpool2 (pool, stride) | Conv3 (kernel, pad) | Maxpool3 (pool, stride) | MLP (in, out) | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 64 | 7 x 7, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 625, 10 | 97.09 | 1.4921 |
| 10 | 64 | 5 x 5, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 729, 10 | 87.62 | 1.5856 |
| 10 | 64 | 5 x 5, 3 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 961, 10 | 95.43 | 1.5088 |
| 10 | 64 | 5 x 5, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 841, 10 | 95.96 | 1.5037 |

| Batch | Conv1 (kernel, pad) | Maxpool1 (pool, stride) | Conv2 (kernel, pad) | Maxpool2 (pool, stride) | Conv3 (kernel, pad) | Maxpool3 (pool, stride) | MLP (in, out) | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|
| 64 | 7 x 7, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 625, 10 | 96.85 | 1.4928 |
| 64 | 5 x 5, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 729, 10 | 87.76 | 1.5828 |
| 64 | 5 x 5, 3 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 961, 10 | 95.16 | 1.5828 |
| 64 | 5 x 5, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 841, 10 | 96.15 | 1.5012 |

| Epochs | Batch | Conv1 (kernel, pad) | Maxpool1 (pool, stride) | Conv2 (kernel, pad) | Maxpool2 (pool, stride) | Conv3 (kernel, pad) | Maxpool3 (pool, stride) | MLP (in, out) | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 64 | 7 x 7, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 625, 10 | 97.58 | 1.4874 |
| 10 | 64 | 5 x 5, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 729, 10 | 88.04 | 1.5811 |
| 10 | 64 | 5 x 5, 3 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 961, 10 | 96.25 | 1.5011 |
| 10 | 64 | 5 x 5, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 841, 10 | 96.75 | 1.4960 |

| Batch | Conv1 (kernel, pad) | Maxpool1 (pool, stride) | Conv2 (kernel, pad) | Maxpool2 (pool, stride) | Conv3 (kernel, pad) | Maxpool3 (pool, stride) | MLP (in, out) | Acc % | Loss |
|---|---|---|---|---|---|---|---|---|---|
| 64 | 7 x 7, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 625, 10 | 96.87 | 1.4929 |
| 64 | 5 x 5, 3 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 5 x 5, 2 | 2 x 2, 1 | 729, 10 | 87.39 | 1.5861 |
| 64 | 5 x 5, 3 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 961, 10 | 95.52 | 1.5070 |
| 64 | 5 x 5, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 3 x 3, 2 | 2 x 2, 1 | 841, 10 | 96.30 | 1.4989 |