Is data distribution consistent between training and test sets? #17
Comments
One important detail: in both scenarios (i.e. the original train/test split and the second one with merging), the training set is shuffled prior to classifier training. This is to make sure the effect is not due to the ordering of the original training set.
Thank you for producing this! However, I noticed the same thing as @fruboes when it comes to KMNIST: test accuracy is significantly lower than I would expect. That said, this GitHub issue seems to be the only mention of this potential problem, so maybe I'm overlooking something. In any case, even using one of the benchmark files in this repository, kuzushiji_mnist_cnn.py, I can produce a test accuracy that is unusually low. To demonstrate it, I shuffled the training set and then reserved 50000 images for training and 10000 images purely for evaluation after training; the latter are used for comparison with test-set performance. Here is the modified code:

```python
# Based on MNIST CNN from Keras' examples: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py (MIT License)
from __future__ import print_function
import keras
import numpy as np
from keras import backend as K
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential
batch_size = 128
num_classes = 10
epochs = 12
# input image dimensions
img_rows, img_cols = 28, 28
def load(f):
return np.load(f)["arr_0"]
# Load the data
x_train = load("kmnist-train-imgs.npz")
x_test = load("kmnist-test-imgs.npz")
y_train = load("kmnist-train-labels.npz")
y_test = load("kmnist-test-labels.npz")
if K.image_data_format() == "channels_first":
x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
input_shape = (1, img_rows, img_cols)
else:
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255
print("{} train samples, {} test samples".format(len(x_train), len(x_test)))
# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
# Shuffle training data
np.random.seed(0)
perm = np.random.permutation(len(x_train))
x_train = x_train[perm]
y_train = y_train[perm]
# Split the shuffled training data: hold out the last len(x_test) images (10000)
# purely for post-training evaluation, leaving 50000 images for training
num_unused_train_images = len(x_test)
my_x_train = x_train[:-num_unused_train_images]
my_y_train = y_train[:-num_unused_train_images]
my_x_train_unused = x_train[-num_unused_train_images:]
my_y_train_unused = y_train[-num_unused_train_images:]
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation="softmax"))
model.compile(
loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=["accuracy"],
)
model.fit(
my_x_train,
my_y_train,
batch_size=batch_size,
epochs=epochs,
verbose=2,
)
train_score = model.evaluate(my_x_train, my_y_train, verbose=0)
unused_train_score = model.evaluate(my_x_train_unused, my_y_train_unused, verbose=0)
test_score = model.evaluate(x_test, y_test, verbose=0)
print("Train loss:", train_score[0])
print("Train accuracy:", train_score[1])
print("Unused train loss:", unused_train_score[0])
print("Unused train accuracy:", unused_train_score[1])
print("Test loss:", test_score[0])
print("Test accuracy:", test_score[1]) and the results:
Test accuracy is 55%, compared to 71% on the unused partition of the training set. The arXiv paper says that "data distributions of each class are consistent between the two sets", but, given these results, I find that statistically implausible. Or perhaps I made an error, in which case I would be grateful if you could point it out.
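One rough way to probe this would be a classifier two-sample test ("adversarial validation"): label the original training images 0 and the original test images 1, train a small model to tell them apart, and see whether it beats chance on held-out data. Below is a minimal sketch, assuming scikit-learn is available and reusing the scaled x_train and x_test arrays from the script above; an AUC near 0.5 would support the consistency claim, while clear separability would not.

```python
# Sketch of a classifier two-sample test on the original KMNIST splits.
# Assumes x_train and x_test are the scaled arrays loaded in the script above;
# scikit-learn is an extra dependency here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
sub = rng.permutation(len(x_train))[: len(x_test)]  # balance the two groups

x_all = np.concatenate([x_train[sub].reshape(len(sub), -1),
                        x_test.reshape(len(x_test), -1)])
origin = np.concatenate([np.zeros(len(sub)), np.ones(len(x_test))])  # 0 = train, 1 = test

xa, xb, ya, yb = train_test_split(x_all, origin, test_size=0.2,
                                  stratify=origin, random_state=0)
clf = LogisticRegression(max_iter=500)
clf.fit(xa, ya)
auc = roc_auc_score(yb, clf.predict_proba(xb)[:, 1])
print("train-vs-test separability AUC:", auc)  # ~0.5 if the two splits are exchangeable
```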
I wanted to see how classifier accuracy on Kuzushiji-MNIST behaves with varying training set size. My procedure was the following:
To my surprise, the classifier accuracy obtained with cross-validation was always significantly higher than the one measured on the test set (a rough sketch of the comparison follows the numbers below):
train_size: 200
train_size: 500
train_size: 1000
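Roughly, the comparison looked like the sketch below. The classifier choice (k-NN) and helper names are assumptions made only for illustration; the attached compare_xval_and_test.zip is the actual code used.

```python
# Sketch: compare 5-fold cross-validation accuracy on a small training subsample
# with accuracy on the original KMNIST test set, for several train sizes.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def load(f):
    return np.load(f)["arr_0"]

x_train = load("kmnist-train-imgs.npz").reshape(-1, 28 * 28) / 255.0
y_train = load("kmnist-train-labels.npz")
x_test = load("kmnist-test-imgs.npz").reshape(-1, 28 * 28) / 255.0
y_test = load("kmnist-test-labels.npz")

rng = np.random.RandomState(0)
for train_size in (200, 500, 1000):
    idx = rng.permutation(len(x_train))[:train_size]
    clf = KNeighborsClassifier(n_neighbors=3)
    cv_acc = cross_val_score(clf, x_train[idx], y_train[idx], cv=5).mean()
    clf.fit(x_train[idx], y_train[idx])
    test_acc = clf.score(x_test, y_test)
    print(f"train_size: {train_size}  cv: {cv_acc:.3f}  test: {test_acc:.3f}")
```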
After a fair amount of debugging I found that the effect completely disappears if I merge the train and test parts, shuffle the result, and then define new train and test datasets (60000 and 10000 images respectively, with class balancing ensured); a sketch of that re-splitting step follows the numbers below. For such datasets the results are consistent for both methods used:
train_size: 200
train_size: 500
train_size: 1000
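For reference, a minimal sketch of the merge-and-resplit step, using a stratified split to keep the classes balanced. The exact procedure in the attached script may differ; this is only an illustration.

```python
# Sketch: merge the original KMNIST splits, shuffle, and carve out a new
# class-balanced 60000/10000 train/test split.
import numpy as np
from sklearn.model_selection import train_test_split

def load(f):
    return np.load(f)["arr_0"]

x_all = np.concatenate([load("kmnist-train-imgs.npz"), load("kmnist-test-imgs.npz")])
y_all = np.concatenate([load("kmnist-train-labels.npz"), load("kmnist-test-labels.npz")])

# stratify=y_all keeps the per-class proportions equal in both new parts
new_x_train, new_x_test, new_y_train, new_y_test = train_test_split(
    x_all, y_all, test_size=10000, stratify=y_all, shuffle=True, random_state=0
)
print(new_x_train.shape, new_x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```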
The above may suggest that the original training and test parts of Kuzushiji-MNIST are somehow different. Could you have a look at this? Please find the test code producing the above results attached (compare_xval_and_test.zip).