Skorch inference in cross validation #1062
Okay, so if I understand your question correctly, you have run a grid search and determined the best hyper-parameters. Now you would like to take those best parameters and train a new net with these values. One way to achieve this is to simply redefine the net and pass the parameters as returned by `best_params_`. However, it appears that you also try to load the net params (not the hyper-params) of the best model. This should not be done. Instead, create a new net and fit it on the whole training data (never fit on test data!). In simplified code, this would look something like this:

# setup
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV
from skorch import NeuralNetClassifier

X_train, y_train, X_test, y_test = ...
hyper_params = {...}
net = NeuralNetClassifier(...)

# perform grid search
grid_search = GridSearchCV(net, hyper_params, scoring=scoring, ...)
grid_search.fit(X_train, y_train)

# apply the best hyper-params to the net
print("Applying best hyper-parameters:", grid_search.best_params_)
net.set_params(**grid_search.best_params_)

# train the net with the best hyper-params using the training data
net.fit(X_train, y_train)

# now evaluate the model on the test data
y_pred = net.predict(X_test)
prec_macro = precision_score(y_test, y_pred, average="macro")
...
Many thanks for your tips. If I understand correctly, cross validation is mainly used for tuning the best hyper-parameters; using it to prevent overfitting is not common, and how to use the final model on new data is not well addressed, since the model with the best parameters on the training data may not be the best on new data. I tried 5 different models for 5-fold CV in PyTorch; all accuracies on the training data were above 70%, but some of these models had an accuracy under 50% and one of them was at 32% on new data. Of course, for the final accuracy we select the best model on the new data. So is it enough to save the best model from cross validation on the training data, or do we need to save all the models?
Just to be clear, grid search (and similar methods like randomized search) are intended to figure out the best hyper-parameters, which I think is what you intend to do here. They do this by trying out a bunch of different sets of hyper-parameters and, for each set, running a cross validation (which you could also do manually with sklearn's cross validation utilities).

Preventing overfitting is not necessarily a goal in and of itself. E.g. you could use a super small model with only 1 parameter, which will probably not overfit, but this model will also be very bad overall. What you most likely want is a model that works really well on real-world data. This model could be overfitting, but do you really care that much if it still works really well? For many problems, you can basically ignore the training scores and just look at the validation scores that the grid search reports.
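For illustration, here is a minimal sketch of running such a cross validation manually with sklearn's `cross_val_score` on a skorch net (the module class `MyModule`, the fold count, and the other parameters are assumptions, not taken from this issue):

from sklearn.model_selection import cross_val_score
from skorch import NeuralNetClassifier

# MyModule is a placeholder for your own PyTorch module
net = NeuralNetClassifier(MyModule, max_epochs=20, lr=0.01)

# each fold trains a fresh clone of the net and reports its validation score
scores = cross_val_score(net, X_train, y_train, cv=5, scoring="accuracy")
print("per-fold validation accuracy:", scores, "mean:", scores.mean())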
Is it the accuracy on the training data? Or on the validation data? It's important to be precise here. Remember that sklearn will generally report scores on the validation data during grid search.

Moreover, it sounds like you have a separate test set that you use to check the final model, right? Of course, this is a good thing. However, if you find that the validation score and the test score are very different, this is a bad sign. There could be a couple of reasons why they are different: your dataset size could be too small, the split between train/validation/test might not be random or not representative (leading to bias), there could be leakage or duplicates in the data, etc. Before you proceed, you should figure out why the validation scores and test scores are so different.
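To make the distinction concrete, a small sketch, assuming the `grid_search` from the snippet above was fit with the default `refit=True`: `best_score_` is a validation score, while scoring on `X_test` is the test score.

# mean cross-validation (i.e. validation) score of the best hyper-parameter set
print("validation score:", grid_search.best_score_)
# score of the refitted best estimator on the held-out test set
print("test score:", grid_search.score(X_test, y_test))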
Just make sure that you don't select the model based on the best score on the test data, as this leads to overfitting on the test data.
This is a misunderstanding. Don't use any of the models that are trained for cross validation or grid search. These models are only there to get an evaluation of how well the model works. Once you have that, train a model with the best hyper-parameters that you determined earlier using the training data, and finally evaluate it on the test set.
Thanks for your reply; yes, it's the validation accuracy.
My sample size is small: 616 in total, 437 for cross validation and 179 for testing the final model. I also have duplicate data, but I split the data based on ID, and of course I tried the model with the duplicates removed; the results were not different. The training and validation accuracies differ because the classes are imbalanced; I managed that through the loss function, but the test accuracy is still different.
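As an aside, handling class imbalance through the loss function can be expressed directly in skorch by passing a weight tensor to the default `CrossEntropyLoss`; a minimal sketch (the module name and the weight values are placeholders, not taken from this issue):

import torch
from torch import nn
from skorch import NeuralNetClassifier

# hypothetical class weights, e.g. inverse class frequencies computed from y_train
class_weights = torch.tensor([1.0, 3.5], dtype=torch.float32)

net = NeuralNetClassifier(
    MyModule,                         # placeholder for your PyTorch module
    criterion=nn.CrossEntropyLoss,
    criterion__weight=class_weights,  # passed through to the criterion
    max_epochs=20,
)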
Why not? Which model should be selected?
If I understand your recommendation correctly, you mean that model selection should be done after testing the models on the final test data.
This is indeed very small and I assume you cannot easily get more data. You could try increasing the number of cv splits to make the results more robust.
If you have control over the splits, you could try using a stratified split to reduce class imbalance.
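A sketch of what that could look like, assuming sklearn is used for the splits and reusing the names from the earlier snippet (10 splits is an arbitrary choice; note that `GridSearchCV` already uses stratified folds for classifiers by default):

from sklearn.model_selection import StratifiedKFold

# more splits and explicit stratification to make the CV estimate more robust
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
grid_search = GridSearchCV(net, hyper_params, scoring=scoring, cv=cv)
grid_search.fit(X_train, y_train)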
Select the model based on the best validation score (i.e. data split off from the train dataset). If you use the test data for model selection, it effectively becomes validation data and thus your model will not generalize as well as it could.
The steps would be as follows:
1. Run the grid search (with cross validation) on the training data.
2. Pick the hyper-parameters with the best validation score.
3. Train a new net with these hyper-parameters on the whole training data.
4. Evaluate that net once on the test data.
Don't make further adjustments to the model based on the test scores or else you will overfit on the test set.
Thanks again for your tips and clarification.
Hello everybody,
I'm using cross validation (CV) for a classification problem. I split my data into train and test sets; I used the train data for the CV model and the test data for the inference step.
My code for CV and early stopping at the same time is:
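(The original code is not included above; the following is only a minimal sketch of what such a setup might look like with skorch's `EarlyStopping` callback inside sklearn's cross validation. The module name, epoch count, and patience are assumptions.)

from sklearn.model_selection import cross_val_score
from skorch import NeuralNetClassifier
from skorch.callbacks import EarlyStopping

net = NeuralNetClassifier(
    MyModule,                                # placeholder for the PyTorch module
    max_epochs=100,
    callbacks=[EarlyStopping(patience=10)],  # stop when the validation loss stops improving
)
scores = cross_val_score(net, X_train, y_train, cv=5)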
Now when I try to save the model with the following code, it returns the error "NotInitializedError: Cannot save state of an un-initialized model. Please initialize first by calling .initialize() or by fitting the model with .fit(...)."
Is this code reliable for CV? I want to save 5 separate models for the 5-fold CV, but I could not find any related documentation. Any advice is appreciated.
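For context, if the net was passed to something like `cross_val_score`, the fitting happens on clones of the net, so the original `net` object is never fitted; that is typically why `save_params` raises `NotInitializedError`. A minimal sketch of fitting and saving one model per fold manually, assuming `X_train`/`y_train` are numpy arrays and `net` is a skorch net (the file names and splitter settings are just examples):

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, valid_idx) in enumerate(cv.split(X_train, y_train)):
    fold_net = clone(net)                                   # fresh, unfitted copy for this fold
    fold_net.fit(X_train[train_idx], y_train[train_idx])
    y_pred = fold_net.predict(X_train[valid_idx])
    print("fold", i, "validation accuracy:", accuracy_score(y_train[valid_idx], y_pred))
    fold_net.save_params(f_params=f"model_fold_{i}.pkl")     # saves the module weights for this fold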