Nested CV

Misunderstanding of Nested CV and CV

Gridsearch with StandardScaler()

Apply the same preprocessing to all of your data. However, if the preprocessing depends on the data (e.g. standardization, PCA), compute its parameters on the training data only, then use those parameters to transform the validation and test data.

For example, if you are centering your data (subtracting the mean), calculate the mean on the training data ONLY and subtract that same mean from all of your data, i.e. subtract the training mean from the validation and test data as well. DO NOT calculate three separate means.
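The centering rule can be sketched with NumPy as follows. The arrays and their values are invented for illustration; only the pattern (one mean, taken from the training split, reused everywhere) comes from the text above.

```python
import numpy as np

# Toy one-feature data; the values are made up for illustration.
X_train = np.array([[1.0], [2.0], [3.0]])
X_val = np.array([[4.0], [6.0]])
X_test = np.array([[10.0]])

# Compute the mean on the training data only.
train_mean = X_train.mean(axis=0)

# Subtract that same training mean from every split.
X_train_c = X_train - train_mean
X_val_c = X_val - train_mean
X_test_c = X_test - train_mean
```

Only the training split ends up with a mean of exactly zero. The validation and test splits are shifted by the training mean, which is the intended behaviour.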

For cross-validation, compute the preprocessing parameters in each iteration on the training folds only, then apply them to the validation fold. If you afterwards train a final model on all of the data used for cross-validation, refit the preprocessing step on all of that data as well.
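Written out by hand, the per-fold procedure might look like the sketch below. The toy data, the choice of `KFold` with four splits, and the variable names are assumptions made for illustration. In practice a scikit-learn pipeline does the same bookkeeping automatically.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Deterministic toy data (hypothetical): 40 samples, 2 features, balanced labels.
X = np.column_stack([np.linspace(0.0, 10.0, 40), np.tile([1.0, 3.0], 20)])
y = (X[:, 0] > 5.0).astype(int)

scores = []
cv = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X):
    # Fit the scaler on the training folds of this iteration only.
    scaler = StandardScaler().fit(X[train_idx])
    model = LogisticRegression().fit(scaler.transform(X[train_idx]), y[train_idx])
    # Apply the training-fold statistics to the validation fold.
    scores.append(model.score(scaler.transform(X[val_idx]), y[val_idx]))

# Final model: refit the preprocessing on all of the CV data, then train on it.
final_scaler = StandardScaler().fit(X)
final_model = LogisticRegression().fit(final_scaler.transform(X), y)
```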

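from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline inside the search: the scaler is refit on the training folds of each split.
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

# Search inside the pipeline: the scaler is fit once on everything passed to fit(),
# so each validation fold has already influenced the scaling statistics.
# The estimator here is a bare LogisticRegression, so the parameter key is 'C'.
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))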
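As a rough check of the first construction, the sketch below fits the pipeline-in-search variant on invented toy data and inspects the parameter names. The data and variable names are assumptions for illustration. The prefix `logisticregression__` comes from `make_pipeline`, which names each step after its lowercased class name, while a bare `LogisticRegression` exposes the parameter simply as `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Deterministic toy data (hypothetical): 40 samples, 2 features, balanced labels.
X = np.column_stack([np.linspace(0.0, 10.0, 40), np.tile([1.0, 3.0], 20)])
y = (X[:, 0] > 5.0).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Parameter names as seen by GridSearchCV for each kind of estimator.
pipe_has_prefixed_key = 'logisticregression__C' in pipe.get_params()
bare_has_plain_key = 'C' in LogisticRegression().get_params()

# Scaling happens inside every CV split because the scaler is part of the estimator.
grid = GridSearchCV(pipe,
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
grid.fit(X, y)

# One mean validation score per candidate value of C.
mean_scores = grid.cv_results_['mean_test_score']
```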


