machine learning - CV error larger than test set prediction error
I'm using scikit-learn's RandomForestRegressor to build a model of one of my data sets, along with GridSearchCV to determine the model hyperparameters. To evaluate the predictive capability of the model, I'm splitting the total data set into train and test sets with an 80/20 split. Model selection using grid search is performed on the train set, and the best model is then used to predict on the test set. I'm consistently seeing that the CV R^2 score of the best grid-searched model is lower than the R^2 score when using that model to predict the independent test data. This persists across multiple random train/test splits. This behavior seems pretty odd to me, and I'm not sure if I'm doing something wrong, if this behavior is normal, or if it's possible that the data is quirky.
The pertinent code is below (I'm using PCA on the input features as part of the modeling pipeline). The input data consists of 3 features (after PCA), the target data consists of 5 features, and the data set contains 100 samples.
    # imports (module paths of the pre-0.18 scikit-learn API used here)
    from sklearn.cross_validation import KFold, train_test_split
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.grid_search import GridSearchCV
    from sklearn.pipeline import Pipeline

    # split data into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data, test_size=0.2, random_state=1)

    # feature transformation and model fitting pipeline
    pca = PCA(n_components=3)
    clf = RandomForestRegressor(n_estimators=200, random_state=0)
    pipe = Pipeline([('pca', pca), ('rf', clf)])

    # cross validation
    cv = KFold(len(y_train), n_folds=10, random_state=0)

    # grid search
    grid_max_depth = [1, 2, 3, None]
    grid_max_features = [1, 2, 3, 'auto']
    grid_min_samples_split = [1, 2, 3]
    grid_min_samples_leaf = [1, 2, 3]
    param_grid = {'rf__max_depth': grid_max_depth,
                  'rf__max_features': grid_max_features,
                  'rf__min_samples_split': grid_min_samples_split,
                  'rf__min_samples_leaf': grid_min_samples_leaf}
    clf_grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='r2',
                            verbose=1, n_jobs=2)

    # fit model
    clf_grid.fit(x_train, y_train)

    # best cross-validation score from grid search
    cv_score = clf_grid.best_score_

    # predict independent test data, score the prediction
    test_score = clf_grid.best_estimator_.score(x_test, y_test)
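For anyone on a newer scikit-learn release, the same procedure can be sketched roughly as follows with the `sklearn.model_selection` API. This is only an illustration under several assumptions that are not part of my actual setup: the data is a synthetic stand-in from `make_regression`, the grid and the number of trees are reduced so it runs quickly, `shuffle=True` is added to `KFold` (without it `random_state` has no effect on the folds), and the names `grid` and `fold_scores` are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 100 samples, 6 raw features, 5 targets
X, y = make_regression(n_samples=100, n_features=6, n_informative=4,
                       n_targets=5, noise=10.0, random_state=0)

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# PCA down to 3 components, then a random forest (fewer trees than the original)
pipe = Pipeline([
    ("pca", PCA(n_components=3)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# 10-fold CV on the 80 training rows: 8 validation rows per fold
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Reduced grid (assumption) to keep the run short
param_grid = {
    "rf__max_depth": [2, None],
    "rf__min_samples_leaf": [1, 3],
}

grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="r2")
grid.fit(X_train, y_train)

# Mean CV R^2 of the best parameter combination
cv_score = grid.best_score_

# R^2 of the refitted best model on the held-out test set
test_score = grid.score(X_test, y_test)

# Per-fold R^2 of the best combination, to see how much the folds vary
best = grid.best_index_
fold_scores = [grid.cv_results_[f"split{i}_test_score"][best]
               for i in range(cv.get_n_splits())]

print(f"CV R^2:   {cv_score:.4f}")
print(f"test R^2: {test_score:.4f}")
print("per-fold R^2:", [round(float(s), 3) for s in fold_scores])
```

Printing the per-fold scores alongside the mean might help judge how noisy a single fold's R^2 is when it is computed on only 8 rows.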
Example CV and test set R^2 scores for 5 different random train/test splits are:
    cv_score | test_score
    ---------------------
    0.4556   | 0.6061
    0.5005   | 0.6568
    0.4566   | 0.5293
    0.4767   | 0.6806
    0.5222   | 0.6404
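To put a number on the gap, here is a small sketch that summarizes the five pairs of scores listed above. It uses only those ten values; the variable names are just for illustration.

```python
from statistics import mean

# (cv_score, test_score) pairs copied from the table above
scores = [
    (0.4556, 0.6061),
    (0.5005, 0.6568),
    (0.4566, 0.5293),
    (0.4767, 0.6806),
    (0.5222, 0.6404),
]

# Per-split gap: test R^2 minus CV R^2
gaps = [test - cv for cv, test in scores]

mean_cv = mean(cv for cv, _ in scores)
mean_test = mean(test for _, test in scores)
mean_gap = mean(gaps)

print(f"mean CV R^2:   {mean_cv:.4f}")
print(f"mean test R^2: {mean_test:.4f}")
print(f"mean gap:      {mean_gap:.4f}")
print("test > CV in every split:", all(g > 0 for g in gaps))
```

In all five splits the test score is higher than the CV score, by about 0.14 on average.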
Any help or insight is appreciated!
machine-learning scikit-learn prediction random-forest cross-validation