Scikit‑Learn Model Validation, GridSearch, and Advanced Best Practices (Part 3)


Scikit-Learn (Part 3)

Validation, GridSearch and advanced best practices: how to evaluate models without fooling yourself

In Parts 1 and 2 we built the foundations:

  • you understood Estimators, Transformers and Predictors
  • you learned to chain them into a real Pipeline
  • you avoided leakage and kept train and test consistent

Now comes the part that separates a pretty notebook from an honest, reliable ML system:

model validation and tuning.

Because training a model is easy. Evaluating it well is the hard part. And almost everyone gets it wrong without realizing.

Let’s go step by step.

Diagram showing cross‑validation, GridSearch, and nested CV in a Scikit‑Learn pipeline.


1. The problem: naive validation

Plenty of tutorials do this:

python
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

It works. But it has serious problems:

  • it relies way too much on a single split
  • it doesn’t catch the model’s variability
  • it doesn’t tell you whether your model is stable
  • it doesn’t tell you whether your model generalizes
  • and if you preprocessed outside the Pipeline → guaranteed leakage

Think about it: a single split gives you a single number. And a single number has no context. Is that 0.91 accuracy real, or did you just get an easy test set by chance? You have no way of knowing.

Proper validation isn’t a final step. It’s part of the system’s design.


2. Cross-Validation: the foundation of honest evaluation

Cross-validation (CV) splits the data into several folds and repeats training multiple times, rotating which fold is used for validation.

python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5)
scores.mean(), scores.std()

Why does it matter so much?

  • it reduces the variance of the evaluation
  • it catches unstable models
  • it avoids relying on a single split
  • it evaluates the preprocessing inside the Pipeline
  • it avoids leakage automatically

And pay attention to a detail almost nobody looks at:scores.std().

The mean tells you how good your model is. The standard deviation tells you how reliable that mean is. Two models withmean = 0.88are not equal if one hasstd = 0.01and the otherstd = 0.09. The second one is a rollercoaster: in production it might hand you a 0.79 on any given day.

If you don’t use CV, you’re not evaluating your model. You’re playing the lottery.


3. StratifiedKFold and choosing the scoring

Two details that make the difference between a toy validation and a serious one.

Stratification. In classification, if your classes are imbalanced (90% A, 10% B), a plain KFold can leave you with a fold that has almost no examples of class B. Scikit-Learn knows this and, for classifiers,cross_val_scorealready usesStratifiedKFoldby default. But it’s worth being explicit when you control the split:

python
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv)

The metric.accuracyis misleading with imbalanced classes: a model that always predicts the majority class can hit 90% and still be useless. Pick the metric based on the problem:

python
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1_macro')

Some options depending on the case:

  • f1_macro/balanced_accuracy→ imbalanced classes
  • roc_auc→ ranking and probabilities
  • neg_root_mean_squared_error→ regression
  • precision/recall→ when one type of error hurts more than the other

The right question isn’t «what accuracy do I have?», but «which error costs me money?».


4. GridSearchCV: tuning without breaking the Pipeline

Manual tuning is slow, fragile and error-prone.GridSearchCVautomates the hyperparameter search and does it inside the Pipeline, so there’s no leakage.

python
from sklearn.model_selection import GridSearchCV

params = {
    'prep__num__with_mean': [True, False],
    'model__C': [0.1, 1, 10]
}

grid = GridSearchCV(pipeline, params, cv=5)
grid.fit(X, y)

The magic is in the syntax:

prep__num__with_mean

which reads like this:

pipeline → step 'prep' → block 'num' → parameter 'with_mean'

Double underscores (__) to drill down one level. With this you can tune the preprocessing, the model, or any step of the Pipeline, all in a single call and without leakage.

But watch out for the cost. GridSearch tries every combination, and that multiplies fast:

2 (with_mean) × 3 (C) × 5 (folds) = 30 trainings

Add a third parameter with 4 values and you’re already at 120. That’s the combinatorial explosion: with large grids, GridSearch becomes unaffordable.

When it finishes, the useful stuff is here:

python
grid.best_params_      # the best combination
grid.best_score_       # its CV score
grid.best_estimator_   # the Pipeline already refit on all of X

5. RandomizedSearchCV: smart tuning

GridSearch tries every combination. RandomizedSearch tries only some, but well chosen, sampling from distributions you define.

python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

params = {
    'model__C': uniform(0.01, 10)
}

search = RandomizedSearchCV(pipeline, params, n_iter=20, cv=5)
search.fit(X, y)

Advantages:

  • faster
  • more efficient
  • ideal for large spaces
  • avoids useless combinations

The key isn_iter: you decide how much budget to spend. Instead of trying everything, you set 20 attempts and let chance explore. In wide hyperparameter spaces, RandomizedSearch usually finds something almost as good as GridSearch in a fraction of the time.

Rule of thumb: small, discrete grid → GridSearch. Large or continuous space → RandomizedSearch.


6. Nested Cross-Validation: the definitive evaluation

Here’s the trap almost nobody sees: if you use the same CV to choose the hyperparameters and to report the result, you’re cheating without meaning to. You optimized against those folds, so that score is already contaminated, it’s optimistic.

The solution is nested CV:

  • outer CV → evaluates the model
  • inner CV → runs the GridSearch
python
from sklearn.model_selection import cross_val_score, KFold

outer = KFold(n_splits=5)
scores = cross_val_score(grid, X, y, cv=outer)

What happens under the hood: on each outer fold, thegridruns its own hyperparameter search using only that fold’s training data, and is then evaluated on the outer test set, which it never saw during tuning. The result: an honest score, with hyperparameters that weren’t tuned against the evaluation set.

It’s more expensive (a full GridSearch for every outer fold), but it’s what’s used in papers, benchmarks and serious production.


7. Advanced best practices

✔️ 1. Anything that learns from the data → inside the Pipeline. Scaler, PCA, imputation, encoding… everything. If it fits to the data, it goes inside.

✔️ 2. Never usefit_transformoutside the Pipeline. It’s the most common form of leakage, and the most silent.

✔️ 3. Always evaluate with CV. A single split isn’t enough. And look at thestd, not just the mean.

✔️ 4. Tune hyperparameters inside the Pipeline. Not outside. Preprocessing has parameters worth tuning too.

✔️ 5. Use the right metric.accuracyis useless with imbalanced classes.

✔️ 6. Save the full Pipeline, not the model. A model without preprocessing is worthless in production.


8. Final example: evaluation + tuning + production

python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'gender'])
])

pipeline = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression())
])

params = {
    'model__C': [0.1, 1, 10]
}

grid = GridSearchCV(pipeline, params, cv=5, scoring='f1_macro')
grid.fit(X, y)

best_model = grid.best_estimator_

And you save it:

python
import joblib
joblib.dump(best_model, 'pipeline.pkl')

This is production-grade. Validated, tuned, with the right metric, and ready to deploy as a single object.


TL;DR

  • Proper validation is part of the design, not a final step.
  • Cross-Validation is mandatory for honest model evaluation, and thestdmatters as much as the mean.
  • Use StratifiedKFold and the right metric;accuracylies with imbalanced classes.
  • GridSearchCV tries everything (mind the combinatorial explosion); RandomizedSearchCV explores with a fixed budget.
  • Nested CV is the most rigorous evaluation: it separates choosing hyperparameters from reporting results.
  • All preprocessing must live inside the Pipeline.
  • What you deploy isn’t a model: it’s a complete Pipeline.