4 June 2026

Scikit-Learn (Part 2)

The full Pipeline workflow: how to automate your ML like a pro

In Part 1 we covered the three pillars of Scikit-Learn:

Estimators
Transformers
Predictors

They work, sure. But chaining them by hand is fragile, repetitive and error-prone. And the moment your workflow grows to 5, 8 or 12 steps… it becomes flat-out unmanageable.

That’s why Scikit-Learn ships one of the most elegant ideas in its entire API:

The Pipeline.

A single object that:

chains steps together
prevents leakage
keeps train and test consistent
makes validation easy
enables hyperparameter automation
and leaves you with a system ready for production

Today you’re going to build a real one. Not the typical two-line example you see in every tutorial.

Let’s go step by step.

What problem does a Pipeline solve?

Picture this manual flow:

python

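from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# X_train, X_test and y_train are assumed to come from an earlier train/test split
scaler = StandardScaler()
pca = PCA(n_components=5)
model = LogisticRegression()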

scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca.fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

model.fit(X_train_pca, y_train)
y_pred = model.predict(X_test_pca)

It works. But it has serious problems:

If you change a step, you touch 6 lines.
If you add a step, you have to remember to apply it on train and test.
If you want cross-validation, you rewrite half the script.
If you want GridSearch, you nest functions by hand.
And one misplaced .fit_transform() → silent data leakage.

A Pipeline wipes all of that out in one shot.
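To make that last bullet concrete, here is a minimal sketch of what a misplaced .fit_transform() does. The data is synthetic and the test set is shifted on purpose (both are assumptions for illustration, not part of the original flow): the test set ends up scaled with its own statistics instead of the ones learned from train.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative): the test set is deliberately shifted
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
X_test = rng.normal(loc=3.0, scale=1.0, size=(50, 1))

# Correct: statistics are learned on train and only applied to test
scaler = StandardScaler().fit(X_train)
correct = scaler.transform(X_test)

# Misplaced fit_transform: the scaler re-learns mean and std from the test set
wrong = StandardScaler().fit_transform(X_test)

# The shift survives in the correct version and silently vanishes in the wrong one
print("correct mean:", correct.mean())
print("wrong mean:  ", wrong.mean())
```

No exception, no warning. The model would simply see test data that looks nothing like what it will see in production.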

What is a Pipeline?

A Pipeline is a composite Estimator.

That means:

it has .fit()
it has .predict()
and internally it runs every step in the right order

python

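from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression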

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('model', LogisticRegression())
])

Now your whole flow is a single object:

python

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

And Scikit-Learn takes care of:

calling .fit() only where it should
calling .transform() on every intermediate step
passing the data from one step to the next
keeping the order
avoiding leakage

You don’t write any of that. The library handles it.
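If you want to run the snippet above end to end, here is a small self-contained sketch. The dataset comes from make_classification and the split from train_test_split; both are stand-ins for your real data, not something the original example specifies.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("model", LogisticRegression()),
])

# One call fits every step; one call transforms and predicts
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = pipeline.score(X_test, y_test)

# Fitted steps stay reachable by name
n_components_kept = pipeline.named_steps["pca"].components_.shape[0]

# Slicing off the last step gives you the preprocessing part on its own
X_test_reduced = pipeline[:-1].transform(X_test)

print(accuracy, n_components_kept, X_test_reduced.shape)
```

named_steps and slicing are handy for debugging: you can inspect what the model actually receives without rebuilding anything by hand.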

The golden rule of the Pipeline

In a Pipeline:

Every step except the last must be a Transformer.
The last step is the final estimator: in a supervised workflow like this one, a Predictor.

Valid example:

Scaler → PCA → LogisticRegression

Invalid example:

Scaler → LogisticRegression → PCA

Why? Because a Predictor doesn’t have .transform(), so it can’t pass data to the next step. Only the last link gets to produce the final output.
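You can check the invalid ordering yourself. This is a hedged sketch on synthetic data: in the scikit-learn versions I'm aware of, the Pipeline rejects the bad order with a TypeError, either when it is built or when it is fitted, so the try block covers both.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only needed so that .fit() has something to work on
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

bad_order_rejected = False
try:
    # A Predictor in the middle: it has no .transform() to feed the PCA
    bad = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression()),
        ("pca", PCA(n_components=5)),
    ])
    bad.fit(X, y)
except TypeError as err:
    bad_order_rejected = True
    print("Rejected:", err)
```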

How it works under the hood

When you call:

python

pipeline.fit(X_train, y_train)

Scikit-Learn runs internally:

scaler.fit(X_train)
X1 = scaler.transform(X_train)
pca.fit(X1)
X2 = pca.transform(X1)
model.fit(X2, y_train)

And when you call:

python

pipeline.predict(X_test)

It runs:

X1 = scaler.transform(X_test)
X2 = pca.transform(X1)
model.predict(X2)

Without you writing a single extra line. And with no chance of getting the order wrong.
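The equivalence above can be verified directly. The sketch below runs the manual flow and the Pipeline side by side on synthetic data (an assumption for illustration) and compares the predictions. One detail: as far as I know, the Pipeline calls fit_transform on intermediate steps rather than fit followed by transform, so the manual version does the same to keep the comparison exact.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual flow: fit each step on train, then replay the transforms on test
scaler = StandardScaler()
pca = PCA(n_components=5, svd_solver="full")
model = LogisticRegression()
X1 = scaler.fit_transform(X_train)
X2 = pca.fit_transform(X1)
model.fit(X2, y_train)
manual_pred = model.predict(pca.transform(scaler.transform(X_test)))

# Same steps (fresh instances) wrapped in a Pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5, svd_solver="full")),
    ("model", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

same = np.array_equal(manual_pred, pipeline_pred)
print("Identical predictions:", same)
```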

ColumnTransformer: the secret weapon

In real projects, columns don’t all get the same treatment:

some need scaling
others need one-hot encoding
others need imputation for missing values
and some are left as-is

That’s exactly what ColumnTransformer is for.

python

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(), ['city', 'gender'])
])

And you drop it into the Pipeline as just another step:

python

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression())
])

Now your pipeline, in a single object:

scales the numerical features
encodes the categorical ones
trains the model
and stays ready to predict on new data with the exact same treatment

This is what separates a toy notebook from a real pipeline.
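The earlier list also mentioned imputation and columns that are left as-is. Here is one possible way to cover those two cases, sketched on a tiny made-up DataFrame. The is_member column, the median strategy and the nested numeric Pipeline are all illustrative choices, not part of the original example. Keep in mind that by default a ColumnTransformer drops any column you don't list; remainder="passthrough" keeps them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with missing values and one column we want untouched
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51],
    "salary": [30000.0, 45000.0, 52000.0, np.nan],
    "city": ["Madrid", "Lima", "Madrid", "Bogota"],
    "gender": ["F", "M", "F", "M"],
    "is_member": [1, 0, 1, 1],
})

# Numeric columns: impute first, then scale (a Pipeline nested in a block)
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "gender"]),
    ],
    remainder="passthrough",  # keep is_member as-is instead of dropping it
)

Xt = preprocess.fit_transform(df)

# 2 scaled numeric + 3 city columns + 2 gender columns + 1 passthrough
print(Xt.shape)  # (4, 8)
```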

Cross-validation with Pipelines (the right way)

Before, when you did:

python

from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)

you were validating the model, but not the preprocessing. And if you scaled with fit_transform over the entire X beforehand, leakage already happened the moment you did it.

With a Pipeline the problem disappears:

python

cross_val_score(pipeline, X, y, cv=5)

Now, on every fold, Scikit-Learn:

refits the scaler only on the fold’s training data
refits the PCA (or whatever other Transformers the Pipeline contains) only on the fold’s training data
refits the model only on the fold’s training data
evaluates on the fold’s validation set

Honest validation. No leakage. No pain.
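Here are both approaches side by side, as a minimal sketch on synthetic data (an assumption for illustration). On an easy dataset like this one the two sets of scores may come out almost identical; the point is where the scaler gets fitted, not the size of the gap.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Leaky: the scaler has already seen every row, including each fold's validation rows
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled_all, y, cv=5)

# Honest: the scaler is refitted inside each fold, on that fold's training rows only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
honest_scores = cross_val_score(pipeline, X, y, cv=5)

print("leaky: ", leaky_scores.mean())
print("honest:", honest_scores.mean())
```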

GridSearchCV with Pipelines

This is where Pipelines really shine.

python

from sklearn.model_selection import GridSearchCV

params = {
    'prep__num__with_mean': [True, False],
    'model__C': [0.1, 1, 10]
}

grid = GridSearchCV(pipeline, params, cv=5)
grid.fit(X, y)

The syntax prep__num__with_mean reads like this:

pipeline → step 'prep' → block 'num' → parameter 'with_mean'

Double underscores (__) mean “drill down one level”. With this you can tune any parameter of any step, preprocessing included, all in a single call.

And the best part: every combination is evaluated with honest cross-validation, because the refitting happens inside each fold.
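A runnable sketch of that search follows. It uses synthetic numeric data, so the ColumnTransformer selects columns by position instead of by name (an assumption made only to keep the example self-contained). It also shows get_params(), which lists every name you are allowed to use in the grid.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scale the first three columns, pass the rest through untouched
preprocess = ColumnTransformer(
    [("num", StandardScaler(), [0, 1, 2])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression()),
])

# Every tunable name, with the double-underscore path already spelled out
available = sorted(pipeline.get_params().keys())

params = {
    "prep__num__with_mean": [True, False],
    "model__C": [0.1, 1, 10],
}

grid = GridSearchCV(pipeline, params, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)

# With the default refit=True, this is the whole Pipeline refitted on all of X
best_pipeline = grid.best_estimator_
```

If a name in the grid doesn't exist, the search fails with an error, so printing available first saves a round of guessing.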

Pipeline ready for production

Once it’s trained, you save it:

python

import joblib
joblib.dump(pipeline, 'model.pkl')

And in production all you need is:

python

pipeline = joblib.load('model.pkl')
pipeline.predict(new_data)

No need to worry about:

scaling
encoding
step order
consistency between training and production

Everything lives inside the Pipeline. That’s exactly the point: what you deploy is not a model, it’s the entire workflow.
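A quick way to convince yourself is a save/load round trip. This sketch writes to a temporary directory and uses synthetic data (both assumptions for illustration), then checks that the restored Pipeline predicts exactly like the original.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    joblib.dump(pipeline, path)    # preprocessing and model in one file
    restored = joblib.load(path)   # what the production side would do

same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
print("Round trip OK:", same_predictions)
```

Two practical caveats: load the file with the same scikit-learn version that saved it, and only load files you trust, since joblib files are pickles under the hood.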

Common mistakes (and how to avoid them)

❌ Mistake 1: Calling `.fit()` outside the Pipeline

python

scaler.fit(X_train)

The moment you do this on your own, you’ve broken the whole idea of the Pipeline. Let the object handle it.

❌ Mistake 2: Mixing `.fit_transform()` with Pipelines

Never mix manual flows with Pipelines. Either everything is inside, or everything is outside. Anything else is just begging for leakage.

❌ Mistake 3: Processing columns by hand instead of using `ColumnTransformer`

If you split numerical and categorical features with manual slicing, you lose modularity and reproducibility. And one day you’ll pay for it.

❌ Mistake 4: Running GridSearch without a Pipeline

If preprocessing lives outside the object you’re validating, leakage is almost guaranteed. The rule is simple: anything learned from the data goes inside the Pipeline.

Final example: a complete, real-world Pipeline

python

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'gender'])
])

pipeline = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

This is production-grade. Modular, validatable, serializable, and leakage-free. Exactly what we’ve been working toward since Part 1.
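One detail in that example deserves a closer look: handle_unknown='ignore'. The sketch below uses a tiny made-up dataset and a hypothetical build_pipeline helper (both mine, for illustration) to show what happens when production data contains a category that never appeared in training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data: only two cities are ever seen
train = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [30000, 45000, 52000, 61000, 48000, 35000],
    "city": ["Madrid", "Lima", "Madrid", "Lima", "Madrid", "Lima"],
    "gender": ["F", "M", "F", "M", "M", "F"],
})
y_train = [0, 0, 1, 1, 1, 0]

# A new row with a city the encoder has never seen
new_data = pd.DataFrame({
    "age": [40], "salary": [50000], "city": ["Quito"], "gender": ["F"],
})


def build_pipeline(handle_unknown):
    """Hypothetical helper: same Pipeline as above, with a configurable encoder."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["age", "salary"]),
        ("cat", OneHotEncoder(handle_unknown=handle_unknown), ["city", "gender"]),
    ])
    return Pipeline([("prep", preprocess), ("model", LogisticRegression())])


# 'ignore': the unseen city is encoded as all zeros and prediction goes ahead
tolerant = build_pipeline("ignore").fit(train, y_train)
prediction = tolerant.predict(new_data)

# 'error' (the default): the same row raises at predict time
strict = build_pipeline("error").fit(train, y_train)
try:
    strict.predict(new_data)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print("Strict encoder failed:", err)
```

Whether ignoring unknown categories is the right call depends on your use case; sometimes failing loudly is exactly what you want.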

TL;DR

A Pipeline chains steps together and prevents leakage automatically.
Every step except the last must be a Transformer; the last one is the final estimator, normally a Predictor.
ColumnTransformer lets you apply different transformations to different columns.
Cross-validation and GridSearch only work properly when preprocessing lives inside the Pipeline.
A Pipeline is the object you save with joblib and deploy to production.
Once you really understand the Pipeline, you stop writing “ML scripts” and start building ML systems.
