
Data Leakage: The Silent Enemy That Destroys Your Machine Learning Models
Today we’re diving into one of the most common, most destructive, and least visible problems in real-world ML systems: data leakage.
A fraud prediction model with 99.8% accuracy in validation. The team celebrates. It’s deployed to production. Result: 52% accuracy.
What happened?
A column called transaction_reversed that only existed after the fraud had been detected. Pure data leakage.
If you’ve ever trained a model that seemed perfect in validation but collapsed in production, this phenomenon was likely the culprit.
Data leakage doesn’t throw warnings. It doesn’t break your code. It doesn’t generate exceptions.
But it destroys a model’s ability to generalize. And worse: it can go unnoticed even by experienced teams.
What exactly is data leakage?
Data leakage occurs when information from the future or from the validation set leaks into model training.
In other words, the model learns patterns that won’t exist in a real-world scenario.
That’s why, during validation:
- Accuracy seems sky-high
- AUC shoots up
- Loss drops dramatically
But in production:
- Performance collapses
- Predictions become inconsistent
- The model stops being useful
In short: the model «cheats» without us realizing it.

The 4 most common types of data leakage
1. Temporal leakage
The most dangerous and most frequent.
It occurs when the model uses information from the future to predict the past.
Concrete example:
Imagine predicting whether a customer will buy in March using the feature avg_spending_quarter_1. If that quarter includes March, you’re using the future to predict the present.
The model learns that «high Q1 spending = March purchase» because March is within Q1. In production, when you predict in March, you don’t have the complete quarter’s spending yet.
Another typical case:
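# Assumes df is sorted by date within each user; rolling(30) covers 30 rows (purchases), not 30 calendar days
# ❌ Incorrect: the window ends at the current row, so it includes the purchase being predicted
df['avg_purchase_last_30'] = df.groupby('user_id')['amount'].transform(lambda s: s.rolling(30).mean())
# ✅ Correct: shift by one row within each user so the window only covers earlier purchases
df['avg_purchase_last_30'] = df.groupby('user_id')['amount'].transform(lambda s: s.shift(1).rolling(30).mean())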
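The March example above can be sketched with an explicit point-in-time cutoff. This is a minimal illustration, not a production recipe: the column names (user_id, date, amount), the toy transactions, and the helper spending_before are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical transaction log; column names are assumptions for illustration.
tx = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2026-01-10", "2026-02-15", "2026-03-20", "2026-02-01", "2026-03-05"]
    ),
    "amount": [100.0, 50.0, 400.0, 30.0, 70.0],
})


def spending_before(transactions: pd.DataFrame, cutoff: str) -> pd.Series:
    """Average spend per user using only rows strictly before the cutoff."""
    past = transactions[transactions["date"] < pd.Timestamp(cutoff)]
    return past.groupby("user_id")["amount"].mean()


# Leaky: the whole quarter, including March, feeds the feature.
leaky = tx.groupby("user_id")["amount"].mean()

# Point-in-time: only what was known on the prediction date (March 1).
safe = spending_before(tx, "2026-03-01")
```

Here leaky averages all three months, while safe only sees January and February, which is what would actually be available when predicting March.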
2. Preprocessing leakage
Happens when we apply transformations before splitting the data.
Dangerous operations:
- Normalization
- Standardization
- Missing value imputation
- Feature selection
- Categorical variable encoding
If these operations are performed on the entire dataset, the model «sees» information from the validation set.
❌ Incorrect:
# Normalize the ENTIRE dataset first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
✅ Correct:
# Split first, transform after
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit_transform on train
X_test_scaled = scaler.transform(X_test) # only transform on test
The difference: in the correct approach, the scaler only «learns» from the training data.
3. Highly correlated variable leakage
Variables that indirectly contain the answer.
Red flags for suspicious variables:
- Names that include «final», «result», «current_status», «cancelled»
- Correlation > 0.95 with the target
- Values that only exist after the target event
- Columns that a human wouldn’t have available at prediction time
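The first two red flags lend themselves to a quick automated scan. The sketch below is one possible way to do it; the function name flag_suspicious_features, the synthetic columns, and the planted refund_issued leak are assumptions for illustration. The other two red flags still need a human who knows how the data is collected.

```python
import numpy as np
import pandas as pd


def flag_suspicious_features(df, target, corr_threshold=0.95,
                             keywords=("final", "result", "current_status", "cancelled")):
    """Return {column: reason} for features that match the red flags."""
    flags = {}
    features = df.drop(columns=[target])
    # Red flag: names hinting at an outcome that is only known afterwards.
    for col in features.columns:
        if any(word in col.lower() for word in keywords):
            flags[col] = "suspicious name"
    # Red flag: numeric features almost perfectly correlated with the target.
    corr = features.select_dtypes("number").corrwith(df[target]).abs()
    for col, value in corr.items():
        if value > corr_threshold:
            flags[col] = f"|corr| with target = {value:.2f}"
    return flags


# Synthetic data: 'refund_issued' is a planted copy of the label.
rng = np.random.default_rng(0)
label = rng.integers(0, 2, 200)
data = pd.DataFrame({
    "age": rng.integers(18, 80, 200),
    "current_status": rng.integers(0, 3, 200),
    "refund_issued": label,
    "is_fraud": label,
})
flags = flag_suspicious_features(data, target="is_fraud")
```

A flagged column is not proof of leakage, only a prompt to check when and how that value becomes available.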
Real example:
Predicting churn using a cancellation_date column. If that column has a value, the customer has already churned. It’s like predicting if it will rain by checking if the ground is wet.
4. Duplicate or related record leakage
Very common in:
- Medical data (same patient, multiple visits)
- Transactions (same user, multiple purchases)
- Time series
- Datasets with multiple rows per entity
If the same user appears in both train and test, the model memorizes patterns specific to that user instead of learning general patterns.
Solution:
# Split by entity, not by rows
from sklearn.model_selection import GroupShuffleSplit
splitter = GroupShuffleSplit(test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=df['user_id']))
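After a grouped split it is cheap to verify that no entity ended up on both sides. A minimal, self-contained sketch follows; the toy frame with 20 hypothetical users and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 20 hypothetical users with 5 rows each.
df = pd.DataFrame({
    "user_id": np.repeat(np.arange(20), 5),
    "amount": np.arange(100, dtype=float),
})
y = (df["amount"] % 2 == 0).astype(int)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, y, groups=df["user_id"]))

# Every user must live entirely in train or entirely in test.
train_users = set(df["user_id"].iloc[train_idx])
test_users = set(df["user_id"].iloc[test_idx])
shared_users = train_users & test_users
```

If shared_users is not empty, the split was done by rows somewhere along the way and should be revisited.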
How to detect data leakage
Warning signs
- Suspiciously high accuracy (>95% on complex real-world problems)
- One variable dominates the model (feature importance >80%)
- Performance collapses in production (>15% drop)
- Cross-validation gives highly inconsistent results
- The model works «too well» with very little data
Quick test
Remove the top 3 most important variables from your model.
If performance collapses completely (drops >50%), investigate those variables thoroughly. They likely contain leakage.
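The quick test can be sketched as a small ablation. Everything here is synthetic and assumed for illustration: the data is random noise with one deliberately planted leaky column, and the choice of RandomForestClassifier is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 6))
X[:, 0] = y + rng.normal(scale=0.01, size=n)  # planted leak: a near copy of the label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Baseline model with all features.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
baseline = model.score(X_test, y_test)

# Drop the 3 most important features and retrain.
top = np.argsort(model.feature_importances_)[::-1][:3]
keep = [i for i in range(X.shape[1]) if i not in top]
ablated_model = RandomForestClassifier(n_estimators=100, random_state=0)
ablated_model.fit(X_train[:, keep], y_train)
ablated = ablated_model.score(X_test[:, keep], y_test)

print(f"baseline={baseline:.3f}  without top 3={ablated:.3f}")
```

In this toy setup the remaining features are pure noise, so the score falls back to roughly chance level once the leaky column is removed. On real data a large drop is a signal to investigate, not a verdict.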
Key questions for each feature
- Would this variable exist at prediction time?
- Does it contain information from the future?
- Is it a direct proxy for the label?
- How was it calculated? Does it include data after the event?
How to avoid it: the golden rules
1. Split first, transform after
The correct order:
- Train/Test split
- Preprocessing fitted only on train
- Apply transformations to test using parameters from train
Use pipelines to automate it:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
# The pipeline guarantees the correct order
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
# fit only sees training data
pipeline.fit(X_train, y_train)
# predict uses parameters learned from train
y_pred = pipeline.predict(X_test)
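The same idea carries over to cross-validation: when the whole pipeline is passed to cross_val_score, every preprocessing step is refitted on the training part of each fold. A minimal sketch on synthetic data (the data, the imputation step, and the hyperparameters are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # sprinkle in some missing values

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Imputer and scaler are fitted inside each fold, on that fold's training rows only.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Imputing or scaling X before calling cross_val_score would reintroduce the preprocessing leakage described earlier, even though the code would look almost identical.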
2. Respect temporality
In time series:
❌ Never do this:
from sklearn.model_selection import KFold
# KFold mixes past and future
kfold = KFold(n_splits=5, shuffle=True)
✅ Do this:
from sklearn.model_selection import TimeSeriesSplit
# TimeSeriesSplit respects temporal order
tscv = TimeSeriesSplit(n_splits=5)
Rules for time series:
- Never mix past and future
- Use chronological splits
- Features with rolling windows: ensure they don’t include the point to predict
- Always sort by time before splitting
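A quick way to convince yourself that a splitter respects these rules is to inspect the indices it produces. The sketch below assumes the rows are already sorted by time and uses a toy array of 24 time steps.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 24 time steps, already sorted chronologically (index = time order).
X = np.arange(24).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # Every training row must come strictly before every test row.
    assert train_idx.max() < test_idx.min()
    print(f"train up to {train_idx.max()}, test {test_idx.min()}..{test_idx.max()}")
```

The same assertion fails for a shuffled KFold, which is exactly the past/future mixing the rules warn about.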
3. Review variables with business sense
Don’t blindly trust feature importance. Ask yourself:
- Would this variable exist at prediction time?
- How is it collected in reality?
- Is there a delay between the event and data availability?
Real e-commerce example:
Predicting if a user will buy tomorrow using items_in_cart_tomorrow. Obviously, tomorrow hasn’t arrived yet.
4. Avoid duplicates between train and test
Especially critical in:
- User data (split by user_id)
- Application logs (split by session_id)
- Financial transactions (split by customer_id)
- Clinical data (split by patient_id)
5. Validate with «out-of-time» data
If your model will be deployed in January 2027, validate it with December 2026 data that you didn’t use in training.
Simulate the real scenario: train on the past, predict the future.
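An out-of-time holdout can be as simple as a date filter. The helper name out_of_time_split, the event_date column, and the toy date range below are assumptions for illustration.

```python
import pandas as pd


def out_of_time_split(df, date_col, cutoff):
    """Rows before the cutoff go to train, rows on or after it to the holdout."""
    cutoff = pd.Timestamp(cutoff)
    df = df.sort_values(date_col)
    train = df[df[date_col] < cutoff]
    holdout = df[df[date_col] >= cutoff]
    return train, holdout


# Toy data: one row per day for the last quarter of 2026.
events = pd.DataFrame({
    "event_date": pd.date_range("2026-10-01", "2026-12-31", freq="D"),
})
events["value"] = range(len(events))

# Train on October and November, validate on December.
train, holdout = out_of_time_split(events, "event_date", "2026-12-01")
```

Any feature computed for the holdout rows should also be built only from data available before each row's own date.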
Real cases with solutions
Case 1: Churn prediction
❌ Incorrect:
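# Assumes pandas is imported as pd and last_activity is a datetime column
# The current time changes on every run, so the feature drifts with the execution date
df['days_since_last_activity'] = (pd.Timestamp.now() - df['last_activity']).dt.days
✅ Correct:
# Use a fixed cutoff date (the date when you'd make the prediction)
cutoff_date = pd.Timestamp('2026-01-01')
df['days_since_last_activity'] = (cutoff_date - df['last_activity']).dt.days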
Case 2: Missing value imputation
❌ Incorrect:
# Impute with global mean
df['age'] = df['age'].fillna(df['age'].mean())
X_train, X_test = train_test_split(df)
✅ Correct:
X_train, X_test = train_test_split(df)
# Calculate mean only from train
mean_age = X_train['age'].mean()
# Apply to both sets
X_train['age'] = X_train['age'].fillna(mean_age)
X_test['age'] = X_test['age'].fillna(mean_age)
Case 3: Normalization with StandardScaler
❌ Incorrect:
scaler = StandardScaler()
df[['feature1', 'feature2']] = scaler.fit_transform(df[['feature1', 'feature2']])
X_train, X_test = train_test_split(df)
✅ Correct:
X_train, X_test = train_test_split(df)
scaler = StandardScaler()
X_train[['feature1', 'feature2']] = scaler.fit_transform(X_train[['feature1', 'feature2']])
X_test[['feature1', 'feature2']] = scaler.transform(X_test[['feature1', 'feature2']])
Case 4: Feature engineering with aggregations
❌ Incorrect:
# Aggregate all user transactions (including future ones)
df['user_total_spent'] = df.groupby('user_id')['amount'].transform('sum')
✅ Correct:
# Only aggregate each user's transactions that happened before the current row
df = df.sort_values('transaction_date')
df['user_total_spent'] = df.groupby('user_id')['amount'].transform(lambda s: s.shift(1).cumsum())
Checklist before training your model
Use this checklist before each training session:
- Did I split the data BEFORE any transformation?
- Would all my features exist at prediction time?
- Did I use .fit_transform() only on train and .transform() on test?
- Did I verify there are no duplicates or shared entities between train/test?
- Are the metrics «too good to be true»?
- Did I review the top 5 most important features with business sense?
- Did I use TimeSeriesSplit if working with time series?
- Did I implement everything in a Pipeline to avoid manual errors?
- Did I validate with data completely outside the training period?
Tools that help prevent leakage
Pipelines and automation:
- sklearn.pipeline.Pipeline – guarantees correct operation order
- feature-engine – transformers that respect train/test split
- category_encoders – safe encoding of categorical variables
Temporal validation:
- sklearn.model_selection.TimeSeriesSplit – for time series
- sklearn.model_selection.GroupShuffleSplit – for grouped data
Detection:
- Analyze feature distribution between train/test
- Compare feature importance with business knowledge
- Monitor production vs validation performance
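One possible way to analyze feature distributions between train and test is to train a classifier to tell the two sets apart, an approach sometimes called adversarial validation. The sketch below is an assumption-laden illustration: the helper name train_test_separability, the synthetic data, and the choice of model are made up for the example. Note that it detects distribution differences, which can hint at a bad split, rather than leakage itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def train_test_separability(X_train, X_test, random_state=0):
    """Cross-validated AUC of a classifier asked to tell train rows from test rows."""
    X_all = np.vstack([X_train, X_test])
    is_test = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    return cross_val_score(clf, X_all, is_test, cv=5, scoring="roc_auc").mean()


rng = np.random.default_rng(0)

# Same distribution on both sides: the classifier should do no better than guessing.
same = train_test_separability(
    rng.normal(size=(400, 3)), rng.normal(size=(200, 3))
)

# Shifted test set: the classifier separates the two sets easily.
shifted = train_test_separability(
    rng.normal(size=(400, 3)), rng.normal(loc=2.0, size=(200, 3))
)

print(f"same distribution AUC={same:.2f}, shifted AUC={shifted:.2f}")
```

An AUC near 0.5 suggests train and test look alike; a value close to 1.0 means some feature distinguishes them and deserves a closer look.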
Conclusion
Data leakage is one of the most critical errors in Machine Learning because:
- It’s not obvious at first glance
- It produces misleading metrics
- It completely invalidates a model in production
- It’s extremely common, even in experienced teams
The good news: it can be avoided by following solid practices.
The golden rule: If your metrics seem too good, they probably are. Investigate.
A model without leakage is a model that can survive in production. And that’s the only metric that truly matters.
Additional resources:
- scikit-learn.org
- Paper: «Leakage in Data Mining: Formulation, Detection, and Avoidance» (Kaufman et al., 2012)
- feature-engine.readthedocs.io
Have you detected data leakage in your projects? What other types have you encountered? Share your experience in the comments.
«A model that generalizes is a model that lives.
And data leakage is its most silent enemy.»


