General Machine Learning Workflow: From Problem to Model

After understanding what Machine Learning is and its essential components, it’s time to see how a model is actually built, step by step.

An ML project isn’t magic: it’s an iterative process where every decision affects the final outcome. In this article, I’ll break down each phase with practical examples and tips to help you avoid common mistakes.

1. Define the Problem

Everything starts with a clear question. This phase determines the success or failure of the entire project.

Key questions:

  • What do we want to predict or classify? (e.g., customer churn probability, housing price)
  • What is the target variable? (churn yes/no, price in dollars)
  • What business impact does it have? (reduce costs by 15%, increase retention by 20%)
  • What constraints exist? (response time < 100ms, interpretable model for legal compliance)

Practical example:
Instead of "we want to predict sales," define: "we need to predict weekly demand for each product in each store 7 days in advance, with an error below 10%, to optimize inventory and reduce stockouts by 25%."

Warning sign: If you can’t explain the problem in 2-3 concrete sentences, you need more clarity before continuing.

[Figure: conceptual diagram of the Machine Learning lifecycle (problem, data, features, model, evaluation, deployment, and monitoring) with a continuous improvement loop]

2. Collect and Explore the Data

Before training anything, you need to deeply understand the data. This phase typically reveals more than 50% of the project’s challenges.

Essential exploratory data analysis (EDA):

Origin and quality:

  • Where does it come from? (databases, APIs, sensors, forms)
  • How up-to-date is it?
  • Are there collection biases? (e.g., only data from users who accepted cookies)

Distributions and patterns:

  • Histograms of numerical variables
  • Category frequencies
  • Correlations between variables
  • Temporal trends

Common problems:

  • Missing values (random or systematic?)
  • Outliers (errors or legitimate cases?)
  • Class imbalance (95% negative, 5% positive)
  • Information leakage (variables that won’t be available in production)

Useful tools: ydata-profiling (formerly pandas-profiling), sweetviz, matplotlib, seaborn

3. Prepare and Clean the Data

Here you transform the dataset into something usable. Without clean data, there’s no reliable model.

Main tasks:

Missing values:

  • Deletion (if < 5% of rows affected)
  • Simple imputation (mean, median, mode)
  • Advanced imputation (KNN, predictive models)
  • Create a "missing value" indicator as an additional feature
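
Illustrative sketch of median imputation plus a missing-value indicator with scikit-learn (the column name is made up):

python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan, 61_000]})

# Keep the indicator first, so the fact that the value was missing isn't lost
df["income_missing"] = df["income"].isna().astype(int)

# Median imputation (robust to skewed distributions)
imputer = SimpleImputer(strategy="median")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()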

Duplicates and outliers:

  • Identify exact or near-duplicate records
  • Decide if outliers are errors (remove) or valid extreme cases (keep)
  • Techniques: IQR, Z-score, Isolation Forest
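
As an illustration, the IQR rule and Isolation Forest can be sketched like this (thresholds are the usual conventions, not hard rules):

python
import numpy as np
from sklearn.ensemble import IsolationForest

values = np.array([10, 12, 11, 13, 12, 300])  # 300 looks suspicious

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Isolation Forest: model-based detection, useful in higher dimensions
iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(values.reshape(-1, 1))  # -1 = outlier, 1 = inlier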

Strategic split:

  • Train (60-70%): to train the model
  • Validation (15-20%): to tune hyperparameters
  • Test (15-20%): for final evaluation, never touched during development

Important: The split must respect the temporal nature of the data (don’t mix future with past) and maintain similar distributions.
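
A common way to obtain the three sets is two consecutive splits, assuming a feature matrix X and target y (proportions and random_state are illustrative; for time series, use a chronological cut instead of shuffling):

python
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42, stratify=y_tmp
)
# stratify keeps class proportions similar across the three sets (classification only)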

4. Feature Engineering

The goal is to transform raw data into variables the algorithm can exploit. This is often the phase with the greatest impact on final performance.

4.1 Selecting Relevant Variables

Feature selection is crucial: too many irrelevant variables add noise and increase overfitting risk, while too few may lose valuable information. The goal is to find the optimal subset that maximizes predictive performance.

Statistical Techniques

These techniques evaluate the relationship between each feature and the target variable independently.

Correlation (numerical variables)

Measures the strength and direction of the linear relationship between two variables.

  • Pearson Correlation: Assumes linear relationship and normal distributions. Range [-1, 1] where -1 is perfect negative correlation, 0 is independence, and 1 is perfect positive correlation.
    • Example: Correlation between house size and price = 0.75 (strong positive relationship)
    • Limitation: Doesn’t detect non-linear relationships. If the relationship is quadratic (Y = X²), Pearson may give correlation close to 0
  • Spearman Correlation: Based on ranks, doesn’t assume linearity or normality. More robust to outliers.
    • When to use: Ordinal variables, skewed distributions, or monotonic non-linear relationships
    • Example: Relationship between satisfaction ranking (1-5 stars) and repurchase probability

Rule of thumb: Remove features with absolute correlation < 0.1 with the target. Also remove pairs of features whose mutual correlation is > 0.9 (multicollinearity), keeping the one most correlated with the target.
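
A quick way to compute both coefficients with scipy (the numbers are illustrative):

python
import numpy as np
from scipy.stats import pearsonr, spearmanr

size = np.array([50, 80, 120, 150, 200])      # m²
price = np.array([100, 160, 230, 310, 420])   # k$

r_pearson, _ = pearsonr(size, price)    # close to 1: strong linear relationship
r_spearman, _ = spearmanr(size, price)  # 1.0: perfectly monotonic relationship
print(round(r_pearson, 3), round(r_spearman, 3))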

ANOVA (Analysis of Variance)

Evaluates whether the means of a numerical variable differ significantly between groups of a categorical variable.

  • How it works: Compares variance between groups vs within groups. If variance between groups is much larger, the categorical variable is relevant.
  • F-statistic: Ratio of variances. High values indicate greater difference between groups.
  • p-value: If p < 0.05, we reject the null hypothesis (means are equal) → the feature is relevant

Practical example:

text
Categorical variable: Customer type (Basic, Premium, VIP)
Numerical variable: Average monthly spending

Basic: mean = $50
Premium: mean = $150
VIP: mean = $500

ANOVA → High F-statistic, p-value < 0.001
Conclusion: Customer type is highly relevant for predicting spending
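
With scipy, roughly the same comparison looks like this (the spending values are made up):

python
from scipy.stats import f_oneway

basic   = [45, 52, 48, 55, 50]
premium = [140, 155, 160, 145, 150]
vip     = [480, 510, 495, 520, 500]

f_stat, p_value = f_oneway(basic, premium, vip)
print(f"F = {f_stat:.1f}, p = {p_value:.5f}")
# Very high F and p << 0.05 → customer type is relevant for predicting spending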

Chi-square (χ²)

Evaluates independence between two categorical variables using a contingency table.

  • How it works: Compares observed vs expected frequencies under independence. Large differences indicate dependence.
  • p-value: If p < 0.05, the variables are related

Practical example:

text
Categorical variable 1: Gender (M/F)
Categorical variable 2: Churn (Yes/No)

Contingency table:
           Churn=Yes  Churn=No
Gender=M      120        380
Gender=F      200        300

Chi-square → p-value < 0.001
Conclusion: Gender and churn are related
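
The same contingency table can be tested with scipy:

python
import numpy as np
from scipy.stats import chi2_contingency

#                  Churn=Yes  Churn=No
table = np.array([[120, 380],   # Gender=M
                  [200, 300]])  # Gender=F

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.6f}")
# p < 0.05 → gender and churn are not independent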

Mutual Information

Measures how much information one variable provides about another, capturing both linear and non-linear relationships.

  • Main advantage: Detects any type of dependency, not just linear
  • Range: [0, ∞) where 0 = total independence, high values = strong dependence
  • Normalization: Can be normalized to [0, 1] by dividing by entropy

Example where it outperforms correlation:

text
X = [-2, -1, 0, 1, 2]
Y = [4, 1, 0, 1, 4]  (quadratic relationship: Y = X²)

Pearson Correlation: ~0 (doesn't detect the relationship)
Mutual Information: high (detects the perfect dependency)
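
This contrast is easy to reproduce with scikit-learn (exact MI values depend on the estimator and sample size):

python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

X = np.linspace(-2, 2, 200).reshape(-1, 1)
y = X.ravel() ** 2  # perfect quadratic dependency

r, _ = pearsonr(X.ravel(), y)
mi = mutual_info_regression(X, y, random_state=42)[0]

print(f"Pearson: {r:.3f}")             # ≈ 0, misses the relationship
print(f"Mutual information: {mi:.3f}")  # clearly > 0, detects the dependency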

When to use each technique:

  • Numerical → Numerical (linear): Pearson
  • Numerical → Numerical (non-linear): Spearman, Mutual Information
  • Categorical → Numerical: ANOVA, Mutual Information
  • Categorical → Categorical: Chi-square, Mutual Information

Model-Based Techniques

These techniques evaluate feature importance in the context of a predictive model, considering interactions between variables.

Feature Importance with Trees

Tree-based models (Random Forest, XGBoost, LightGBM) calculate the importance of each feature during training.

Two types of importance:

  1. Impurity-based (Gini/Entropy): Measures how much each feature reduces impurity when making splits
    • Advantage: Fast to calculate
    • Disadvantage: Biased toward features with high cardinality (many unique values)
  2. Permutation importance: Measures how much the model worsens if we randomize the values of a feature
    • Advantage: More reliable, unbiased
    • Disadvantage: More computationally expensive

Practical example with Random Forest:

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Train model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

# Impurity-based importance
importances_gini = rf.feature_importances_

# Permutation importance (more reliable)
perm_importance = permutation_importance(rf, X_val, y_val, n_repeats=10)
importances_perm = perm_importance.importances_mean

# Example result:
# age: 0.25
# income: 0.20
# purchase_history: 0.18
# favorite_color: 0.02  ← candidate for removal

Interpretation: Features with importance < 0.01 are usually noise. Remove them and retrain to validate that performance doesn’t worsen.

L1 Regularization (Lasso)

L1 regularization adds a penalty proportional to the absolute value of coefficients, forcing some to exactly zero.

How it works:

text
Loss function = Prediction error + λ × Σ|coefficients|

λ (lambda): controls regularization strength
- λ = 0: no regularization (normal linear regression)
- High λ: more coefficients become 0 (more features eliminated)

Advantages:

  • Automatic feature selection (coefficients = 0 → irrelevant feature)
  • Produces sparse models (few active features)
  • Useful when there are many correlated features

Practical process:

  1. Train Lasso with different λ values (using cross-validation)
  2. Select λ that maximizes validation performance
  3. Features with coefficient = 0 are eliminated
  4. Retrain final model only with selected features

Example:

python
import numpy as np
from sklearn.linear_model import LassoCV

# LassoCV automatically tests multiple alpha (λ) values
lasso = LassoCV(cv=5, alphas=np.logspace(-4, 1, 50))
lasso.fit(X_train, y_train)

# Selected features (coefficient != 0)
selected_features = X_train.columns[lasso.coef_ != 0]
print(f"Selected features: {len(selected_features)} out of {X_train.shape[1]}")

# Example result:
# From 50 original features → 12 selected
# Performance: R² = 0.85 (vs 0.87 with all features)
# Gain: simpler model, less overfitting, more interpretable

RFE (Recursive Feature Elimination)

Iterative selection that trains the model repeatedly, eliminating the least important features at each step.

Algorithm:

  1. Train model with all features
  2. Calculate importance/coefficients of each feature
  3. Eliminate the least important feature(s)
  4. Repeat until reaching the desired number of features

Advantages:

  • Considers interactions between features (unlike univariate methods)
  • Works with any model that provides importance or coefficients
  • Allows specifying exactly how many features you want

Disadvantages:

  • Computationally expensive (trains the model many times)
  • Can be unstable with small datasets

Practical example:

python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# We want to select the 10 best features
estimator = LogisticRegression(max_iter=1000)  # higher max_iter helps convergence
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe.fit(X_train, y_train)

# Selected features
selected_features = X_train.columns[rfe.support_]

# Feature ranking (1 = best)
feature_ranking = pd.DataFrame({
    'feature': X_train.columns,
    'ranking': rfe.ranking_
}).sort_values('ranking')

# Example result:
# Ranking 1: age, income, purchase_history (selected)
# Ranking 15: favorite_color, zodiac_sign (eliminated first)

Advanced variant – RFECV: Uses cross-validation to automatically determine the optimal number of features.

python
from sklearn.feature_selection import RFECV

rfecv = RFECV(estimator, step=1, cv=5, scoring='accuracy')
rfecv.fit(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
# Result: 15 features (maximizes accuracy in cross-validation)

Practical Selection Strategy

In real projects, combine multiple techniques:

Phase 1 – Quick filtering (statistical techniques):

  • Remove features with absolute correlation < 0.05 with the target
  • Remove features with p-value > 0.05 in ANOVA/Chi-square
  • Reduce from 100 features to ~40

Phase 2 – Model-based selection:

  • Train Random Forest, observe importance
  • Apply Lasso for automatic selection
  • Reduce from 40 features to ~20

Phase 3 – Refinement with RFE:

  • Use RFECV to find optimal number
  • Validate that performance is maintained or improves
  • Final result: 12-15 key features

Phase 4 – Business validation:

  • Review selected features with domain experts
  • Remove features that won’t be available in production
  • Consider cost of obtaining each feature

Success signal: Model with 20% of original features that maintains 95%+ of the performance of the model with all features.
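
A minimal sketch of how these phases can be chained in code, assuming a classification problem with a pandas DataFrame X_train and target y_train (all thresholds are illustrative):

python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Phase 1 – quick statistical filter: drop near-uninformative features
mi = pd.Series(mutual_info_classif(X_train, y_train), index=X_train.columns)
phase1 = mi[mi > 0.01].index

# Phase 2 – model-based ranking with a Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train[phase1], y_train)
imp = pd.Series(rf.feature_importances_, index=phase1)
phase2 = imp[imp > 0.01].index

# Phase 3 – RFECV finds the optimal subset automatically
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5, scoring="accuracy")
rfecv.fit(X_train[phase2], y_train)
final_features = phase2[rfecv.support_]
print(f"{len(final_features)} features kept out of {X_train.shape[1]}")

# Phase 4 – review final_features with domain experts before shipping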

4.2 Variable Transformation

Category encoding:

  • One-Hot Encoding: for unordered categories (color: red, blue, green)
  • Ordinal (Label) Encoding: for ordered categories (size: S, M, L, XL), mapped to integers that preserve the order
  • Target Encoding: uses the mean of the target per category (useful with high cardinality)
  • Frequency Encoding: replaces with frequency of occurrence
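
A compact sketch of the first three encodings with pandas (column names are made up; in a real project, fit the target encoding on the training set only to avoid leakage):

python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size": ["S", "M", "L", "M"],
    "price": [10, 20, 30, 25],
})

# One-Hot Encoding for unordered categories
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding for ordered categories
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
df["size_encoded"] = df["size"].map(size_order)

# Target Encoding: mean of the target per category
df["color_target_enc"] = df.groupby("color")["price"].transform("mean")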

Numerical variable scaling:

  • StandardScaler: mean 0, standard deviation 1 (works best when the distribution is roughly Gaussian)
  • MinMaxScaler: range [0,1] (sensitive to outliers)
  • RobustScaler: uses median and quartiles (robust to outliers)
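
A quick comparison of the three scalers on data with one outlier shows why RobustScaler is often preferred in that case:

python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X).ravel()
    print(type(scaler).__name__, np.round(scaled, 2))
# MinMaxScaler squashes the normal values near 0; RobustScaler keeps them spread out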

Mathematical transformations:

  • Logarithmic: for variables with skewed distribution (income, prices)
  • Polynomial: to capture non-linear relationships
  • Box-Cox: automatic normalization of distributions

Dimensionality reduction:

  • PCA: linear projection that preserves as much variance as possible
  • t-SNE/UMAP: for visualization
  • Embeddings: dense representations of high-dimensional categories
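
A short sketch of a log transform, Box-Cox, and a PCA that keeps 95% of the variance (assuming a numeric feature matrix X; the income values are made up):

python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PowerTransformer

# Logarithmic transform for skewed, strictly positive values (e.g. income)
income = np.array([20_000, 35_000, 40_000, 250_000, 1_000_000], dtype=float)
income_log = np.log1p(income)

# Box-Cox requires positive values; 'yeo-johnson' also handles zeros and negatives
pt = PowerTransformer(method="box-cox")
income_boxcox = pt.fit_transform(income.reshape(-1, 1))

# PCA keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)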

4.3 Creating New Features

Practical examples:

Ratios and proportions:

  • Debt-to-income ratio in credit scoring
  • Conversion rate by marketing channel
  • Discount percentage on original price

Temporal aggregations:

  • Sales last 7/30/90 days
  • Moving average of temperature
  • Trend (increasing/decreasing) in recent weeks

Interactions between variables:

  • Age × Income
  • Area × Location (for housing prices)
  • Time of day × Day of week (for transportation demand)

Derived signals:

  • Extraction of day of week, month, quarter from dates
  • Geographic distance between two points
  • Time since last event (days since last purchase)
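
Sketch of a few of these derived features with pandas (the DataFrame df and its column names are hypothetical):

python
import pandas as pd

df["debt_to_income"] = df["total_debt"] / df["annual_income"]
df["sales_30d_avg"] = df["daily_sales"].rolling(window=30).mean()
df["age_x_income"] = df["age"] * df["annual_income"]

df["purchase_date"] = pd.to_datetime(df["purchase_date"])
df["day_of_week"] = df["purchase_date"].dt.dayofweek
df["days_since_last_purchase"] = (
    df["purchase_date"].max() - df["purchase_date"]
).dt.days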

Golden rule: Each feature should have a business hypothesis behind it. Don’t create features "just in case."

5. Model Selection

There’s no «universal best model.» The choice depends on the problem and context.

By task type:

Classification:

  • Logistic Regression (interpretable baseline)
  • Random Forest (robust, doesn’t require much tuning)
  • XGBoost/LightGBM (high performance, competitions)
  • Neural networks (massive data, complex patterns)

Regression:

  • Linear Regression (baseline)
  • Ridge/Lasso (with regularization)
  • Gradient Boosting (XGBoost, LightGBM)
  • SVR (small data, non-linear)

Time series:

  • ARIMA/SARIMA (classical models)
  • Prophet (trends and seasonality)
  • LSTM/GRU (complex patterns)

Clustering:

  • K-Means (fast, spherical)
  • DBSCAN (arbitrary shapes, detects outliers)
  • Hierarchical (hierarchical exploration)

Decision factors:

  • Interpretability: ✓ simple models (linear regression, trees); ✗ complex models (deep learning, ensembles)
  • Data volume: simple models work well with < 10K records; complex models typically need > 100K
  • Training time: seconds/minutes for simple models; hours/days for complex ones
  • Production latency: simple models easily stay < 10ms; complex models vary

Practical advice: Always start with a simple model as baseline. If Logistic Regression gives 85% accuracy, you know any complex model must exceed that figure to justify its use.
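
For example, a baseline can be compared against a stronger candidate in a few lines, assuming a prepared X_train and y_train:

python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

baseline = LogisticRegression(max_iter=1000)
challenger = GradientBoostingClassifier(random_state=42)

for name, model in [("baseline", baseline), ("challenger", challenger)]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
# Adopt the complex model only if the gain justifies the extra cost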

6. Training

Here the model learns patterns from the data. The goal is to generalize, not memorize.

Training process:

  1. Hyperparameter tuning:
    • Learning rate, tree depth, number of neurons
    • Use Grid Search or Random Search to explore combinations
    • Bayesian Optimization for large spaces
  2. Cross-validation:
    • K-Fold (typically k=5 or k=10)
    • Stratified K-Fold (maintains class proportion)
    • Time Series Split (respects temporal order)
  3. Regularization:
    • L1 (Lasso): eliminates features
    • L2 (Ridge): reduces weight magnitude
    • Dropout (neural networks): randomly deactivates neurons
    • Early stopping: stops training when validation worsens
  4. Overfitting control:
    • Monitor train vs validation loss
    • Use more training data
    • Simplify the model
    • Increase regularization

Overfitting signal: Train accuracy 98%, validation accuracy 75% → the model memorizes instead of learning.
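
Steps 1 and 2 combined in a minimal sketch: Grid Search over a deliberately tiny, illustrative grid with stratified 5-fold cross-validation (X_train and y_train are assumed to exist):

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="f1",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)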

7. Evaluation

Each task requires specific metrics. Choosing the wrong metric can lead to optimizing the wrong model.

Binary classification:

  • Accuracy: % of correct predictions (useful only if classes are balanced)
  • Precision: of those I predicted positive, how many actually were? (minimizes false positives)
  • Recall: of the actual positives, how many did I detect? (minimizes false negatives)
  • F1-Score: harmonic mean of precision and recall
  • AUC-ROC: discrimination capacity across all thresholds

Example: In fraud detection (1% positive cases), a model that always predicts «no fraud» has 99% accuracy but is useless. Here recall matters more.
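
With scikit-learn, these metrics take a few lines (y_val, y_pred, and y_proba are assumed: true labels, hard predictions, and predicted probabilities of the positive class):

python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

print("accuracy :", accuracy_score(y_val, y_pred))
print("precision:", precision_score(y_val, y_pred))
print("recall   :", recall_score(y_val, y_pred))
print("f1       :", f1_score(y_val, y_pred))
print("auc-roc  :", roc_auc_score(y_val, y_proba))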

Regression:

  • MAE (Mean Absolute Error): average error in original units
  • RMSE (Root Mean Squared Error): penalizes large errors more
  • MAPE: percentage error (useful for comparing across scales)
  • R²: proportion of variance explained

Confusion matrix: Visualizes where the model fails (false positives vs false negatives).

Learning curves: Show if you need more data, more features, or a more complex model.

8. Deployment

A model that doesn’t reach production doesn’t generate impact. Deployment transforms experimental code into a robust service.

Deployment options:

REST APIs:

  • Flask/FastAPI for Python
  • Docker containers for portability
  • Horizontal scaling with Kubernetes
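
A minimal FastAPI sketch of a prediction endpoint (the model file, feature names, and the assumption that the model exposes predict_proba are all hypothetical):

python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # hypothetical serialized model

class Features(BaseModel):
    age: float
    income: float
    purchase_history: float

@app.post("/predict")
def predict(features: Features):
    X = [[features.age, features.income, features.purchase_history]]
    proba = model.predict_proba(X)[0, 1]
    return {"churn_probability": float(proba)}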

Batch processing:

  • Scheduled predictions (daily, weekly)
  • Useful when real-time response isn’t needed
  • Tools: Airflow, Luigi, Prefect

Edge deployment:

  • Models on devices (mobile, IoT)
  • Requires size optimization (quantization, pruning)
  • TensorFlow Lite, ONNX

Critical considerations:

  • Latency: How long can the prediction take? (< 100ms for web, < 1s for batch)
  • Versioning: Maintain multiple model versions for rollback
  • Monitoring: Logs of predictions, response times, errors
  • A/B testing: Compare new vs old model with real traffic

Infrastructure: AWS SageMaker, Google Vertex AI, Azure ML, or open-source solutions like MLflow, Kubeflow.

9. Monitoring and Maintenance

Models age. The world changes, and your model must adapt.

Types of degradation:

Data drift:

  • Input distributions change
  • Example: pandemic alters purchasing patterns
  • Detection: compare train vs production distributions (KS test, PSI)
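
Sketch of a drift check on a single feature with the Kolmogorov-Smirnov test (train_values and prod_values are assumed to be 1-D arrays of the same feature):

python
from scipy.stats import ks_2samp

statistic, p_value = ks_2samp(train_values, prod_values)
if p_value < 0.01:
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.4f})")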

Concept drift:

  • The relationship between X and Y changes
  • Example: new competitors change customer behavior
  • Detection: monitor performance metrics over time

Behavioral changes:

  • Users learn to «game» the model
  • Example: spammers adapt messages to avoid filters

Maintenance strategies:

  1. Continuous monitoring:
    • Dashboard with key metrics (accuracy, latency, volume)
    • Automatic alerts when metrics fall below thresholds
  2. Periodic retraining:
    • Scheduled (monthly, quarterly)
    • Triggered (when performance drops X%)
    • With recent data to capture new patterns
  3. Input data validation:
    • Detect values outside expected range
    • Reject predictions when confidence is low
  4. Feedback loop:
    • Collect actual labels from predictions
    • Use to evaluate real performance
    • Incorporate into next training cycle

Tools: Evidently AI, Whylabs, Fiddler, or custom solutions with Prometheus + Grafana.

Conclusion

Building a model is easy. Building a Machine Learning system that works in the real world is a completely different story.

Understanding the full workflow — problem definition, data preparation, feature engineering, model selection, training, evaluation, and monitoring — is what separates a junior technician from an ML engineer capable of delivering end‑to‑end value.

Mastering this process allows you not only to train models, but to build real, robust, and business‑aligned solutions.

Intro to the Next Post

In the next article of this series, I will dive into the complete lifecycle of a model in production (basic MLOps) and how to keep it stable, reliable, and performant over time.