Ridge and Lasso Explained: How Regularization Tames Overfitting in Machine Learning (Part 6)


Regression (Part 6): Ridge & Lasso — Taming Overfitting

When your model has too much freedom, the solution isn’t to take it away — it’s to charge for using it.

In Part 5 we saw something powerful and something dangerous. The powerful part: by adding extra columns — x², x³, x⁴… — our model can learn curves with ease. The dangerous part: the more columns we add, the more the model can twist itself to fit every training point, including the noise.

The result is familiar: a degree‑10 polynomial that passes through almost every training point… and then produces absurd predictions in between. Huge coefficients, wild oscillations, near‑zero training error, terrible validation error.

Our only defense so far was caution: keep the degree low, use cross‑validation, start simple. That’s like saying “drive slowly.” It works, but it limits how far you can go.

Regularization is the seat belt that lets the model run fast without losing control.

 

[Figure: Comparison of unregularized, Ridge, and Lasso regression, showing how each method affects curve smoothness and coefficient size]

1. The real problem: coefficients that explode

When a model tries to chase noise, it needs to make sharp bends. Sharp bends require huge coefficients.

Typical symptoms:

  • coefficients with massive magnitudes
  • alternating signs
  • violent oscillations between data points
  • training error drops
  • validation error skyrockets

If we control the coefficients, we control overfitting.
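To see the symptom in numbers, here is a minimal sketch in Python with NumPy. The toy data (12 noisy samples of a sine curve) and the helper names `design_matrix` and `fit_least_squares` are made up for illustration; they are not taken from Part 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 12 noisy samples of a gentle curve on [-1, 1].
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(2.0 * x) + rng.normal(scale=0.2, size=x.size)

def design_matrix(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(x, y, degree):
    """Ordinary least squares on polynomial features (no penalty)."""
    beta, *_ = np.linalg.lstsq(design_matrix(x, degree), y, rcond=None)
    return beta

for degree in (2, 10):
    beta = fit_least_squares(x, y, degree)
    train_mse = np.mean((design_matrix(x, degree) @ beta - y) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, "
          f"largest |coefficient| = {np.abs(beta).max():.1f}")
```

On data like this, the degree‑10 fit should report a lower training error and far larger coefficients than the degree‑2 fit, which is the pattern in the list above.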

2. The key idea: penalize complexity

Until now, the model had one goal:

Minimize the error (MSE)

Regularization adds a second goal:

Keep coefficients small

The new cost function is:

$$J(\beta) = \text{error} + \lambda \cdot \text{penalty}$$
  • The first term wants to fit the data.
  • The second term wants to keep the model humble.

λ (lambda) is the dial that controls how strict we are.

Metaphor:

“I won’t forbid you from using x² or x³… but I’ll charge you for exaggerating.”
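In code, the new cost is one extra term. The sketch below is a hypothetical helper (`regularized_cost` is not a library function). It assumes the first column of X is the constant 1, so the intercept can be left out of the penalty, a common convention that the formula above does not spell out. The sum of squares used as the penalty is just one possible choice.

```python
import numpy as np

def sum_of_squares(beta):
    """One possible penalty: squared size of every coefficient except the intercept."""
    return float(np.sum(beta[1:] ** 2))

def regularized_cost(X, y, beta, lam, penalty=sum_of_squares):
    """J(beta) = error + lam * penalty, with MSE as the error term."""
    mse = float(np.mean((X @ beta - y) ** 2))
    return mse + lam * penalty(beta)

# Tiny made-up data set that the line y = 1 + 2x fits exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # columns: 1, x
y = np.array([1.0, 3.0, 5.0])
beta = np.array([1.0, 2.0])

print(regularized_cost(X, y, beta, lam=0.0))  # 0.0: perfect fit, nothing charged
print(regularized_cost(X, y, beta, lam=0.5))  # 2.0: same fit, but 0.5 * 2**2 is charged
```

The fit is identical in both calls. Only the price of the slope changes, and that price is what λ controls.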

3. Ridge Regression (L2): smooth without removing

Ridge penalizes the sum of squared coefficients:

$$\lambda \sum_{j} \beta_j^2$$

Effect:

  • big coefficients pay a heavy price
  • small coefficients pay a small price
  • all coefficients shrink toward zero, but none are set exactly to zero

Result:

  • smoother curves
  • no wild oscillations
  • all features remain in the model

Metaphor:

“Ridge is adding shock absorbers: the car stays powerful, but stops bouncing on every bump.”
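For squared error, the Ridge cost has a closed‑form minimizer, so a small NumPy sketch is enough to watch the shrinking happen. Everything named here (`fit_ridge`, the toy data, the λ values) is an illustrative assumption, and the intercept is left unpenalized.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize MSE + lam * sum(beta_j^2) over j >= 1 (column 0 of X is the intercept)."""
    n, p = X.shape
    penalized = np.eye(p)
    penalized[0, 0] = 0.0  # assumption: do not charge for the intercept
    # Setting the gradient to zero gives (X^T X / n + lam * P) beta = X^T y / n.
    return np.linalg.solve(X.T @ X / n + lam * penalized, X.T @ y / n)

# Hypothetical toy data: 12 noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(2.0 * x) + rng.normal(scale=0.2, size=x.size)
X = np.vander(x, 11, increasing=True)  # degree-10 polynomial features

for lam in (1e-6, 1e-3, 1e-1):
    beta = fit_ridge(X, y, lam)
    print(f"lambda = {lam:g}: largest |coefficient| = {np.abs(beta[1:]).max():.2f}, "
          f"exact zeros = {int(np.sum(beta[1:] == 0.0))}")
```

As λ grows, the coefficients get smaller, but they generally stay nonzero: the shock absorbers soften the ride without removing any parts.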

4. Lasso Regression (L1): keep what matters, drop the rest

Lasso penalizes the sum of absolute values:

$$\lambda \sum_{j} |\beta_j|$$

Effect:

  • constant pressure toward zero
  • if a coefficient contributes little, it becomes exactly zero
  • the model performs automatic feature selection

Result:

  • smooth curves
  • simpler equations
  • fewer terms, more interpretability

Metaphor:

“Lasso is cleaning a closet: what you don’t use, goes out.”
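The "exactly zero" behaviour comes from a step called soft‑thresholding. The sketch below uses coordinate descent, a standard way to solve Lasso, rather than anything specific to this series. The names `soft_threshold` and `fit_lasso`, the toy data, and the value of λ are all illustrative assumptions, and the intercept (column 0) is left unpenalized.

```python
import numpy as np

def soft_threshold(rho, threshold):
    """Move rho toward zero by threshold; return exactly 0.0 if it would cross zero."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0

def fit_lasso(X, y, lam, n_sweeps=500):
    """Coordinate descent for MSE + lam * sum(|beta_j|) over j >= 1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n  # mean squared value of each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            if j == 0:
                beta[j] = rho / col_scale[j]  # intercept: no penalty
            else:
                beta[j] = soft_threshold(rho, lam / 2.0) / col_scale[j]
    return beta

# Hypothetical data: only x and x^2 truly matter, but we offer x ... x^6.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)
X = np.vander(x, 7, increasing=True)  # columns 1, x, ..., x^6

beta = fit_lasso(X, y, lam=0.05)
print(np.round(beta, 3))
print("terms set exactly to zero:", int(np.sum(beta[1:] == 0.0)))
```

The threshold is λ/2 because the error term is the mean squared error, whose derivative carries a factor of 2. Which terms survive depends on the data and on λ, so treat the printed coefficients as an illustration, not a guarantee.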

5. Ridge vs Lasso (quick summary)

| Method | What it does | Best use case |
| --- | --- | --- |
| Ridge (L2) | Shrinks coefficients | When most features matter |
| Lasso (L1) | Eliminates coefficients | When only a few features matter |
| Elastic Net | Mix of L1 + L2 | When features are correlated |
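
The table mentions Elastic Net without a formula. One common way to write its cost, reusing λ and adding a mixing weight α between 0 and 1 (α is a symbol introduced here, not used elsewhere in this article), is:

```latex
J(\beta) = \text{error} + \lambda \left( \alpha \sum_{j} |\beta_j| + (1 - \alpha) \sum_{j} \beta_j^2 \right)
```

With α = 1 this reduces to the Lasso penalty, and with α = 0 to the Ridge penalty. Libraries may scale the two terms by different constants, so check the documentation before comparing λ values across tools.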

6. The role of λ: the discipline dial

  • λ = 0 → no regularization → plain least squares, free to overfit
  • small λ → slight smoothing
  • well‑chosen λ → best trade‑off between fitting the data and keeping coefficients small
  • large λ → overly rigid model → underfitting

We choose λ using cross‑validation, just like we choose polynomial degree.
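
Here is a minimal sketch of that selection loop, using k‑fold cross‑validation with a closed‑form Ridge fit in NumPy. The helper names (`fit_ridge`, `cv_error`), the toy data, and the candidate grid are assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form Ridge: MSE + lam * sum(beta_j^2), intercept (column 0) unpenalized."""
    n, p = X.shape
    penalized = np.eye(p)
    penalized[0, 0] = 0.0
    return np.linalg.solve(X.T @ X / n + lam * penalized, X.T @ y / n)

def cv_error(X, y, lam, n_folds=5, seed=0):
    """Average validation MSE over n_folds random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        beta = fit_ridge(X[train], y[train], lam)
        errors.append(np.mean((X[held_out] @ beta - y[held_out]) ** 2))
    return float(np.mean(errors))

# Hypothetical toy data: 40 noisy samples, degree-10 polynomial features.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(2.0 * x) + rng.normal(scale=0.2, size=x.size)
X = np.vander(x, 11, increasing=True)

candidates = [1e-6, 1e-4, 1e-2, 1e-1, 1.0, 10.0]
scores = {lam: cv_error(X, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

for lam, score in scores.items():
    print(f"lambda = {lam:g}: validation MSE = {score:.4f}")
print("selected lambda:", best_lam)
```

The same seed is reused for every λ so that all candidates are scored on identical folds; otherwise differences in the split would blur the comparison.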

7. How training works (light version)

The gradient descent loop stays the same. Only the gradient gains one extra term:

Ridge:

$$\frac{\partial J}{\partial \beta_j} = \text{usual gradient} + 2\lambda\beta_j$$

→ gentle pull toward zero.

Lasso:

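$$\frac{\partial J}{\partial \beta_j} = \text{usual gradient} + \lambda \cdot \operatorname{sign}(\beta_j)$$

→ constant push toward zero, no matter how small the coefficient already is. (The absolute value has no derivative at βj = 0, so sign(0) is taken as 0 there. In practice, libraries such as scikit‑learn fit Lasso with coordinate descent, which lands coefficients exactly on zero.)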

That’s all you need for intuition.
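
For readers who want to see the two extra terms side by side, here is a small NumPy sketch. The function names `penalty_gradient` and `gradient_step` and the example numbers are hypothetical, and the intercept (index 0) is assumed to be unpenalized.

```python
import numpy as np

def penalty_gradient(beta, lam, kind):
    """Extra gradient term added by the penalty; index 0 (intercept) is not penalized."""
    grad = np.zeros_like(beta, dtype=float)
    if kind == "ridge":
        grad[1:] = 2.0 * lam * beta[1:]       # pull proportional to the coefficient
    elif kind == "lasso":
        grad[1:] = lam * np.sign(beta[1:])    # constant push; np.sign(0) is 0
    else:
        raise ValueError("kind must be 'ridge' or 'lasso'")
    return grad

def gradient_step(X, y, beta, lam, kind, learning_rate=0.1):
    """One gradient descent update on MSE + lam * penalty."""
    n = len(y)
    usual_gradient = (2.0 / n) * X.T @ (X @ beta - y)  # gradient of the MSE
    return beta - learning_rate * (usual_gradient + penalty_gradient(beta, lam, kind))

beta = np.array([0.5, 4.0, -0.1])
print(penalty_gradient(beta, lam=0.1, kind="ridge"))  # large pull on 4.0, tiny pull on -0.1
print(penalty_gradient(beta, lam=0.1, kind="lasso"))  # same-sized push on both
```

With λ = 0.1, Ridge pulls on the coefficient 4.0 forty times harder than on −0.1, while Lasso pushes on both with the same strength. That is why small coefficients barely move under Ridge but keep sliding toward zero under Lasso.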

8. A visual example (conceptual)

Imagine the same degree‑10 polynomial:

  • No regularization: chaotic curve, huge coefficients.
  • Ridge: smooth curve, small coefficients.
  • Lasso: smooth curve, simple equation (only x and x² survive).

Three treatments, three outcomes. Only Ridge and Lasso tame the chaos.

9. When to use regularization

Use regularization when:

  • validation error ≫ training error
  • coefficients are huge
  • you have many features
  • you use high‑degree polynomials
  • features are correlated
  • the model is unstable

Skip regularization when:

  • you have lots of data and few features
  • the model already generalizes well
  • the problem is lack of expressiveness, not excess

10. In summary

  • Overfitting often shows up as coefficients that explode.
  • Regularization penalizes that explosion.
  • Ridge shrinks coefficients.
  • Lasso eliminates coefficients.
  • λ controls how strict we are.
  • Training still follows the gradient descent recipe; the gradient just gains a penalty term.
  • Regularization is the seat belt for flexible models.