Regularizers

Regularization = anything that constrains a model to fight overfitting — it trades a little more bias for a lot less variance (see Bias-Variance Tradeoff). The classic form adds a penalty on the size of the weights to the loss, so the optimiser prefers simpler (smaller-weight) solutions.

The penalty R(w) is scaled by a strength λ ≥ 0, so the objective becomes Loss(w) + λ·R(w). λ = 0 → no regularization (overfit risk); large λ → strong shrinkage (underfit risk).
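A minimal sketch of how λ trades fit against weight size, assuming a one-weight linear model with squared-error loss and a squared-weight penalty (the function names and toy data are made up for illustration):

```python
def penalized_loss(w, xs, ys, lam):
    """Squared-error loss plus lam * R(w), with R(w) = w**2 for a single weight."""
    data_loss = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return data_loss + lam * w ** 2


def ridge_weight_1d(xs, ys, lam):
    """Minimiser of penalized_loss: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data generated by y = 2x, so the unregularized weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 14.0, 1000.0):
    w = ridge_weight_1d(xs, ys, lam)
    print(f"lambda={lam:>7}: w={w:.4f}")
```

With λ = 0 the fit recovers w = 2; as λ grows the weight is pulled toward 0 and the model starts to underfit.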

L1 (Lasso)

L1 (Lasso): sums the absolute values of the weights; penalty R(w) = Σ |wᵢ|.

  • Can exclude useless variables from the model: it drives some weights exactly to 0 → sparse models, automatic feature selection.
  • Geometric reason: the L1 “diamond” constraint region has corners on the axes, so the optimum often lands on an axis (a zero weight).
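The exact zeros can be seen in a simplified setting. Assume each weight is fitted separately by minimising 0.5·(w − z)² + λ|w|, where z is the weight the data alone would choose (this per-weight objective is an assumption for illustration; it holds for orthonormal features, not in general). The minimiser is the soft-thresholding rule sketched here:

```python
def soft_threshold(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*abs(w) for one weight."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # any |z| <= lam is set exactly to zero


unpenalized = [3.0, -0.4, 0.1, -2.5]  # hypothetical unregularized weights
lam = 0.5
sparse = [soft_threshold(z, lam) for z in unpenalized]
print(sparse)  # [2.5, 0.0, 0.0, -2.0]
```

Weights whose unpenalized magnitude is below λ are removed entirely; the rest are shifted toward 0 by λ.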

L2 (Ridge)

L2 (Ridge): sums the squared weights; penalty R(w) = Σ wᵢ².

  • Shrinks all weights smoothly toward 0 but rarely exactly to 0 → keeps all features, just small.
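  • With plain (stochastic) gradient descent this is the same thing as weight decay in neural networks; with adaptive optimisers such as Adam the two differ, which is why decoupled weight decay (AdamW) exists.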
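A small sketch of the link to weight decay, assuming plain gradient descent with learning rate `lr` and the penalty λ·Σ wᵢ² (function names and numbers are illustrative). The gradient of the penalty is 2λwᵢ, so one step on the penalized loss equals multiplying each weight by (1 − 2·lr·λ) and then taking the ordinary gradient step:

```python
def sgd_step_l2(w, grad, lr, lam):
    """One gradient step on loss + lam * sum(w_i**2); the penalty gradient is 2*lam*w_i."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]


def sgd_step_weight_decay(w, grad, lr, lam):
    """Shrink each weight by (1 - 2*lr*lam), then take the plain gradient step."""
    return [(1.0 - 2.0 * lr * lam) * wi - lr * gi for wi, gi in zip(w, grad)]


w = [1.0, -2.0, 0.5]
grad = [0.3, -0.1, 0.0]  # hypothetical gradient of the data loss
a = sgd_step_l2(w, grad, lr=0.1, lam=0.1)
b = sgd_step_weight_decay(w, grad, lr=0.1, lam=0.1)
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))  # True

# With a zero data gradient the weight decays geometrically but never hits 0.
w_single = [1.0]
for _ in range(100):
    w_single = sgd_step_weight_decay(w_single, [0.0], lr=0.1, lam=0.1)
print(w_single[0] > 0.0)  # True
```

The second loop illustrates the "small but rarely exactly 0" behaviour: each step multiplies the weight by 0.98.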

                    L1 (Lasso)                    L2 (Ridge)
Penalty             Σ|wᵢ|                         Σwᵢ²
Effect on weights   some → exactly 0 (sparse)     all shrunk, rarely 0
Use when            you want feature selection    you want stable shrinkage

Beyond L1/L2 — regularization in deep learning

The same goal (less variance / less overfitting) shows up in Neural Networks & Deep Learning as:

  • Weight decay = L2 on the network weights.
  • Dropout — randomly deactivate neurons during training (trains an implicit ensemble of sub-networks).
  • Early stopping — stop when validation loss starts rising.
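
Dropout and early stopping can be sketched in a few lines. The scaling of surviving units by 1/(1 − p) ("inverted dropout") and the `patience` parameter are common conventions assumed here, not something these notes specify; all names are illustrative.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop and
    scale survivors by 1/(1 - p_drop); at inference, pass through unchanged."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping the scan once the
    validation loss has failed to improve for `patience` epochs in a row."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0], p_drop=0.5, rng=rng))

# Hypothetical validation curve: improves, then starts rising after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
print(early_stopping_epoch(val_losses, patience=2))  # 2
```

Training halts at epoch 4, so the later value 0.50 is never seen; that is the price of stopping early, and `patience` controls how much noise in the validation curve is tolerated.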

And in Support Vector Machines, the soft-margin C parameter is a regularizer: small C → wider margin, more regularization; large C → fewer violations, risk of overfitting.
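
One way to see why C acts as an inverse regularization strength, assuming the standard soft-margin primal with slack variables ξᵢ (notation not used elsewhere in these notes):

```latex
\min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i
\quad\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
\qquad\Longleftrightarrow\qquad
\min_{w,\,b,\,\xi}\; \sum_i \xi_i + \frac{1}{2C}\,\lVert w\rVert^2
```

Dividing the objective by C does not change the minimiser, so in the Loss(w) + λ·R(w) form with R(w) = Σ wᵢ² the SVM corresponds to λ = 1/(2C): small C means large λ.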

Created: 29-01-25 17:16