Gradient Descent

methods-of-ai

Gradient Descent (GD) is the workhorse optimizer of modern AI. It’s the continuous version of Hill Climbing: at each step, move in the direction of steepest descent (the negative gradient of the loss function). Every neural network ever trained — GPT-4, Claude, AlphaFold, Stable Diffusion — was optimized with some variant of gradient descent.

GD beats Hill Climbing whenever you have a differentiable loss function: instead of evaluating all neighbors to find the best, you compute one number (the gradient) that tells you both direction and magnitude of the steepest descent for free. This is exponentially more efficient than search-based optimization in high dimensions.

The update rule fits on one line:

θ ← θ − η · ∇L(θ)

where θ are the parameters, ∇L(θ) is the gradient of the loss, and η is the learning rate. That’s the entire algorithm. Everything else (SGD, momentum, Adam, AdamW) is a refinement of this update.

Why it beats Hill Climbing

Hill ClimbingGradient Descent
Evaluates all (or many) neighbors to find the bestComputes one vector (the gradient) that points in the steepest direction
Cost per step: O(neighborhood size) — exponential in dimensionCost per step: O(d) — linear in dimension
Works in any search space (discrete, continuous, weird)Requires differentiable loss function
Stops at local optimaSame problem — but tricks exist (momentum, restarts, SGD noise)

In 1 dimension: same idea. In 1000 dimensions: HC needs ≥1000 evaluations per step; GD needs 1 (the gradient computation, via Gradient Backpropagation). This is why every neural network uses GD, not HC.

The variants — what gets refined and why

VariantWhat changesWhy
GD (Batch GD)Use the full dataset to compute ∇L each stepMost accurate gradient, but slow per step on large data
SGD (Stochastic GD)Use 1 sample per stepFast per step; noisy → can escape some local minima
Mini-batch SGDUse a batch (e.g. 32, 256) per stepSweet spot — fast + reasonably accurate gradient. Default in deep learning.
MomentumUpdate incorporates a fraction of the previous update: v ← βv + ∇L; θ ← θ − ηvAccelerates in consistent directions; dampens oscillation in narrow valleys
NesterovLook-ahead momentum: compute gradient at θ − βv, not θSlightly faster convergence than plain momentum
AdaGradPer-parameter learning rate scaled by historical gradient magnitudeGood for sparse features; LR decays too fast in long training
RMSpropAdaGrad with exponential moving average of squared gradients (avoids LR collapse)Tieleman & Hinton, 2012; precursor to Adam
AdamRMSprop + momentum (1st + 2nd moments of gradient)Default optimizer in deep learning (Kingma 2014)
AdamWAdam with decoupled weight decay (proper L2 regularization)Loshchilov 2017; current default in transformer training

⚠️ Exam-relevant intuition: Adam combines two ideas:

  1. Momentum (1st moment) — smooths out the direction
  2. Adaptive learning rate (2nd moment) — auto-tunes step size per parameter

That’s why Adam is so robust to learning-rate choice — it auto-tunes.

See it in code — 4 optimizers on the same loss landscape

Classic visualization: contour plot of a non-trivial 2D loss + trajectories of GD, Momentum, Nesterov, Adam from the same starting point.

What to see:

  • GD (red) zig-zags in the narrow direction (θ₂) — classic ravine behavior. Slow.
  • Momentum (blue) dampens the zigzag — uses past updates to smooth the trajectory. Faster.
  • Nesterov (green) very similar to momentum but slightly more decisive (look-ahead).
  • Adam (purple) auto-scales per-dimension — handles the elongated bowl gracefully even with a much larger learning rate.
  • Log-scale loss panel makes the speed differences obvious: Adam reaches 10⁻⁸ in ~30 steps; vanilla GD plateaus at 10⁻³.

This is the visual story of why modern deep learning uses Adam by default — robustness to learning rate + adaptive per-parameter scaling.

⚠️ Common exam-relevant traps and gotchas

  1. Gradient Descent ≠ Backpropagation. Backprop = how you compute gradients in a neural net (chain rule, layer by layer). GD = how you use gradients to update parameters. They’re complementary: backprop computes ∇L, GD applies the update. See Gradient Backpropagation.
  2. SGD ≠ GD with batch=1. Strictly, that’s true — but in practice “SGD” usually means mini-batch SGD. Pure batch=1 is too noisy for most problems.
  3. Adam vs. SGD: not always Adam wins. Adam is the default for transformers and most deep learning, but plain SGD with momentum + good LR schedule often beats Adam for vision tasks (ResNet, ImageNet). The “Adam always wins” rule has been broken for years.
  4. Learning rate is the most critical hyperparameter. Too high → diverges. Too low → never converges. Modern practice: learning rate schedules (cosine, warmup, ReduceLROnPlateau).
  5. GD can still get stuck in local minima. Convexity is the only guarantee. For neural nets, the loss surface has many local minima — empirically, most of them are “good enough” (Choromanska et al. 2015), but no theoretical guarantee.

Where Gradient Descent is used today

The answer is: everywhere in deep learning, AND in classical ML.

  • All neural network training — GPT-4, Claude, Gemini, Stable Diffusion, AlphaFold, every model you’ve heard of
  • Logistic regression / linear regression — closed-form solutions exist but GD scales better to huge datasets
  • Matrix factorization — recommendation systems (Netflix, Spotify) trained with SGD on user-item interaction matrices
  • Word embeddings — Word2Vec, GloVe trained with SGD on co-occurrence matrices
  • Bayesian inference — variational inference uses GD to optimize the ELBO
  • Reinforcement learning — policy gradients (PPO, SAC) use GD on the policy parameters
  • Hyperparameter optimization — when the search space is differentiable (DARTS-style)
  • Adversarial attacks — generate adversarial examples via GD on input pixels

Where Gradient Descent doesn’t apply — and what to use instead

DomainUse what insteadWhy
Non-differentiable objectiveSimulated Annealing, Genetic Algorithms, BayesOptNo gradient available — black-box methods only
Discrete / combinatorialConstraint Satisfaction Problems (CSP, SAT), MILPGradients don’t exist in discrete spaces
Expensive function evaluationBayesian Optimization (Optuna, scikit-optimize)BayesOpt is more sample-efficient (10–100× fewer evals than GD)
Tabular dataXGBoost / LightGBM (gradient boosting of trees, not GD on parameters)Tree-based methods often beat NN-based GD on tabular
Quantum / neuromorphic hardwareEquilibrium Propagation, predictive codingBackprop + GD doesn’t map onto these hardware substrates

⚠️ Subtle point: Gradient Boosting (XGBoost) uses a form of gradient descent — but in function space (fitting weak learners to residuals), not parameter space. Same idea, different domain.

Connection to Hill Climbing — the bridge

This is the conceptual unification you should walk into the exam knowing:

StepHill ClimbingGradient Descent
Generate candidatesEnumerate neighborsCompute ∇L (one operation)
Evaluatef(neighbor) for eachNone (gradient already tells you)
Pick directionBest-improving neighbor−∇L
Step sizeFixed (move to the neighbor)Learning rate × gradient magnitude
TerminationNo improving neighbor‖∇L‖ < ε

Gradient descent IS continuous hill climbing with an analytic shortcut. That’s the whole insight. Once you can compute ∇L cheaply, you don’t need to enumerate neighbors anymore.

See also

Tags: methods-of-ai optimization gradient-descent sgd adam
Created: 18-05-26