Gradient Descent (GD) is the workhorse optimizer of modern AI. It’s the continuous version of Hill Climbing: at each step, move in the direction of steepest descent (the negative gradient of the loss function). Every neural network ever trained — GPT-4, Claude, AlphaFold, Stable Diffusion — was optimized with some variant of gradient descent.
GD beats Hill Climbing whenever you have a differentiable loss function: instead of evaluating all neighbors to find the best, you compute one number (the gradient) that tells you both direction and magnitude of the steepest descent for free. This is exponentially more efficient than search-based optimization in high dimensions.
The update rule fits on one line:
θ ← θ − η · ∇L(θ)
where θ are the parameters, ∇L(θ) is the gradient of the loss, and η is the learning rate. That’s the entire algorithm. Everything else (SGD, momentum, Adam, AdamW) is a refinement of this update.
Why it beats Hill Climbing
Hill Climbing
Gradient Descent
Evaluates all (or many) neighbors to find the best
Computes one vector (the gradient) that points in the steepest direction
Cost per step: O(neighborhood size) — exponential in dimension
Cost per step: O(d) — linear in dimension
Works in any search space (discrete, continuous, weird)
Requires differentiable loss function
Stops at local optima
Same problem — but tricks exist (momentum, restarts, SGD noise)
In 1 dimension: same idea. In 1000 dimensions: HC needs ≥1000 evaluations per step; GD needs 1 (the gradient computation, via Gradient Backpropagation). This is why every neural network uses GD, not HC.
The variants — what gets refined and why
Variant
What changes
Why
GD (Batch GD)
Use the full dataset to compute ∇L each step
Most accurate gradient, but slow per step on large data
SGD (Stochastic GD)
Use 1 sample per step
Fast per step; noisy → can escape some local minima
Mini-batch SGD
Use a batch (e.g. 32, 256) per step
Sweet spot — fast + reasonably accurate gradient. Default in deep learning.
Momentum
Update incorporates a fraction of the previous update: v ← βv + ∇L; θ ← θ − ηv
Accelerates in consistent directions; dampens oscillation in narrow valleys
Nesterov
Look-ahead momentum: compute gradient at θ − βv, not θ
Slightly faster convergence than plain momentum
AdaGrad
Per-parameter learning rate scaled by historical gradient magnitude
Good for sparse features; LR decays too fast in long training
RMSprop
AdaGrad with exponential moving average of squared gradients (avoids LR collapse)
Tieleman & Hinton, 2012; precursor to Adam
Adam
RMSprop + momentum (1st + 2nd moments of gradient)
Default optimizer in deep learning (Kingma 2014)
AdamW
Adam with decoupled weight decay (proper L2 regularization)
Loshchilov 2017; current default in transformer training
⚠️ Exam-relevant intuition: Adam combines two ideas:
That’s why Adam is so robust to learning-rate choice — it auto-tunes.
See it in code — 4 optimizers on the same loss landscape
Classic visualization: contour plot of a non-trivial 2D loss + trajectories of GD, Momentum, Nesterov, Adam from the same starting point.
🐍 Code anzeigen / ausblenden
# Pyodide / Obsidian Execute Code: install matplotlib first.# In normal Python (terminal / Jupyter), delete the next 2 lines.import micropipawait micropip.install("matplotlib")import numpy as npimport matplotlib.pyplot as plt# A loss landscape with a narrow curved valley (Rosenbrock-like, simplified)def loss(theta): x, y = theta return 0.5 * (x**2 + 10 * y**2) # elongated bowl — narrow in y, wide in xdef grad(theta): x, y = theta return np.array([x, 10 * y])# --- Optimizers ---def gd(theta, lr=0.1, steps=50): history = [theta.copy()] for _ in range(steps): theta = theta - lr * grad(theta) history.append(theta.copy()) return np.array(history)def momentum(theta, lr=0.1, beta=0.9, steps=50): v = np.zeros_like(theta); history = [theta.copy()] for _ in range(steps): v = beta * v + grad(theta) theta = theta - lr * v history.append(theta.copy()) return np.array(history)def nesterov(theta, lr=0.1, beta=0.9, steps=50): v = np.zeros_like(theta); history = [theta.copy()] for _ in range(steps): lookahead = theta - lr * beta * v v = beta * v + grad(lookahead) theta = theta - lr * v history.append(theta.copy()) return np.array(history)def adam(theta, lr=0.5, b1=0.9, b2=0.999, eps=1e-8, steps=50): m = np.zeros_like(theta); v = np.zeros_like(theta); history = [theta.copy()] for t in range(1, steps + 1): g = grad(theta) m = b1 * m + (1 - b1) * g v = b2 * v + (1 - b2) * g * g m_hat = m / (1 - b1**t) v_hat = v / (1 - b2**t) theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) history.append(theta.copy()) return np.array(history)# Run all 4 from the same starttheta0 = np.array([4.0, 2.0])hist = { 'GD (lr=0.1)': gd(theta0.copy(), lr=0.1), 'Momentum (β=0.9)': momentum(theta0.copy(), lr=0.05, beta=0.9), 'Nesterov (β=0.9)': nesterov(theta0.copy(), lr=0.05, beta=0.9), 'Adam (lr=0.5)': adam(theta0.copy(), lr=0.5),}# --- Plot ---fig, axes = plt.subplots(1, 2, figsize=(14, 5))# Left: trajectories on the loss contourX, Y = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-3, 3, 200))Z = 0.5 * (X**2 + 10 * Y**2)axes[0].contour(X, Y, Z, levels=30, cmap='Greys', alpha=0.6)colors = {'GD (lr=0.1)': '#e41a1c', 'Momentum (β=0.9)': '#377eb8', 'Nesterov (β=0.9)': '#4daf4a', 'Adam (lr=0.5)': '#984ea3'}for name, traj in hist.items(): axes[0].plot(traj[:, 0], traj[:, 1], 'o-', color=colors[name], lw=1.5, markersize=4, alpha=0.8, label=name)axes[0].scatter(*theta0, s=200, color='black', marker='s', zorder=5, label='start')axes[0].scatter(0, 0, s=200, color='gold', marker='*', zorder=5, label='optimum')axes[0].set(xlabel='θ₁', ylabel='θ₂', title='Optimizer trajectories on elongated bowl\n(narrow valley in θ₂)')axes[0].legend(loc='upper right', fontsize=9); axes[0].grid(alpha=0.3)# Right: loss over iterations (log scale)for name, traj in hist.items(): losses = [loss(t) for t in traj] axes[1].plot(losses, color=colors[name], lw=2, label=name)axes[1].set_yscale('log')axes[1].set(xlabel='iteration', ylabel='loss (log scale)', title='Loss convergence — same start, 4 optimizers')axes[1].legend(loc='upper right'); axes[1].grid(alpha=0.3, which='both')plt.tight_layout(); plt.show()
What to see:
GD (red) zig-zags in the narrow direction (θ₂) — classic ravine behavior. Slow.
Momentum (blue) dampens the zigzag — uses past updates to smooth the trajectory. Faster.
Nesterov (green) very similar to momentum but slightly more decisive (look-ahead).
Adam (purple) auto-scales per-dimension — handles the elongated bowl gracefully even with a much larger learning rate.
Log-scale loss panel makes the speed differences obvious: Adam reaches 10⁻⁸ in ~30 steps; vanilla GD plateaus at 10⁻³.
This is the visual story of why modern deep learning uses Adam by default — robustness to learning rate + adaptive per-parameter scaling.
⚠️ Common exam-relevant traps and gotchas
Gradient Descent ≠ Backpropagation. Backprop = how you compute gradients in a neural net (chain rule, layer by layer). GD = how you use gradients to update parameters. They’re complementary: backprop computes ∇L, GD applies the update. See Gradient Backpropagation.
SGD ≠ GD with batch=1. Strictly, that’s true — but in practice “SGD” usually means mini-batch SGD. Pure batch=1 is too noisy for most problems.
Adam vs. SGD: not always Adam wins. Adam is the default for transformers and most deep learning, but plain SGD with momentum + good LR schedule often beats Adam for vision tasks (ResNet, ImageNet). The “Adam always wins” rule has been broken for years.
Learning rate is the most critical hyperparameter. Too high → diverges. Too low → never converges. Modern practice: learning rate schedules (cosine, warmup, ReduceLROnPlateau).
GD can still get stuck in local minima. Convexity is the only guarantee. For neural nets, the loss surface has many local minima — empirically, most of them are “good enough” (Choromanska et al. 2015), but no theoretical guarantee.
Where Gradient Descent is used today
The answer is: everywhere in deep learning, AND in classical ML.
All neural network training — GPT-4, Claude, Gemini, Stable Diffusion, AlphaFold, every model you’ve heard of
Logistic regression / linear regression — closed-form solutions exist but GD scales better to huge datasets
Matrix factorization — recommendation systems (Netflix, Spotify) trained with SGD on user-item interaction matrices
Word embeddings — Word2Vec, GloVe trained with SGD on co-occurrence matrices
Bayesian inference — variational inference uses GD to optimize the ELBO
Reinforcement learning — policy gradients (PPO, SAC) use GD on the policy parameters
Hyperparameter optimization — when the search space is differentiable (DARTS-style)
Adversarial attacks — generate adversarial examples via GD on input pixels
Where Gradient Descent doesn’t apply — and what to use instead
BayesOpt is more sample-efficient (10–100× fewer evals than GD)
Tabular data
XGBoost / LightGBM (gradient boosting of trees, not GD on parameters)
Tree-based methods often beat NN-based GD on tabular
Quantum / neuromorphic hardware
Equilibrium Propagation, predictive coding
Backprop + GD doesn’t map onto these hardware substrates
⚠️ Subtle point: Gradient Boosting (XGBoost) uses a form of gradient descent — but in function space (fitting weak learners to residuals), not parameter space. Same idea, different domain.
Connection to Hill Climbing — the bridge
This is the conceptual unification you should walk into the exam knowing:
Step
Hill Climbing
Gradient Descent
Generate candidates
Enumerate neighbors
Compute ∇L (one operation)
Evaluate
f(neighbor) for each
None (gradient already tells you)
Pick direction
Best-improving neighbor
−∇L
Step size
Fixed (move to the neighbor)
Learning rate × gradient magnitude
Termination
No improving neighbor
‖∇L‖ < ε
Gradient descent IS continuous hill climbing with an analytic shortcut. That’s the whole insight. Once you can compute ∇L cheaply, you don’t need to enumerate neighbors anymore.
See also
Loss Surface — the mathematical object GD descends; how high-D, how it’s “generated”, how algorithms work on it