Gradient Descent

methods-of-ai

Gradient Descent (GD) is the workhorse optimizer of modern AI. It’s the continuous version of Hill Climbing: at each step, move in the direction of steepest descent (the negative gradient of the loss function). Every neural network ever trained — GPT-4, Claude, AlphaFold, Stable Diffusion — was optimized with some variant of gradient descent.

GD beats Hill Climbing whenever you have a differentiable loss function: instead of evaluating all neighbors to find the best, you compute one number (the gradient) that tells you both direction and magnitude of the steepest descent for free. This is exponentially more efficient than search-based optimization in high dimensions.

The update rule fits on one line:

θ ← θ − η · ∇L(θ)

where θ are the parameters, ∇L(θ) is the gradient of the loss, and η is the learning rate. That’s the entire algorithm. Everything else (SGD, momentum, Adam, AdamW) is a refinement of this update.

Why it beats Hill Climbing

Hill Climbing	Gradient Descent
Evaluates all (or many) neighbors to find the best	Computes one vector (the gradient) that points in the steepest direction
Cost per step: O(neighborhood size) — exponential in dimension	Cost per step: O(d) — linear in dimension
Works in any search space (discrete, continuous, weird)	Requires differentiable loss function
Stops at local optima	Same problem — but tricks exist (momentum, restarts, SGD noise)

In 1 dimension: same idea. In 1000 dimensions: HC needs ≥1000 evaluations per step; GD needs 1 (the gradient computation, via Gradient Backpropagation). This is why every neural network uses GD, not HC.

The variants — what gets refined and why

Variant	What changes	Why
GD (Batch GD)	Use the full dataset to compute ∇L each step	Most accurate gradient, but slow per step on large data
SGD (Stochastic GD)	Use 1 sample per step	Fast per step; noisy → can escape some local minima
Mini-batch SGD	Use a batch (e.g. 32, 256) per step	Sweet spot — fast + reasonably accurate gradient. Default in deep learning.
Momentum	Update incorporates a fraction of the previous update: `v ← βv + ∇L; θ ← θ − ηv`	Accelerates in consistent directions; dampens oscillation in narrow valleys
Nesterov	Look-ahead momentum: compute gradient at `θ − βv`, not `θ`	Slightly faster convergence than plain momentum
AdaGrad	Per-parameter learning rate scaled by historical gradient magnitude	Good for sparse features; LR decays too fast in long training
RMSprop	AdaGrad with exponential moving average of squared gradients (avoids LR collapse)	Tieleman & Hinton, 2012; precursor to Adam
Adam	RMSprop + momentum (1st + 2nd moments of gradient)	Default optimizer in deep learning (Kingma 2014)
AdamW	Adam with decoupled weight decay (proper L2 regularization)	Loshchilov 2017; current default in transformer training

⚠️ Exam-relevant intuition: Adam combines two ideas:

Momentum (1st moment) — smooths out the direction
Adaptive learning rate (2nd moment) — auto-tunes step size per parameter

That’s why Adam is so robust to learning-rate choice — it auto-tunes.

See it in code — 4 optimizers on the same loss landscape

Classic visualization: contour plot of a non-trivial 2D loss + trajectories of GD, Momentum, Nesterov, Adam from the same starting point.

🐍 Code anzeigen / ausblenden

# Pyodide / Obsidian Execute Code: install matplotlib first.
# In normal Python (terminal / Jupyter), delete the next 2 lines.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
# A loss landscape with a narrow curved valley (Rosenbrock-like, simplified)
def loss(theta):
    x, y = theta
    return 0.5 * (x**2 + 10 * y**2)   # elongated bowl — narrow in y, wide in x
 
def grad(theta):
    x, y = theta
    return np.array([x, 10 * y])
 
# --- Optimizers ---
def gd(theta, lr=0.1, steps=50):
    history = [theta.copy()]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        history.append(theta.copy())
    return np.array(history)
 
def momentum(theta, lr=0.1, beta=0.9, steps=50):
    v = np.zeros_like(theta); history = [theta.copy()]
    for _ in range(steps):
        v = beta * v + grad(theta)
        theta = theta - lr * v
        history.append(theta.copy())
    return np.array(history)
 
def nesterov(theta, lr=0.1, beta=0.9, steps=50):
    v = np.zeros_like(theta); history = [theta.copy()]
    for _ in range(steps):
        lookahead = theta - lr * beta * v
        v = beta * v + grad(lookahead)
        theta = theta - lr * v
        history.append(theta.copy())
    return np.array(history)
 
def adam(theta, lr=0.5, b1=0.9, b2=0.999, eps=1e-8, steps=50):
    m = np.zeros_like(theta); v = np.zeros_like(theta); history = [theta.copy()]
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(theta.copy())
    return np.array(history)
 
# Run all 4 from the same start
theta0 = np.array([4.0, 2.0])
hist = {
    'GD (lr=0.1)':       gd(theta0.copy(),       lr=0.1),
    'Momentum (β=0.9)':  momentum(theta0.copy(), lr=0.05, beta=0.9),
    'Nesterov (β=0.9)':  nesterov(theta0.copy(), lr=0.05, beta=0.9),
    'Adam (lr=0.5)':     adam(theta0.copy(),     lr=0.5),
}
 
# --- Plot ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
 
# Left: trajectories on the loss contour
X, Y = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-3, 3, 200))
Z = 0.5 * (X**2 + 10 * Y**2)
axes[0].contour(X, Y, Z, levels=30, cmap='Greys', alpha=0.6)
colors = {'GD (lr=0.1)': '#e41a1c', 'Momentum (β=0.9)': '#377eb8',
          'Nesterov (β=0.9)': '#4daf4a', 'Adam (lr=0.5)': '#984ea3'}
for name, traj in hist.items():
    axes[0].plot(traj[:, 0], traj[:, 1], 'o-', color=colors[name], lw=1.5,
                 markersize=4, alpha=0.8, label=name)
axes[0].scatter(*theta0, s=200, color='black', marker='s', zorder=5, label='start')
axes[0].scatter(0, 0,     s=200, color='gold',  marker='*', zorder=5, label='optimum')
axes[0].set(xlabel='θ₁', ylabel='θ₂',
            title='Optimizer trajectories on elongated bowl\n(narrow valley in θ₂)')
axes[0].legend(loc='upper right', fontsize=9); axes[0].grid(alpha=0.3)
 
# Right: loss over iterations (log scale)
for name, traj in hist.items():
    losses = [loss(t) for t in traj]
    axes[1].plot(losses, color=colors[name], lw=2, label=name)
axes[1].set_yscale('log')
axes[1].set(xlabel='iteration', ylabel='loss (log scale)',
            title='Loss convergence — same start, 4 optimizers')
axes[1].legend(loc='upper right'); axes[1].grid(alpha=0.3, which='both')
 
plt.tight_layout(); plt.show()

What to see:

GD (red) zig-zags in the narrow direction (θ₂) — classic ravine behavior. Slow.
Momentum (blue) dampens the zigzag — uses past updates to smooth the trajectory. Faster.
Nesterov (green) very similar to momentum but slightly more decisive (look-ahead).
Adam (purple) auto-scales per-dimension — handles the elongated bowl gracefully even with a much larger learning rate.
Log-scale loss panel makes the speed differences obvious: Adam reaches 10⁻⁸ in ~30 steps; vanilla GD plateaus at 10⁻³.

This is the visual story of why modern deep learning uses Adam by default — robustness to learning rate + adaptive per-parameter scaling.

⚠️ Common exam-relevant traps and gotchas

Gradient Descent ≠ Backpropagation. Backprop = how you compute gradients in a neural net (chain rule, layer by layer). GD = how you use gradients to update parameters. They’re complementary: backprop computes ∇L, GD applies the update. See Gradient Backpropagation.
SGD ≠ GD with batch=1. Strictly, that’s true — but in practice “SGD” usually means mini-batch SGD. Pure batch=1 is too noisy for most problems.
Adam vs. SGD: not always Adam wins. Adam is the default for transformers and most deep learning, but plain SGD with momentum + good LR schedule often beats Adam for vision tasks (ResNet, ImageNet). The “Adam always wins” rule has been broken for years.
Learning rate is the most critical hyperparameter. Too high → diverges. Too low → never converges. Modern practice: learning rate schedules (cosine, warmup, ReduceLROnPlateau).
GD can still get stuck in local minima. Convexity is the only guarantee. For neural nets, the loss surface has many local minima — empirically, most of them are “good enough” (Choromanska et al. 2015), but no theoretical guarantee.

Where Gradient Descent is used today

The answer is: everywhere in deep learning, AND in classical ML.

All neural network training — GPT-4, Claude, Gemini, Stable Diffusion, AlphaFold, every model you’ve heard of
Logistic regression / linear regression — closed-form solutions exist but GD scales better to huge datasets
Matrix factorization — recommendation systems (Netflix, Spotify) trained with SGD on user-item interaction matrices
Word embeddings — Word2Vec, GloVe trained with SGD on co-occurrence matrices
Bayesian inference — variational inference uses GD to optimize the ELBO
Reinforcement learning — policy gradients (PPO, SAC) use GD on the policy parameters
Hyperparameter optimization — when the search space is differentiable (DARTS-style)
Adversarial attacks — generate adversarial examples via GD on input pixels

Where Gradient Descent doesn’t apply — and what to use instead

Domain	Use what instead	Why
Non-differentiable objective	Simulated Annealing, Genetic Algorithms, BayesOpt	No gradient available — black-box methods only
Discrete / combinatorial	Constraint Satisfaction Problems (CSP, SAT), MILP	Gradients don’t exist in discrete spaces
Expensive function evaluation	Bayesian Optimization (Optuna, scikit-optimize)	BayesOpt is more sample-efficient (10–100× fewer evals than GD)
Tabular data	XGBoost / LightGBM (gradient boosting of trees, not GD on parameters)	Tree-based methods often beat NN-based GD on tabular
Quantum / neuromorphic hardware	Equilibrium Propagation, predictive coding	Backprop + GD doesn’t map onto these hardware substrates

⚠️ Subtle point: Gradient Boosting (XGBoost) uses a form of gradient descent — but in function space (fitting weak learners to residuals), not parameter space. Same idea, different domain.

Connection to Hill Climbing — the bridge

This is the conceptual unification you should walk into the exam knowing:

Step	Hill Climbing	Gradient Descent
Generate candidates	Enumerate neighbors	Compute ∇L (one operation)
Evaluate	f(neighbor) for each	None (gradient already tells you)
Pick direction	Best-improving neighbor	−∇L
Step size	Fixed (move to the neighbor)	Learning rate × gradient magnitude
Termination	No improving neighbor	‖∇L‖ < ε

Gradient descent IS continuous hill climbing with an analytic shortcut. That’s the whole insight. Once you can compute ∇L cheaply, you don’t need to enumerate neighbors anymore.

Brain Online

Explorer

Gradient Descent

Gradient Descent

Why it beats Hill Climbing

The variants — what gets refined and why

See it in code — 4 optimizers on the same loss landscape

⚠️ Common exam-relevant traps and gotchas

Where Gradient Descent is used today

Where Gradient Descent doesn’t apply — and what to use instead

Connection to Hill Climbing — the bridge

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis