The loss surface is the graph of the loss function L(θ) over all possible parameter values θ. It is a purely mathematical object — it exists implicitly the moment you define a model + loss + dataset. Algorithms like Gradient Descent, Hill Climbing, or Simulated Annealing don’t “build” the surface; they probe it locally by evaluating L at one point at a time.
This single confusion (“does SGD build the loss surface?”) trips up almost everyone. Short answer: no. The surface is the mathematical object you’re trying to descend; the algorithm only ever sees a thin slice — usually just one point and its local gradient.
🎯 The 6 questions, answered directly
1. What IS the loss surface?
It’s the function L: ℝᴺ → ℝ that maps parameter vector θ ∈ ℝᴺ to a scalar loss value. The “surface” is the graph of this function — the set of points (θ, L(θ)) in ℝᴺ⁺¹.
For a linear regression with 2 weights (w₁, w₂) and MSE loss, this is literally a 3D surface (2D parameter plane + 1D loss = 3D plot). For deep nets it’s not a “surface” you can draw — but the math is identical.
2. How high-dimensional is it?
Dimension = number of trainable parameters + 1. The +1 is the loss value (the “height”).
| Model | Parameters | Loss surface lives in |
|---|---|---|
| Linear regression (1 feature + bias) | 2 | 3D (drawable) |
| Small MLP (e.g. 100 weights) | 100 | 101D |
| ResNet-50 | ~25 million | ~25,000,001D |
| GPT-3 | 175 billion | ~175,000,000,001D |
| GPT-4 (estimated) | ~1.8 trillion | ~10¹²D |
Humans can visualize at most 3D. Everything else is mathematics + projections.
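The counting rule above can be sketched for a fully connected network. The helper name `mlp_param_count` and the layer sizes are made up for illustration; real architectures (convolutions, attention, tied weights) count differently.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected net with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total

# 1 feature + bias: the linear-regression row of the table
n_linear = mlp_param_count([1, 1])
# A hypothetical small MLP: 4 inputs, 8 hidden units, 1 output
n_mlp = mlp_param_count([4, 8, 1])

for name, n in [("linear regression", n_linear), ("small MLP", n_mlp)]:
    print(f"{name}: {n} parameters, loss surface lives in {n + 1}D")
```

The "+1" is added only at the end: the parameters span the domain, and the loss value supplies the one extra axis.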
3. Is it 3D?
Only for toy models with exactly 2 parameters. Every 3D loss landscape picture you’ve ever seen (the rolling hills, the canyons, the saddle points) is either:
a real 2-parameter problem (regression on synthetic data), OR
a 2D slice of a high-dimensional surface — typically by picking 2 random directions δ₁, δ₂ in parameter space and plotting L(θ* + α·δ₁ + β·δ₂) over a grid of (α, β). The famous Li et al. 2018 paper “Visualizing the Loss Landscape of Neural Nets” uses this trick with “filter normalization” so the visualization is meaningful.
⚠️ Trap: these 3D plots can be deeply misleading — the true high-D surface has properties (like the prevalence of saddle points over minima) that don’t show up in a random 2D slice.
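A minimal sketch of the random-direction slice described above, on a hypothetical 10-parameter linear regression so that it runs in a fraction of a second. It uses plain unit-length directions, not the filter normalization of Li et al.; all names (`theta_star`, `d1`, `d2`, `slice_2d`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 10-parameter problem: linear regression on synthetic data
N_PARAMS = 10
X = rng.normal(size=(200, N_PARAMS))
y = X @ rng.normal(size=N_PARAMS) + 0.1 * rng.normal(size=200)

def loss(theta):
    return np.mean((X @ theta - y) ** 2)

# theta_star: the point the slice is centred on (here the least-squares fit)
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Two random directions scaled to unit length (NOT filter normalization)
d1 = rng.normal(size=N_PARAMS)
d1 /= np.linalg.norm(d1)
d2 = rng.normal(size=N_PARAMS)
d2 /= np.linalg.norm(d2)

# Evaluate L(theta* + a*d1 + b*d2) on a 21 x 21 grid of (a, b)
alphas = np.linspace(-1, 1, 21)
betas = np.linspace(-1, 1, 21)
slice_2d = np.array([[loss(theta_star + a * d1 + b * d2) for a in alphas]
                     for b in betas])

print(slice_2d.shape)
print("centre of slice:", slice_2d[10, 10], " loss at theta*:", loss(theta_star))
```

The resulting array is what `contourf` or `plot_surface` would draw. The other 8 directions of this 10-dimensional problem are simply not visible in the slice.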
4. How is it generated?
It isn’t generated — that’s the key insight. The surface is implicitly defined by:
L(θ) = (1/N) · Σᵢ ℓ(f_θ(xᵢ), yᵢ)
for your dataset {(xᵢ, yᵢ)} and your loss function ℓ. Once you fix the dataset and the model architecture, every possible θ already has a loss value — you just don’t know what it is until you evaluate.
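The formula translates almost symbol for symbol into code. This is a sketch with a toy model and dataset; `model`, `per_example_loss` and the three data points are assumptions chosen so the numbers can be checked by hand.

```python
import numpy as np

def model(theta, x):
    """f_theta(x): a line with slope theta[0] and intercept theta[1]."""
    return theta[0] * x + theta[1]

def per_example_loss(y_pred, y):
    """Squared error for a single example."""
    return (y_pred - y) ** 2

def L(theta, xs, ys):
    """Average per-example loss over the whole dataset."""
    return np.mean([per_example_loss(model(theta, x), y) for x, y in zip(xs, ys)])

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([1.0, 3.0, 5.0])   # generated by y = 2x + 1, no noise

print(L((2.0, 1.0), xs, ys))     # 0.0: this theta reproduces the data exactly
print(L((0.0, 0.0), xs, ys))     # (1 + 9 + 25) / 3
```

Nothing is stored between calls: `L` is a pure function of θ once `xs` and `ys` are fixed, which is the sense in which the surface already exists.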
To visualize a slice, you sample: pick a grid of θ values, evaluate L(θ) at each one, plot. That’s how the pretty 3D pictures get made — expensive grid evaluation. Nobody can do this for a real neural net (175B-dim grid is infeasible) — so we make do with 2D slices.
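The infeasibility is plain arithmetic: a full grid with k points per axis needs k^N evaluations. A small sketch, working in log10 so the numbers stay representable (the function name is illustrative):

```python
import math

def log10_grid_evaluations(n_params, points_per_axis):
    """log10 of the number of loss evaluations in a full grid: points_per_axis ** n_params."""
    return n_params * math.log10(points_per_axis)

# The 80 x 80 grid of a 2-parameter plot: a few thousand evaluations
print(round(10 ** log10_grid_evaluations(2, 80)))

# 100 parameters at only 10 points each: a 1 followed by this many zeros
print(log10_grid_evaluations(100, 10))

# GPT-3-sized model at 10 points per axis: even the exponent is astronomically large
print(log10_grid_evaluations(175e9, 10))
```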
5. How do algorithms work on it?
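They don’t see the whole surface. They probe locally and step: evaluate L (and possibly its gradient) at the current point, decide where to go, move, repeat.
Bayesian Optimization adds memory to this loop: past observations → fits a surrogate model of the surface → picks the most informative next point.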
Crucially: none of these algorithms “knows” the global shape of the surface. They make local decisions hoping that local improvements lead somewhere good globally. That’s why local optima are such a problem.
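To make "probe locally and step" concrete, here is a sketch that counts every loss evaluation a gradient-descent loop makes. The toy loss, the wrapper `make_counted` and the step sizes are assumptions for illustration.

```python
def make_counted(loss_fn):
    """Wrap a loss so that every evaluation (probe of the surface) is counted."""
    calls = {"n": 0}
    def wrapped(theta):
        calls["n"] += 1
        return loss_fn(theta)
    return wrapped, calls

def toy_loss(theta):
    return (theta - 3.0) ** 2   # 1-parameter bowl with its minimum at theta = 3

loss, calls = make_counted(toy_loss)

def gd_step(theta, lr=0.1, eps=1e-4):
    """One step: two probes next to theta give a slope estimate; nothing else is kept."""
    slope = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
    return theta - lr * slope

theta = 0.0
for _ in range(50):
    theta = gd_step(theta)

print(f"theta = {theta:.5f} after {calls['n']} loss evaluations")
```

The only state carried from one iteration to the next is `theta` itself; the 100 probed values are discarded as soon as each step is taken.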
6. Does the algorithm build the surface from samples it has seen?
No — with one exception.
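Gradient Descent (GD), SGD, Hill Climbing (HC), Simulated Annealing (SA), Genetic Algorithms (GA), Local Beam Search (LBS) and most RL methods keep no model of the surface. Each step they query L at new points, decide, and move on. What they carry forward is at most the current point or population plus a little running state (a velocity, a temperature), never a map of where the loss is high or low.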
Bayesian Optimization is the exception — it builds a surrogate (usually a Gaussian Process) over all observed (θ, L(θ)) pairs, and uses it to predict where the next best evaluation would be. This is precisely why BayesOpt is sample-efficient — it remembers and interpolates.
The trade-off: GD/SGD do millions of cheap local steps; BayesOpt does dozens of expensive informed steps. Choose based on whether evaluating L is cheap (millisecond) or expensive (overnight training run).
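A minimal sketch of the "remember and interpolate" idea, assuming a 1D search space, a zero-mean Gaussian Process with a fixed RBF kernel, and a lower-confidence-bound rule for choosing the next point. These are illustrative choices, not the configuration of any particular BayesOpt library, and `expensive_loss` is a stand-in for a costly training run.

```python
import numpy as np

def rbf(a, b, length=0.5):
    """Squared-exponential kernel between two 1D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_query)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def expensive_loss(x):
    return np.sin(3 * x) + 0.5 * x ** 2

candidates = np.linspace(-2, 2, 401)
x_obs = np.array([-2.0, 0.0, 2.0])        # three initial evaluations
y_obs = expensive_loss(x_obs)

for _ in range(12):
    # The surrogate is refit on ALL observations so far: this is the memory
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    x_next = candidates[np.argmin(mean - 2.0 * std)]   # optimistic guess (LCB)
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, expensive_loss(x_next))

best = x_obs[np.argmin(y_obs)]
print(f"best x = {best:.3f}, loss = {y_obs.min():.3f}, evaluations = {len(x_obs)}")
```

Every iteration costs a linear solve over all past observations, which is cheap for dozens of points and is why this style suits expensive losses and not million-step training loops.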
🖼️ See it in code — a real 3D loss surface
A 2-parameter linear regression y = w₁·x + w₂ with MSE loss has a 3D loss surface you can actually draw.
    # Pyodide / Obsidian Execute Code: install matplotlib first.
    import micropip
    await micropip.install("matplotlib")

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)

    # --- Toy dataset: y = 2x + 1 with noise ---
    np.random.seed(0)
    x = np.linspace(-2, 2, 30)
    y_true = 2.0 * x + 1.0 + np.random.normal(0, 0.3, x.shape)

    # --- Define MSE loss over (w1, w2) ---
    def loss(w1, w2):
        y_pred = w1 * x + w2
        return np.mean((y_pred - y_true) ** 2)

    # --- Evaluate L on a grid of (w1, w2) ---
    W1 = np.linspace(-1, 5, 80)
    W2 = np.linspace(-2, 4, 80)
    WW1, WW2 = np.meshgrid(W1, W2)
    L = np.vectorize(loss)(WW1, WW2)

    # --- Two views: 3D surface + 2D contour ---
    fig = plt.figure(figsize=(14, 5))

    ax1 = fig.add_subplot(1, 2, 1, projection='3d')
    ax1.plot_surface(WW1, WW2, L, cmap='viridis', alpha=0.85, edgecolor='none')
    # (2, 1) are the parameters that generated the data; the fitted minimum lies close by
    ax1.scatter(2.0, 1.0, loss(2.0, 1.0), color='red', s=120, label='generating params (2, 1)')
    ax1.set_xlabel('w₁ (slope)')
    ax1.set_ylabel('w₂ (bias)')
    ax1.set_zlabel('MSE loss')
    ax1.set_title('3D loss surface\n(2-param linear regression)')
    ax1.legend()

    ax2 = fig.add_subplot(1, 2, 2)
    contour = ax2.contourf(WW1, WW2, L, levels=20, cmap='viridis')
    ax2.contour(WW1, WW2, L, levels=20, colors='white', linewidths=0.3, alpha=0.5)
    ax2.scatter(2.0, 1.0, color='red', s=120, marker='*', label='generating params')
    ax2.set_xlabel('w₁')
    ax2.set_ylabel('w₂')
    ax2.set_title('Same surface as 2D contour (top view)')
    plt.colorbar(contour, ax=ax2, label='loss')
    ax2.legend()

    plt.tight_layout()
    plt.show()

    print(f"Loss at generating params (w1=2, w2=1): {loss(2.0, 1.0):.4f}")
    print(f"Loss at origin            (w1=0, w2=0): {loss(0.0, 0.0):.4f}")
    print(f"Loss at far point         (w1=5, w2=-2): {loss(5.0, -2.0):.4f}")
What this shows:
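The surface is convex (one global minimum, no local traps) because linear regression with MSE is a quadratic in θ whose Hessian is positive definite. This is why GD with a small enough learning rate always finds the optimum here.
The minimum sits near (w₁≈2, w₂≈1): close to, but not exactly at, the parameters that generated the data, because the noise shifts the least-squares fit slightly.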
Adding more parameters would make this undrawable: 3 params → 4D, 100 params → 101D.
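The convexity claim can be checked numerically. This sketch rebuilds the same toy dataset, forms the constant Hessian of the MSE, and solves for the minimizer directly with least squares; the variable names are illustrative.

```python
import numpy as np

np.random.seed(0)
x = np.linspace(-2, 2, 30)
y = 2.0 * x + 1.0 + np.random.normal(0, 0.3, x.shape)

# Design matrix: one column for the slope w1, one for the bias w2
A = np.column_stack([x, np.ones_like(x)])

# MSE = mean((A w - y)^2) is quadratic in w, so its Hessian is the constant (2/N) A^T A
hessian = 2.0 / len(x) * A.T @ A
eigvals = np.linalg.eigvalsh(hessian)

# The unique minimizer is the least-squares solution; no descent needed
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("Hessian eigenvalues:", eigvals)    # all positive means a strictly convex bowl
print("least-squares minimum:", w_hat)    # close to the generating values (2, 1)
```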
🌋 A more realistic landscape — multiple minima, a saddle, and 4 optimizers fighting it out
The linear-regression bowl above is too easy. Real loss surfaces (and most things you’ll meet in MoAI) are non-convex: multiple local minima, saddle points, narrow valleys. Here’s a custom 2-parameter landscape built to expose the weaknesses of several algorithms at once:
4 Gaussian wells of different depths (one is the global min, three are traps of different severity)
A gentle background bowl that pulls everything toward the origin and raises barriers between the wells; the passes between neighbouring wells act as saddle points
We then run 4 optimizers from the same starting point and overlay their trajectories. You see at a glance which one gets stuck where, and why.
    # Pyodide / Obsidian Execute Code: install matplotlib first.
    import micropip
    await micropip.install("matplotlib")

    import numpy as np
    import random
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

    # ════════════════════════════════════════════════════════════════
    # 🎛️ Try changing these
    # ════════════════════════════════════════════════════════════════
    START = (-3.8, -3.5)   # starting point of all optimizers (in local A's basin)
    LR = 0.10              # learning rate for GD-family
    MOMENTUM = 0.92        # higher momentum so it can escape local A
    SA_T0 = 4.0            # initial temperature for SA (higher = more exploration)
    SA_COOL = 0.992        # slower cooling so SA has time to find global
    N_STEPS = 400
    SEED = 3
    # ════════════════════════════════════════════════════════════════

    # --- Define a multi-modal loss: 4 wells of varying depth + gentle background bowl ---
    WELLS = [
        # (x0, y0, depth, width): deeper = lower loss = better minimum
        (-2.5, -2.5, 5.0, 0.9),   # local A: moderate trap, NEAR the starting point
        ( 3.0, -1.0, 8.0, 0.8),   # GLOBAL: deepest, far away in a different quadrant
        (-1.8,  2.8, 3.0, 0.8),   # local B: shallow trap
        ( 1.5,  2.5, 2.5, 0.7),   # local C: very shallow
    ]
    GLOBAL_XY = (3.0, -1.0)

    def loss(x, y):
        z = 0.10 * (x**2 + y**2)   # background bowl (creates barriers between wells)
        for x0, y0, depth, width in WELLS:
            z -= depth * np.exp(-((x - x0)**2 + (y - y0)**2) / (2 * width**2))
        return z

    def grad(x, y, eps=1e-3):
        """Numerical gradient: works for any loss without manual derivation."""
        gx = (loss(x + eps, y) - loss(x - eps, y)) / (2 * eps)
        gy = (loss(x, y + eps) - loss(x, y - eps)) / (2 * eps)
        return gx, gy

    # --- Four optimizers, same starting point ---
    def run_gd(x0, y0, lr=LR, steps=N_STEPS):
        xs, ys = [x0], [y0]
        x, y = x0, y0
        for _ in range(steps):
            gx, gy = grad(x, y)
            x -= lr * gx
            y -= lr * gy
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_gd_momentum(x0, y0, lr=LR, mom=MOMENTUM, steps=N_STEPS):
        xs, ys = [x0], [y0]
        x, y = x0, y0
        vx, vy = 0.0, 0.0
        for _ in range(steps):
            gx, gy = grad(x, y)
            vx = mom * vx - lr * gx
            vy = mom * vy - lr * gy
            x += vx
            y += vy
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_sa(x0, y0, T0=SA_T0, cool=SA_COOL, steps=N_STEPS, step_size=0.7):
        """Bigger step size + slower cooling so SA can traverse the bowl."""
        random.seed(SEED)
        xs, ys = [x0], [y0]
        x, y = x0, y0
        T = T0
        for _ in range(steps):
            nx = x + random.gauss(0, step_size)
            ny = y + random.gauss(0, step_size)
            dE = loss(nx, ny) - loss(x, y)
            if dE < 0 or random.random() < np.exp(-dE / max(T, 1e-9)):
                x, y = nx, ny
            T *= cool
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_random_restart_gd(x0, y0, n_restarts=5, steps_per=N_STEPS // 5):
        """5 runs of GD; runs 2 to 5 start uniformly across the WHOLE landscape."""
        random.seed(SEED)
        all_xs, all_ys = [], []
        best_loss = np.inf
        best_xy = (x0, y0)
        for r in range(n_restarts):
            if r == 0:
                sx, sy = x0, y0                    # 1st run from given start
            else:
                sx = random.uniform(-4.5, 4.5)     # subsequent runs: sample globally
                sy = random.uniform(-4.5, 4.5)
            xs, ys = run_gd(sx, sy, steps=steps_per)
            all_xs.extend(xs)
            all_ys.extend(ys)
            final_loss = loss(xs[-1], ys[-1])
            if final_loss < best_loss:
                best_loss = final_loss
                best_xy = (xs[-1], ys[-1])
        return np.array(all_xs), np.array(all_ys), best_xy

    # Run all four
    gd_x, gd_y = run_gd(*START)
    gdm_x, gdm_y = run_gd_momentum(*START)
    sa_x, sa_y = run_sa(*START)
    rr_x, rr_y, rr_best = run_random_restart_gd(*START)

    # --- Plot: 3D surface + contour with trajectories ---
    xx = np.linspace(-5, 5, 200)
    yy = np.linspace(-5, 5, 200)
    XX, YY = np.meshgrid(xx, yy)
    ZZ = loss(XX, YY)

    fig = plt.figure(figsize=(15, 6.5))

    # Left: 3D surface in neutral 'bone_r' so colored trajectories will pop
    ax3d = fig.add_subplot(1, 2, 1, projection='3d')
    ax3d.plot_surface(XX, YY, ZZ, cmap='bone_r', alpha=0.85,
                      edgecolor='none', rstride=4, cstride=4)
    for x0, y0, depth, _ in WELLS:
        is_global = (x0, y0) == GLOBAL_XY
        ax3d.scatter(x0, y0, loss(x0, y0),
                     color='red' if is_global else 'orange',
                     s=120 if is_global else 70,
                     marker='*' if is_global else 'o',
                     edgecolor='black', linewidth=1.5, zorder=10)
    ax3d.set_xlabel('θ₁')
    ax3d.set_ylabel('θ₂')
    ax3d.set_zlabel('loss')
    ax3d.set_title('3D loss surface: 4 wells of varying depth')
    ax3d.view_init(elev=35, azim=-55)

    # Right: contour with all four optimizer trajectories
    ax = fig.add_subplot(1, 2, 2)
    # Background: light blue-to-yellow cmap with reduced contrast so trajectories dominate
    contour = ax.contourf(XX, YY, ZZ, levels=30, cmap='YlGnBu_r', alpha=0.55)
    ax.contour(XX, YY, ZZ, levels=20, colors='dimgray', linewidths=0.4, alpha=0.5)

    # Draw order matters: restarts + SA go down first, then momentum, then vanilla GD ON TOP
    # (vanilla GD overlaps with momentum at the start; the dashed line keeps both visible)
    ax.plot(rr_x, rr_y, 'o', color='#000000', ms=3.2,          # black dots for restarts
            label='Random-Restart GD (5 runs)', alpha=0.85,
            markeredgecolor='white', markeredgewidth=0.6, zorder=3)
    ax.plot(sa_x, sa_y, '-', color='#ff00ff', lw=1.4,          # magenta SA
            label='Simulated Annealing', alpha=0.85, zorder=4)
    ax.plot(gdm_x, gdm_y, '-', color='#1f77b4', lw=3.2,        # solid blue: momentum
            label='GD + momentum', alpha=0.95, zorder=5, solid_capstyle='round')
    ax.plot(gd_x, gd_y, '--', color='#d62728', lw=2.8,         # DASHED red on top: vanilla GD
            label='vanilla GD', alpha=1.0, zorder=6, dashes=(4, 3))

    # Mark endpoints of each optimizer with a big bordered marker
    def mark_end(ax, x, y, color, size=240):
        ax.scatter(x, y, color=color, s=size, marker='D',
                   edgecolor='white', linewidth=2.2, zorder=8)

    mark_end(ax, gd_x[-1], gd_y[-1], '#d62728')
    mark_end(ax, gdm_x[-1], gdm_y[-1], '#1f77b4')
    mark_end(ax, sa_x[-1], sa_y[-1], '#ff00ff')
    mark_end(ax, rr_best[0], rr_best[1], '#000000')

    # Mark wells: gold star = global min, hollow circles = local minima
    for x0, y0, depth, _ in WELLS:
        if (x0, y0) == GLOBAL_XY:
            ax.scatter(x0, y0, color='gold', s=550, marker='*',
                       edgecolor='black', linewidth=2.0, zorder=10, label='GLOBAL min')
        else:
            ax.scatter(x0, y0, facecolor='none', s=260, marker='o',
                       edgecolor='black', linewidth=1.8, zorder=9)
    ax.scatter(*START, color='lime', s=260, marker='X',
               edgecolor='black', linewidth=2.2, zorder=10, label='start')

    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.set_title('Same surface (top view): 4 optimizers compared\n'
                 '◇ diamond = where each ended up | ★ gold = global min | ○ = local minima')
    ax.legend(loc='upper left', fontsize=8.5, framealpha=0.92)
    plt.colorbar(contour, ax=ax, label='loss')
    plt.tight_layout()
    plt.show()

    # --- Report final results ---
    print("\n📊 FINAL LOSSES (from the same starting point)")
    print(f"  Vanilla GD          : final = ({gd_x[-1]:+.2f}, {gd_y[-1]:+.2f})  loss = {loss(gd_x[-1], gd_y[-1]):+.3f}")
    print(f"  GD + momentum       : final = ({gdm_x[-1]:+.2f}, {gdm_y[-1]:+.2f})  loss = {loss(gdm_x[-1], gdm_y[-1]):+.3f}")
    print(f"  Simulated Annealing : final = ({sa_x[-1]:+.2f}, {sa_y[-1]:+.2f})  loss = {loss(sa_x[-1], sa_y[-1]):+.3f}")
    print(f"  Random-Restart GD   : best  = ({rr_best[0]:+.2f}, {rr_best[1]:+.2f})  loss = {loss(*rr_best):+.3f}")
    print(f"\n  Global minimum is at {GLOBAL_XY}, loss ≈ {loss(*GLOBAL_XY):.3f}")
What you’ll typically see
| Optimizer | Where it ends up | Why |
|---|---|---|
| Vanilla GD 🔴 | Trapped in local A at (−2.5, −2.5) | Starts in local A’s basin and follows the negative gradient straight down. It has no mechanism to escape: once in a basin, it converges to that basin’s minimum no matter how shallow. |
| GD + momentum 🔵 | Escapes local A and usually reaches the global min at (3.0, −1.0) | Accumulated velocity (v ← 0.92·v − η·∇L) carries it over the low ridge separating local A from the global basin. Vanilla GD has no velocity, so it stops where the gradient vanishes at the bottom of A. |
| Simulated Annealing 🟣 | Wanders across the landscape, usually settles in the global basin | High initial T=4.0 → accepts uphill moves with high probability → climbs out of local A early; slow cooling (0.992) keeps exploration alive long enough to find the global basin, then refines inside it. |
| Random-Restart GD ⚫ | Usually finds the global min, provided at least one of the 5 runs starts in the global basin | The 1st run starts from (−3.8, −3.5) → falls into local A. Runs 2 to 5 are sampled uniformly from the whole [-4.5, 4.5]² square → a good chance that one lands near (3, −1) → that run converges to the global min and beats the others. |
Why this landscape exposes each algorithm’s weakness
GD’s blindness: starting in local A’s basin, the gradient points downhill INTO local A. GD has no concept of “global” — it only knows local slopes.
Momentum’s saving grace: the local-A minimum is shallow enough that velocity built up while descending into it carries the optimizer back UP and over the ridge toward the global basin. This is why momentum often works as a “free upgrade” over vanilla GD.
SA’s wandering: the trajectory looks chaotic in the contour plot — that’s the point. The randomness is what lets it explore globally, but it costs precision and lots of evaluations.
Random-restart’s cost: 5 restarts ≈ 5× the compute. If evaluating L is expensive (training a real network), this is prohibitive, although the runs are embarrassingly parallel.
The deeper lesson
No single algorithm is best on this landscape. The “right” choice depends on what’s available:
Gradient available + convex-ish → GD/SGD (fast, but myopic)
Gradient available + many local minima → GD with restart or momentum + warm starts
No gradient + rugged → SA, GA, or Bayesian Optimization
Expensive evaluations → BayesOpt (builds a model of the surface from few samples)
Vanilla GD and GD with momentum are the classical picture. Since about 2015, most deep learning training has run on adaptive optimizers that rescale the step for each parameter from running gradient statistics, although SGD with momentum remains common in computer-vision recipes. Per-parameter scaling is a large part of what makes GPT-scale training practical: one global learning-rate schedule still has to be tuned, but not a separate rate for each of billions of weights.
The 4 modern workhorses
| Optimizer | Year | The one-line idea | Where it dominates |
|---|---|---|---|
| Nesterov Accelerated Gradient (NAG) | 1983 | Look one step ahead before computing the gradient; gives momentum foresight | Convex problems with momentum (still used in many CV training recipes) |
| RMSprop | 2012 (Hinton’s Coursera lectures) | Divide gradient by a running RMS of recent gradients → per-parameter LR | RNNs, early deep nets (largely superseded by Adam) |
| Adam | 2014 (Kingma & Ba) | RMSprop + momentum + bias correction → the usual default | The most common default choice in deep learning since about 2015 |
| AdamW | 2017 (Loshchilov & Hutter) | Adam with decoupled weight decay (the decay is applied to the weights directly, not added to the gradient) | Common default for transformer training (LLaMA, for example, reports using AdamW) |
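The difference between Adam with L2 regularization and AdamW's decoupled weight decay is easiest to see side by side. The two step functions below are a sketch with illustrative names and default hyperparameters, not the API of any framework.

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """Adam + L2: the decay term is added to the gradient, so it gets rescaled too."""
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW: the decay shrinks the weights directly, outside the adaptive scaling."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w, m, v

w0 = np.array([1.0, -2.0])
zero = np.zeros(2)

# With a zero loss gradient, only the decay term acts
w_l2, _, _ = adam_l2_step(w0, zero, zero, zero, t=1)
w_aw, _, _ = adamw_step(w0, zero, zero, zero, t=1)

print("Adam + L2:", w_l2)   # every weight moves by about lr, whatever its size
print("AdamW    :", w_aw)   # every weight shrinks by the factor (1 - lr * wd)
```

With L2 folded into the gradient, Adam's normalization largely cancels the size of the decay term; decoupling keeps the shrinkage proportional to the weight itself.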
Code — gradient methods vs stochastic methods on a deep multi-well landscape
The landscape has 3 deep local minima plus 1 even deeper global minimum (depths 3.5, 4.0, 3.0 and 6.5; all genuine traps), on top of a gentle background bowl. The start is in the SW corner, local A lies in the same quadrant, and the global minimum is in the SE quadrant. We run 5 gradient optimizers (vanilla GD, Nesterov, RMSprop, Adam, AdamW) and 2 stochastic methods (Simulated Annealing, Random-Restart GD) and watch which ones get tricked.
    # Pyodide / Obsidian Execute Code: install matplotlib first.
    import micropip
    await micropip.install("matplotlib")

    import numpy as np
    import random
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

    # ════════════════════════════════════════════════════════════════
    # 🎛️ Try changing these
    # ════════════════════════════════════════════════════════════════
    # 4 DEEP wells: every one is a real trap, not a speed-bump.
    WELLS = [
        (-2.5, -2.5, 3.5, 1.0),   # local A: deep trap SW (same quadrant as start)
        ( 2.5,  2.5, 4.0, 1.1),   # local B: deep distractor NE
        (-2.5,  2.5, 3.0, 1.0),   # local C: deep distractor NW
        ( 2.5, -2.5, 6.5, 1.0),   # GLOBAL: deepest, in the SE quadrant
    ]
    GLOBAL_XY = (2.5, -2.5)
    START = (-4.0, -4.0)   # SW corner: falls into local A's basin
    N_STEPS = 400
    LR = 0.20
    B1 = 0.95              # high momentum: won't help against deep wells
    SA_T0 = 4.0            # SA: initial temperature
    SA_COOL = 0.992
    SA_SEED = 1            # SA seed that reliably reaches global
    RR_RESTART = 5         # number of random-restart runs
    RR_SEED = 0
    # ════════════════════════════════════════════════════════════════

    def loss(x, y):
        z = 0.04 * (x**2 + y**2)   # gentle background bowl
        for x0, y0, depth, w in WELLS:
            z -= depth * np.exp(-((x - x0)**2 + (y - y0)**2) / (2 * w**2))
        return z

    def grad(x, y, eps=1e-3):
        gx = (loss(x + eps, y) - loss(x - eps, y)) / (2 * eps)
        gy = (loss(x, y + eps) - loss(x, y - eps)) / (2 * eps)
        return gx, gy

    # ---- Gradient-based optimizers (all expected to get tricked) ----
    def run_gd(lr=LR, steps=N_STEPS):
        x, y = START
        xs, ys = [x], [y]
        for _ in range(steps):
            gx, gy = grad(x, y)
            x -= lr * gx
            y -= lr * gy
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_nesterov(lr=0.15, mom=B1, steps=N_STEPS):
        x, y = START
        xs, ys = [x], [y]
        vx, vy = 0.0, 0.0
        for _ in range(steps):
            gx, gy = grad(x + mom * vx, y + mom * vy)   # gradient at the look-ahead point
            vx = mom * vx - lr * gx
            vy = mom * vy - lr * gy
            x += vx
            y += vy
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_rmsprop(lr=LR, beta=0.9, eps=1e-8, steps=N_STEPS):
        x, y = START
        xs, ys = [x], [y]
        s_gx, s_gy = 0.0, 0.0
        for _ in range(steps):
            gx, gy = grad(x, y)
            s_gx = beta * s_gx + (1 - beta) * gx**2
            s_gy = beta * s_gy + (1 - beta) * gy**2
            x -= lr * gx / (np.sqrt(s_gx) + eps)
            y -= lr * gy / (np.sqrt(s_gy) + eps)
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_adam(lr=LR, b1=B1, b2=0.999, eps=1e-8, steps=N_STEPS):
        x, y = START
        xs, ys = [x], [y]
        mx, my, vx, vy = 0.0, 0.0, 0.0, 0.0
        for t in range(1, steps + 1):
            gx, gy = grad(x, y)
            mx = b1 * mx + (1 - b1) * gx
            my = b1 * my + (1 - b1) * gy
            vx = b2 * vx + (1 - b2) * gx**2
            vy = b2 * vy + (1 - b2) * gy**2
            mx_h = mx / (1 - b1**t)
            my_h = my / (1 - b1**t)
            vx_h = vx / (1 - b2**t)
            vy_h = vy / (1 - b2**t)
            x -= lr * mx_h / (np.sqrt(vx_h) + eps)
            y -= lr * my_h / (np.sqrt(vy_h) + eps)
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_adamw(lr=LR, b1=B1, b2=0.999, eps=1e-8, wd=0.003, steps=N_STEPS):
        x, y = START
        xs, ys = [x], [y]
        mx, my, vx, vy = 0.0, 0.0, 0.0, 0.0
        for t in range(1, steps + 1):
            gx, gy = grad(x, y)
            mx = b1 * mx + (1 - b1) * gx
            my = b1 * my + (1 - b1) * gy
            vx = b2 * vx + (1 - b2) * gx**2
            vy = b2 * vy + (1 - b2) * gy**2
            mx_h = mx / (1 - b1**t)
            my_h = my / (1 - b1**t)
            vx_h = vx / (1 - b2**t)
            vy_h = vy / (1 - b2**t)
            # Decoupled weight decay: applied to the weights, not added to the gradient
            x = x - lr * mx_h / (np.sqrt(vx_h) + eps) - lr * wd * x
            y = y - lr * my_h / (np.sqrt(vy_h) + eps) - lr * wd * y
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    # ---- Stochastic methods (these CAN escape deep wells) ----
    def run_sa(T0=SA_T0, cool=SA_COOL, steps=N_STEPS, step_size=0.7, seed=SA_SEED):
        random.seed(seed)
        x, y = START
        xs, ys = [x], [y]
        T = T0
        for _ in range(steps):
            nx = x + random.gauss(0, step_size)
            ny = y + random.gauss(0, step_size)
            dE = loss(nx, ny) - loss(x, y)
            if dE < 0 or random.random() < np.exp(-dE / max(T, 1e-9)):
                x, y = nx, ny
            T *= cool
            xs.append(x)
            ys.append(y)
        return np.array(xs), np.array(ys)

    def run_random_restart(n_restarts=RR_RESTART, steps_per=N_STEPS // RR_RESTART,
                           lr=LR, seed=RR_SEED):
        random.seed(seed)
        all_xs, all_ys = [START[0]], [START[1]]
        best_loss = np.inf
        best_xy = START
        for r in range(n_restarts):
            sx, sy = (START if r == 0 else
                      (random.uniform(-4.5, 4.5), random.uniform(-4.5, 4.5)))
            x, y = sx, sy
            for _ in range(steps_per):
                gx, gy = grad(x, y)
                x -= lr * gx
                y -= lr * gy
                all_xs.append(x)
                all_ys.append(y)
            if loss(x, y) < best_loss:
                best_loss = loss(x, y)
                best_xy = (x, y)
        return np.array(all_xs), np.array(all_ys), best_xy

    # Run all 7
    gd_x, gd_y = run_gd()
    rms_x, rms_y = run_rmsprop()
    nes_x, nes_y = run_nesterov()
    adm_x, adm_y = run_adam()
    adw_x, adw_y = run_adamw()
    sa_x, sa_y = run_sa()
    rr_x, rr_y, rr_best = run_random_restart()

    # --- Plot: 3D surface + 2D contour with all trajectories ---
    xx = np.linspace(-5, 5, 220)
    yy = np.linspace(-5, 5, 220)
    XX, YY = np.meshgrid(xx, yy)
    ZZ = loss(XX, YY)

    fig = plt.figure(figsize=(15.5, 7))

    # Left: 3D surface
    ax3d = fig.add_subplot(1, 2, 1, projection='3d')
    ax3d.plot_surface(XX, YY, ZZ, cmap='bone_r', alpha=0.85,
                      edgecolor='none', rstride=4, cstride=4)
    for x0, y0, depth, _ in WELLS:
        is_g = (x0, y0) == GLOBAL_XY
        ax3d.scatter(x0, y0, loss(x0, y0),
                     color='red' if is_g else 'orange',
                     s=140 if is_g else 80,
                     marker='*' if is_g else 'o',
                     edgecolor='black', linewidth=1.5, zorder=10)
    ax3d.set_xlabel('θ₁')
    ax3d.set_ylabel('θ₂')
    ax3d.set_zlabel('loss')
    ax3d.set_title('3D loss surface: 4 deep wells, GLOBAL in the SE quadrant\n'
                   '(★ red = global at (2.5, -2.5))')
    ax3d.view_init(elev=38, azim=-58)

    # Right: contour + trajectories
    ax = fig.add_subplot(1, 2, 2)
    contour = ax.contourf(XX, YY, ZZ, levels=28, cmap='YlGnBu_r', alpha=0.55)
    ax.contour(XX, YY, ZZ, levels=18, colors='dimgray', linewidths=0.4, alpha=0.5)

    # All 5 gradient methods cluster in local A → use DISTINCT linestyles so each
    # is recognizable even when paths overlap. End-diamonds get a small offset
    # arranged in a cross pattern around the actual endpoint.
    grad_methods = [
        # (name, xs, ys, color, linestyle, linewidth, diamond-offset)
        ('Vanilla GD', gd_x,  gd_y,  '#404040', '-',                3.2, ( 0.00,  0.00)),
        ('RMSprop',    rms_x, rms_y, '#9467bd', (0, (6, 3)),        2.4, ( 0.30,  0.00)),
        ('Nesterov',   nes_x, nes_y, '#ff7f00', (0, (1, 1.5)),      3.0, (-0.30,  0.00)),
        ('Adam',       adm_x, adm_y, '#d62728', (0, (3, 1, 1, 1)),  2.4, ( 0.00,  0.30)),
        ('AdamW',      adw_x, adw_y, '#2ca02c', (0, (5, 2, 1, 2)),  2.4, ( 0.00, -0.30)),
    ]
    for i, (name, xs, ys, color, style, lw, _) in enumerate(grad_methods):
        ax.plot(xs, ys, color=color, lw=lw, alpha=0.80, linestyle=style,
                label=f'{name} → local A', zorder=4 + i * 0.1)

    # Draw endpoint diamonds slightly offset so all 5 are individually visible
    for name, xs, ys, color, _, _, (dx, dy) in grad_methods:
        ax.scatter(xs[-1] + dx, ys[-1] + dy, color=color, s=200, marker='D',
                   edgecolor='white', linewidth=1.8, zorder=9)

    # Stochastic methods → escape to global (drawn ON TOP, distinct styles)
    ax.plot(sa_x, sa_y, '-', color='magenta', lw=1.8, alpha=0.80,
            label='Simulated Annealing → GLOBAL', zorder=6)
    ax.scatter(sa_x[-1], sa_y[-1], color='magenta', s=260, marker='D',
               edgecolor='white', linewidth=2.2, zorder=10)
    ax.plot(rr_x, rr_y, 'o', color='black', ms=2.8, alpha=0.85,
            label='Random-Restart GD → GLOBAL', zorder=5,
            markeredgecolor='white', markeredgewidth=0.5)
    ax.scatter(rr_best[0], rr_best[1], color='black', s=280, marker='D',
               edgecolor='white', linewidth=2.4, zorder=10)

    # Wells: gold star for global, hollow black rings + labels for locals
    for (x0, y0, depth, _), name in zip(WELLS, ['A', 'B', 'C', 'GLOBAL']):
        if name == 'GLOBAL':
            ax.scatter(x0, y0, color='gold', s=580, marker='*',
                       edgecolor='black', linewidth=2.0, zorder=11, label='GLOBAL min')
        else:
            ax.scatter(x0, y0, facecolor='none', s=280, marker='o',
                       edgecolor='black', linewidth=2.0, zorder=10)
            ax.annotate(name, (x0, y0), textcoords='offset points', xytext=(12, 10),
                        fontsize=11, fontweight='bold', color='black')
    ax.scatter(*START, color='lime', s=280, marker='X',
               edgecolor='black', linewidth=2.4, zorder=11, label='start')

    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.set_title('Same surface (top view): 5 gradient methods trapped, 2 stochastic methods escape\n'
                 '◇ diamond = endpoint | ★ gold = global | ○ = local minima')
    ax.set_xlim(-5, 5)
    ax.set_ylim(-5, 5)            # LOCK axes to landscape extent
    ax.set_aspect('equal')        # square plot
    ax.legend(loc='upper left', fontsize=7.5, framealpha=0.95, ncol=1)
    plt.colorbar(contour, ax=ax, label='loss')
    plt.tight_layout()
    plt.show()

    # --- Numeric report ---
    well_names = ['A', 'B', 'C', 'GLOBAL']

    def nearest_well(end):
        dists = [np.hypot(end[0] - w[0], end[1] - w[1]) for w in WELLS]
        return well_names[int(np.argmin(dists))]

    print("\n📊 FINAL POSITIONS: gradient methods vs stochastic methods")
    print(f"{'method':22s} {'endpoint':>18s} {'loss':>10s} {'→ well':>10s}")
    print("-" * 65)
    # Each tuple has 7 fields; only the first three are needed here
    for name, xs, ys, *_ in grad_methods:
        end = (xs[-1], ys[-1])
        print(f"  {name:20s}: ({end[0]:+.2f}, {end[1]:+.2f})  loss = {loss(*end):+.3f}  → {nearest_well(end)}")
    print("-" * 65)
    print(f"  {'Simulated Annealing':20s}: ({sa_x[-1]:+.2f}, {sa_y[-1]:+.2f})  loss = {loss(sa_x[-1], sa_y[-1]):+.3f}  → {nearest_well((sa_x[-1], sa_y[-1]))}")
    print(f"  {'Random-Restart GD':20s}: ({rr_best[0]:+.2f}, {rr_best[1]:+.2f})  loss = {loss(*rr_best):+.3f}  → {nearest_well(rr_best)}")
    print(f"\n  Global at {GLOBAL_XY}, loss = {loss(*GLOBAL_XY):.3f}")
    print(f"  Local A at (-2.5, -2.5), loss = {loss(-2.5, -2.5):.3f}")
What this shows — the honest verdict on adaptive optimizers
| Method | Outcome | Why |
|---|---|---|
| Vanilla GD ⚫ | Trapped in local A | Follows gradient straight down into A. No escape mechanism. |
| RMSprop 🟣 | Trapped in local A | Per-parameter LR, no momentum. Once in A’s basin, all gradients point inward → stuck. |
| Nesterov 🟠 | Trapped in local A | Momentum builds toward A, dies at A’s bottom. Not enough to clear a depth-3.5 basin. |
| Adam 🔴 | Trapped in local A | Adam’s momentum + per-param LR aren’t magic: once the gradient is zero at A’s bottom and recent gradients all point back to A, Adam stops. |
| AdamW 🟢 | Trapped in local A | Same as Adam. Weight decay nudges it slightly toward origin but doesn’t help escape. |
| Simulated Annealing 🟪 | Escapes → GLOBAL | Accepts uphill moves probabilistically (exp(-ΔE/T)) → climbs out of A early while T is high, eventually settles in global basin. |
| Random-Restart GD ⬛ | Escapes → GLOBAL | 1st run trapped in A; runs 2 to 5 sample new random points across the whole landscape → at least one starts in global’s basin → wins. |
The brutal lesson: adaptive optimizers do NOT solve the local-minima problem
Look at the contour plot: all five colored diamonds cluster around the same spot inside local A (they are drawn with a small offset so each stays visible). Adam, AdamW, Nesterov, RMSprop and vanilla GD all end up at essentially the same point. On a deterministic problem like this one, with exact gradients and a fixed start, none of them can climb out of a genuinely deep local minimum. Their advantage over vanilla GD lies in ill-conditioned, convex-ish problems, not in multi-modal landscapes.
The only things that escape deep locals:
Stochasticity in the gradient (mini-batch SGD’s noise, which we don’t simulate here)
Stochasticity in the search (Simulated Annealing’s random uphill moves)
Stochasticity in the initialization (Random-Restart, the canonical fix)
Population diversity (Genetic Algorithms — different chromosomes start in different basins)
This is why real LLM training combines Adam (for ill-conditioning) + mini-batch SGD noise (for escaping locals) + random initialization (for landing in different basins). No single mechanism solves both problems.
The contour plot's 3 stories in one image
Tight cluster of 5 colored diamonds in local A = every gradient method, regardless of momentum/adaptivity, gets fooled identically.
Magenta SA trajectory wanders chaotically across the whole map then settles in global = randomness as the escape mechanism.
Black scattered dots from Random-Restart sampling the whole space = brute-force diversity beats clever gradient tricks on multi-modal problems.
Newer & experimental (2023–2025)
These mostly haven’t displaced AdamW yet, but appear in recent papers:
Lion (Google, 2023): uses only the sign of a momentum-smoothed gradient. It stores one state buffer per parameter (momentum) where Adam stores two, so optimizer-state memory is roughly halved. Competitive on vision; mixed results on LLMs.
Sophia (Stanford, 2023): second-order optimizer using a Hessian diagonal estimate. ~2× speedup claimed on GPT-2 scale; not yet adopted at production scale.
Shampoo / Distributed Shampoo (Google): approximates a full-matrix preconditioner with Kronecker factors. Used internally at Google for some training runs.
Muon (Keller Jordan, 2024) — orthogonalized momentum via Newton-Schulz iteration. Held a brief speed-of-training record on the nanoGPT benchmark in late 2024.
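As an illustration of how small a sign-based update is, here is a sketch of one Lion step following the update rule in the Lion paper (interpolate, take the sign, then refresh the momentum). The function name and the hyperparameter values are illustrative only.

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """One Lion update: the step direction is only the SIGN of a momentum/gradient blend."""
    c = b1 * m + (1 - b1) * g            # blend used for the update direction
    w_new = w - lr * (np.sign(c) + wd * w)
    m_new = b2 * m + (1 - b2) * g        # momentum is the single state buffer kept
    return w_new, m_new

w = np.array([1.0, -1.0])
g = np.array([0.5, -3.0])                # very different gradient magnitudes
m = np.zeros(2)

w_new, m_new = lion_step(w, g, m, lr=0.1)
print(w_new)   # both coordinates move by exactly lr, regardless of |g|
print(m_new)
```

Because every coordinate moves by the same magnitude, Lion is typically run with a smaller learning rate than Adam; that is a tuning note from the paper, not something this sketch demonstrates.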
The pattern is the same as every other corner of MoAI: AdamW is the unkillable default, occasionally challenged but never displaced. Backprop computes the gradient; AdamW (or a recent variant) updates the weights. That’s modern deep learning in one sentence.
🌪️ Why real loss surfaces are weird (and high-D matters)
Real neural net loss surfaces have properties our 3D intuition gets wrong:
| Property | Intuition (from 3D) | Reality (high-D) |
|---|---|---|
| Local minima | The big problem | Surprisingly rare: most “stuck” points are saddle points, not minima |
| Saddle points | Curiosity | Dominant feature: exponentially more saddles than minima in high-D |
| Flat vs. sharp minima | | Flat minima are reported to generalize better than sharp ones (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) |
| Connected minima | Each min isolated | “Mode connectivity”: minima of large nets are connected by low-loss paths (Garipov et al., 2018) |
| Symmetry | Each setting is unique | Many parameter settings give identical loss (permute hidden units → same network) |
This is why gradient descent on million-parameter networks actually works — the high-dimensional structure means saddle points (which gradient descent escapes) are the obstacle, not local minima (which it cannot).
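The difference between a saddle and a minimum shows up in the Hessian's eigenvalues, and a tiny perturbation is enough for gradient descent to slide off a saddle. A sketch on the textbook saddle f(x, y) = x² − y² (chosen for illustration; it is unbounded below, so "escaping" here just means the loss keeps dropping):

```python
import numpy as np

def grad_f(p):
    """Gradient of f(x, y) = x^2 - y^2, which has a saddle at the origin."""
    x, y = p
    return np.array([2 * x, -2 * y])

# Constant Hessian of f: one negative and one positive eigenvalue marks a saddle
hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(hessian)
print("Hessian eigenvalues:", eigvals)

# Start almost exactly on the line that leads straight into the saddle
p = np.array([1.0, 1e-6])
for _ in range(100):
    p = p - 0.1 * grad_f(p)

print("after 100 GD steps:", p)   # x has collapsed toward 0, |y| has grown
```

Along x the iterate contracts toward the saddle; along y the tiny offset is multiplied by 1.2 at every step. In a minimum all eigenvalues are positive and no such escape direction exists.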
🔭 How real-world loss landscapes are visualized
For a real neural network you cannot draw ℝ^(N+1) — but you can:
2D slice along 2 random directions (Li et al. 2018, Visualizing the Loss Landscape of Neural Nets):
Pick two random direction vectors δ₁, δ₂ in parameter space
Normalize them to match the scale of weights (“filter normalization”)
Plot L(θ* + α·δ₁ + β·δ₂) over a grid
Beautiful pictures showing how skip connections (ResNet) flatten the landscape compared to plain CNNs
PCA over training trajectory: Save θ every epoch, do PCA on the trajectory, plot loss in the top-2 PC plane. Shows how training actually moves through the landscape.
Linear interpolation between solutions: Train two networks → linearly interpolate their weights → plot loss along the interpolation. Used to study mode connectivity.
These visualizations are always 2D slices of a fundamentally higher-dimensional object. Useful for intuition — never the full picture.
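The linear-interpolation probe is the simplest of the three to sketch. The two "solutions" below are the two minima of a made-up double-well loss, standing in for two trained networks; with real models, `theta_a` and `theta_b` would be flattened weight vectors.

```python
import numpy as np

def loss(theta):
    """Toy double well: minima at (-1, 0) and (+1, 0), a barrier in between."""
    return (theta[0] ** 2 - 1.0) ** 2 + theta[1] ** 2

theta_a = np.array([-1.0, 0.0])   # "solution A"
theta_b = np.array([1.0, 0.0])    # "solution B"

# Loss along the straight line (1 - t) * theta_a + t * theta_b
ts = np.linspace(0.0, 1.0, 11)
path_losses = np.array([loss((1 - t) * theta_a + t * theta_b) for t in ts])

# Barrier height: how far the path rises above the worse of the two endpoints
barrier = path_losses.max() - max(path_losses[0], path_losses[-1])

for t, value in zip(ts, path_losses):
    print(f"t = {t:.1f}  loss = {value:.3f}")
print("barrier height:", barrier)
```

A barrier near zero along some path (not necessarily the straight one) is what the mode-connectivity results report for large networks; this toy shows the opposite case of two clearly separated basins.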
🎓 Connection to MoAI algorithms
Every search/optimization algorithm in MoAI operates on some notion of a loss/fitness/value surface:
Hill Climbing — myopic local probing; gets stuck on first ridge
Simulated Annealing — accepts uphill moves with exp(−Δ/T) to escape local minima
Local Beam Search — k parallel probes; “fitness landscape” terminology
Genetic Algorithms — population samples spread across the surface; selection + crossover combine samples from different regions
Gradient Descent — uses the gradient (local slope) to descend fastest
Gradient Backpropagation — the computational technique that gives you ∇L(θ) cheaply for neural nets
“Why does algorithm X get stuck?” almost always reduces to “X cannot see the global shape of the surface — only its local neighborhood.” The fixes (restart, temperature, beams, momentum, second-order info) are all ways of using slightly more global information without paying the full cost of mapping the entire surface.
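The contrast between hill climbing and simulated annealing comes down to a single acceptance rule. A sketch for a minimization problem, with illustrative function names and Δ defined as the change in loss of the proposed move:

```python
import math
import random

def accept_hill_climbing(delta):
    """Hill climbing (minimizing): accept only moves that lower the loss."""
    return delta < 0

def accept_annealing(delta, temperature, rng=random):
    """Simulated annealing: always accept improvements; accept a worse move
    with probability exp(-delta / T)."""
    if delta < 0:
        return True
    return rng.random() < math.exp(-delta / temperature)

rng = random.Random(0)
trials = 10_000
for T in (4.0, 1.0, 0.1):
    accepted = sum(accept_annealing(1.0, T, rng) for _ in range(trials))
    print(f"T = {T}: accepted {accepted / trials:.2f} of uphill moves with delta = 1 "
          f"(theory {math.exp(-1.0 / T):.2f})")

print("hill climbing accepts the same uphill move:", accept_hill_climbing(1.0))
```

At high temperature almost any uphill move passes, which is the "slightly more global information" bought with randomness; as T falls, the rule converges to hill climbing.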
🪤 Common misconceptions
"SGD builds the loss surface from training samples"
No. SGD evaluates the loss at exactly one point (current θ) on a mini-batch, computes the gradient at that point, takes a step. It never accumulates a model of the surface. The training trajectory through the surface is what we sometimes visualize — but the surface itself is the mathematical object defined by the dataset and the model.
"The loss is 3D because I see 3D pictures of it"
The pictures are 2D slices through high-D surfaces. Real neural net losses live in millions to billions of dimensions.
"Local minima are why deep learning is hard"
In high dimensions, saddle points dominate. Momentum and adaptive optimizers such as Adam help the iterate move through the flat regions around saddles; they do not climb out of a true minimum.
"More parameters → more local minima → harder to optimize"
Counterintuitively, wider networks often have easier loss landscapes: over-parameterization tends to flatten and connect basins (see the over-parameterization and mode connectivity literature).