Loss Surface (Loss Landscape)

methods-of-ai

The loss surface is the graph of the loss function L(θ) over all possible parameter values θ. It is a purely mathematical object — it exists implicitly the moment you define a model + loss + dataset. Algorithms like Gradient Descent, Hill Climbing, or Simulated Annealing don’t “build” the surface; they probe it locally by evaluating L at one point at a time.

This single confusion (“does SGD build the loss surface?”) trips up almost everyone. Short answer: no. The surface is the mathematical object you’re trying to descend; the algorithm only ever sees a thin slice — usually just one point and its local gradient.

🎯 The 6 questions, answered directly

1. What IS the loss surface?

It’s the function L: ℝᴺ → ℝ that maps parameter vector θ ∈ ℝᴺ to a scalar loss value. The “surface” is the graph of this function — the set of points (θ, L(θ)) in ℝᴺ⁺¹.

For a linear regression with 2 weights (w₁, w₂) and MSE loss, this is literally a 3D surface (2D parameter plane + 1D loss = 3D plot). For deep nets it’s not a “surface” you can draw — but the math is identical.
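
To make the "graph of a function" idea concrete, here is a minimal sketch that evaluates L(θ) at a few hand-picked parameter vectors for a 2-weight linear model with MSE. The toy data and the name `mse_loss` are illustrative assumptions, not part of any library.

```python
# Toy data generated by y = 2x + 1 without noise (chosen for illustration only).
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [2.0 * x + 1.0 for x in xs]

def mse_loss(w1, w2):
    """L(theta) for theta = (w1, w2): mean squared error of y_hat = w1*x + w2."""
    return sum((w1 * x + w2 - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Each triple (w1, w2, L) is one point on the loss surface in R^3.
surface_points = [(w1, w2, mse_loss(w1, w2))
                  for (w1, w2) in [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)]]
for point in surface_points:
    print(point)
```

The surface is the set of all such triples; the loop only looks up three of them.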

2. How high-dimensional is it?

Dimension = number of trainable parameters + 1. The +1 is the loss value (the “height”).

Model	Parameters	Loss surface lives in
Linear regression (1 feature + bias)	2	3D (drawable)
Small MLP (e.g. 100 weights)	100	101D
ResNet-50	~25 million	~25,000,001D
GPT-3	175 billion	~175,000,000,001D
GPT-4 (unconfirmed public estimate)	~1.8 trillion	~1.8 × 10¹²D

Humans can visualize at most 3D. Everything else is mathematics + projections.
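
The "parameters + 1" rule can be checked with a few lines of arithmetic. The sketch below counts weights and biases of a fully connected network; the helper names and the 784-128-10 layer sizes are hypothetical examples.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected net with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def surface_dimension(n_params):
    """One axis per parameter, plus one axis for the loss value."""
    return n_params + 1

n = mlp_param_count([1, 1])            # linear regression: 1 weight + 1 bias
print(n, surface_dimension(n))

m = mlp_param_count([784, 128, 10])    # hypothetical small image classifier
print(m, surface_dimension(m))
```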

3. Is it 3D?

Only for toy models with exactly 2 parameters. Every 3D loss landscape picture you’ve ever seen (the rolling hills, the canyons, the saddle points) is either:

a real 2-parameter problem (regression on synthetic data), OR

a 2D slice of a high-dimensional surface — typically by picking 2 random directions δ₁, δ₂ in parameter space and plotting L(θ* + α·δ₁ + β·δ₂) over a grid of (α, β). The famous Li et al. 2018 paper “Visualizing the Loss Landscape of Neural Nets” uses this trick with “filter normalization” so the visualization is meaningful.

⚠️ Trap: these 3D plots can be deeply misleading — the true high-D surface has properties (like the prevalence of saddle points over minima) that don’t show up in a random 2D slice.
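
The slicing trick described above can be sketched in plain Python. This is a simplified version under stated assumptions: the 10-parameter `loss` is a made-up stand-in, the directions are plain random unit vectors, and the filter normalization from Li et al. is not implemented.

```python
import math
import random

N = 10  # a hypothetical 10-parameter model; too many axes to plot directly

def loss(theta):
    """Stand-in loss (an assumption): a bowl plus a cosine ripple per coordinate."""
    return sum(t * t + 0.5 * (1.0 - math.cos(3.0 * t)) for t in theta)

def random_unit_direction(rng, n):
    """Gaussian direction rescaled to unit length."""
    d = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]

def slice_2d(theta_star, d1, d2, alphas, betas):
    """Grid of L(theta* + a*d1 + b*d2); rows follow betas, columns follow alphas."""
    return [[loss([t + a * u + b * v for t, u, v in zip(theta_star, d1, d2)])
             for a in alphas] for b in betas]

rng = random.Random(0)
theta_star = [0.0] * N                       # the minimum of the stand-in loss
d1 = random_unit_direction(rng, N)
d2 = random_unit_direction(rng, N)
ticks = [i / 10.0 for i in range(-10, 11)]   # -1.0 ... 1.0 in steps of 0.1
grid = slice_2d(theta_star, d1, d2, ticks, ticks)
print(len(grid), len(grid[0]), grid[10][10])
```

The resulting 21 × 21 grid is what a contour or surface plot would display. The other 8 directions of parameter space are simply not visible in it, which is the source of the trap.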

4. How is it generated?

It isn’t generated — that’s the key insight. The surface is implicitly defined by:

L(θ) = (1/N) · Σᵢ ℓ(f_θ(xᵢ), yᵢ)

for your dataset {(xᵢ, yᵢ)} and your loss function ℓ. Once you fix the dataset and the model architecture, every possible θ already has a loss value — you just don’t know what it is until you evaluate.

To visualize a slice, you sample: pick a grid of θ values, evaluate L(θ) at each one, plot. That’s how the pretty 3D pictures get made — expensive grid evaluation. Nobody can do this for a real neural net (175B-dim grid is infeasible) — so we make do with 2D slices.
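
A quick calculation shows why full grids are out of reach. The helper name `grid_evaluations` is an illustrative assumption; the 80 points per axis match the resolution a 2-parameter plot might use.

```python
def grid_evaluations(n_params, points_per_axis):
    """Number of loss evaluations needed for a full grid over every parameter."""
    return points_per_axis ** n_params

print(grid_evaluations(2, 80))               # an 80 x 80 grid for a 2-parameter plot
print(grid_evaluations(10, 80))              # already about 10^19 evaluations
print(len(str(grid_evaluations(100, 80))))   # number of decimal digits for 100 params
```

The count grows exponentially in the number of parameters, so even a 100-weight MLP is far beyond exhaustive evaluation.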

5. How do algorithms work on it?

They don’t see the whole surface. They probe locally and step:

Algorithm	What it queries at each step
Gradient Descent / SGD	One point’s value `L(θ)` + its gradient `∇L(θ)` (computed via Gradient Backpropagation)
Hill Climbing	Values at one point + its `N` neighbors
Simulated Annealing	Value at one point + one random neighbor
Genetic Algorithms	Values at the current population (k points)
Local Beam Search	Values at the current `k` beams’ `b` successors
Bayesian Optimization	Past observations → fits a surrogate model of the surface → picks the most informative next point

Crucially: none of these algorithms “knows” the global shape of the surface. They make local decisions hoping that local improvements lead somewhere good globally. That’s why local optima are such a problem.
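
The "what it queries" column can be made visible by wrapping a loss in a counter. The sketch below is an illustration under assumptions: `CountingLoss`, `bowl`, and the 4-neighbor hill-climbing step are hypothetical helpers, not a reference implementation of either algorithm.

```python
import math
import random

class CountingLoss:
    """Wraps a loss function and records every point it is asked about."""
    def __init__(self, fn):
        self.fn = fn
        self.queries = []

    def __call__(self, x, y):
        self.queries.append((x, y))
        return self.fn(x, y)

def bowl(x, y):
    return x * x + 2.0 * y * y          # hypothetical toy loss

def hill_climb_step(loss, x, y, step=0.1):
    """Evaluate the current point and its 4 axis neighbors; move to the best."""
    candidates = [(x, y), (x + step, y), (x - step, y), (x, y + step), (x, y - step)]
    return min(candidates, key=lambda p: loss(*p))

def annealing_step(loss, x, y, temperature, rng, step=0.1):
    """Evaluate the current point and ONE random neighbor; accept by the Metropolis rule."""
    nx, ny = x + rng.gauss(0.0, step), y + rng.gauss(0.0, step)
    delta = loss(nx, ny) - loss(x, y)
    if delta < 0 or rng.random() < math.exp(-delta / temperature):
        return nx, ny
    return x, y

hc_loss = CountingLoss(bowl)
hc_next = hill_climb_step(hc_loss, 1.0, 1.0)
sa_loss = CountingLoss(bowl)
sa_next = annealing_step(sa_loss, 1.0, 1.0, temperature=1.0, rng=random.Random(0))
print(len(hc_loss.queries), hc_next)
print(len(sa_loss.queries), sa_next)
```

Each step touches only a handful of points near the current position; nothing outside `queries` is ever seen by the algorithm.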

6. Does the algorithm build the surface from samples it has seen?

No — with one exception.

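GD, SGD, HC, SA, GA, LBS, RL: none of them builds a model of the surface. Each step they query L at new points, decide, and move on. Whatever state they carry (a population, k beams, a momentum buffer) is a handful of current points or a running gradient average, not a fitted approximation of L.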

Bayesian Optimization is the exception — it builds a surrogate (usually a Gaussian Process) over all observed (θ, L(θ)) pairs, and uses it to predict where the next best evaluation would be. This is precisely why BayesOpt is sample-efficient — it remembers and interpolates.

The trade-off: GD/SGD do millions of cheap local steps; BayesOpt does dozens of expensive informed steps. Choose based on whether evaluating L is cheap (millisecond) or expensive (overnight training run).
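
As a rough illustration of "remembers and interpolates", the sketch below fits a Gaussian Process to four observations of a 1-parameter loss and picks the next point with a lower-confidence-bound rule. Everything here is an assumption made for the example: the RBF kernel with unit length scale, the zero prior mean, the stand-in `expensive_loss`, and the factor 2.0 in the acquisition rule. Real BayesOpt libraries differ in many details.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs)
    mean = k_qo @ np.linalg.solve(k_oo, y_obs)
    # Prior variance is 1 for this kernel; subtract what the data explains.
    var = 1.0 - np.sum(k_qo * np.linalg.solve(k_oo, k_qo.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def expensive_loss(theta):
    """Stand-in for an expensive evaluation (an assumption for this sketch)."""
    return np.sin(3.0 * theta) + 0.5 * theta ** 2

x_obs = np.array([-2.0, -0.5, 1.0, 2.0])      # every point evaluated so far
y_obs = expensive_loss(x_obs)
x_query = np.linspace(-2.5, 2.5, 101)
mean, std = gp_posterior(x_obs, y_obs, x_query)

# Lower confidence bound: prefer low predicted loss and high uncertainty.
lcb = mean - 2.0 * std
next_theta = x_query[np.argmin(lcb)]
print("next point to evaluate:", next_theta)
```

Unlike the local methods, the surrogate uses all past (θ, L(θ)) pairs at once: the posterior mean passes through the observations and the standard deviation grows away from them.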

🖼️ See it in code — a real 3D loss surface

A 2-parameter linear regression y = w₁·x + w₂ with MSE loss has a 3D loss surface you can actually draw.

# Pyodide / Obsidian Execute Code: install matplotlib first.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa — registers 3D projection
 
# --- Toy dataset: y = 2x + 1 with noise ---
np.random.seed(0)
x = np.linspace(-2, 2, 30)
y_true = 2.0 * x + 1.0 + np.random.normal(0, 0.3, x.shape)
 
# --- Define MSE loss over (w1, w2) ---
def loss(w1, w2):
    y_pred = w1 * x + w2
    return np.mean((y_pred - y_true) ** 2)
 
# --- Evaluate L on a grid of (w1, w2) ---
W1 = np.linspace(-1, 5, 80)
W2 = np.linspace(-2, 4, 80)
WW1, WW2 = np.meshgrid(W1, W2)
L = np.vectorize(loss)(WW1, WW2)
 
# --- Two views: 3D surface + 2D contour ---
fig = plt.figure(figsize=(14, 5))
 
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot_surface(WW1, WW2, L, cmap='viridis', alpha=0.85, edgecolor='none')
ax1.scatter(2.0, 1.0, loss(2.0, 1.0), color='red', s=120, label='true min ≈ (2, 1)')
ax1.set_xlabel('w₁ (slope)'); ax1.set_ylabel('w₂ (bias)'); ax1.set_zlabel('MSE loss')
ax1.set_title('3D loss surface\n(2-param linear regression)')
ax1.legend()
 
ax2 = fig.add_subplot(1, 2, 2)
contour = ax2.contourf(WW1, WW2, L, levels=20, cmap='viridis')
ax2.contour(WW1, WW2, L, levels=20, colors='white', linewidths=0.3, alpha=0.5)
ax2.scatter(2.0, 1.0, color='red', s=120, marker='*', label='true min')
ax2.set_xlabel('w₁'); ax2.set_ylabel('w₂')
ax2.set_title('Same surface as 2D contour (top view)')
plt.colorbar(contour, ax=ax2, label='loss')
ax2.legend()
 
plt.tight_layout(); plt.show()
 
print(f"Loss at true minimum (w1=2, w2=1):  {loss(2.0, 1.0):.4f}")
print(f"Loss at random point   (w1=0, w2=0): {loss(0.0, 0.0):.4f}")
print(f"Loss at far point      (w1=5, w2=−2): {loss(5.0, -2.0):.4f}")

What this shows:

The surface is convex (one global minimum, no local traps) because linear regression with MSE is a convex quadratic in θ. This is why GD with a sufficiently small learning rate always finds the optimum here.
The minimum sits near (w₁≈2, w₂≈1), close to but not exactly the parameters that generated the data, because the noise shifts the least-squares solution slightly.
Adding more parameters would make this undrawable: 3 params → 4D, 100 params → 101D.
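
The convexity claim can be checked numerically: on this kind of bowl, plain gradient descent should land on the closed-form least-squares solution. The sketch below uses a small hand-written dataset and a learning rate of 0.1, both chosen as assumptions for the example.

```python
# Small fixed dataset (an assumption): roughly y = 2x + 1 with hand-written noise.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-3.1, -0.8, 1.2, 2.9, 5.1]
n = len(xs)

def gradient(w1, w2):
    """Exact gradient of the MSE loss with respect to (w1, w2)."""
    residuals = [w1 * x + w2 - y for x, y in zip(xs, ys)]
    g1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    g2 = 2.0 / n * sum(residuals)
    return g1, g2

# Closed-form least-squares solution for one feature plus bias.
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Plain gradient descent from a deliberately bad start.
w1, w2 = -5.0, 8.0
for _ in range(500):
    g1, g2 = gradient(w1, w2)
    w1 -= 0.1 * g1
    w2 -= 0.1 * g2

print(f"closed form: ({slope:.4f}, {intercept:.4f})")
print(f"GD result:   ({w1:.4f}, {w2:.4f})")
```

The two results agree to many decimal places, and neither is exactly (2, 1): the minimum of the surface is the least-squares fit to the noisy data.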

🌋 A more realistic landscape — multiple minima, a saddle, and 4 optimizers fighting it out

The linear-regression bowl above is too easy. Real loss surfaces (and most things you’ll meet in MoAI) are non-convex: multiple local minima, saddle points, narrow valleys. Here’s a custom 2-parameter landscape designed to expose every weakness of every algorithm at once:

4 Gaussian wells of different depths (one is the global min, three are traps of different severity)
Barriers between the wells, which arise where the background bowl rises between neighboring wells
A gentle background bowl that pulls everything toward the origin

We then run 4 optimizers from the same starting point and overlay their trajectories. You see at a glance which one gets stuck where, and why.

# Pyodide / Obsidian Execute Code: install matplotlib first.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa
 
# ════════════════════════════════════════════════════════════════
# 🎛️  Try changing these
# ════════════════════════════════════════════════════════════════
START       = (-3.8, -3.5)   # starting point of all optimizers (in local A's basin)
LR          = 0.10           # learning rate for GD-family
MOMENTUM    = 0.92           # higher momentum so it can escape local A
SA_T0       = 4.0            # initial temperature for SA (higher = more exploration)
SA_COOL     = 0.992          # slower cooling so SA has time to find global
N_STEPS     = 400
SEED        = 3
# ════════════════════════════════════════════════════════════════
 
# --- Define a multi-modal loss: 4 wells of varying depth + gentle background bowl ---
WELLS = [
    # (x0,  y0,  depth, width)   — deeper = lower loss = better minimum
    (-2.5, -2.5,  5.0, 0.9),    # local A — moderate trap, NEAR the starting point
    ( 3.0, -1.0,  8.0, 0.8),    # GLOBAL — deepest, far away in a different quadrant
    (-1.8,  2.8,  3.0, 0.8),    # local B — shallow trap
    ( 1.5,  2.5,  2.5, 0.7),    # local C — very shallow
]
GLOBAL_XY = (3.0, -1.0)
 
def loss(x, y):
    z = 0.10 * (x**2 + y**2)                            # background bowl (creates barriers between wells)
    for x0, y0, depth, width in WELLS:
        z -= depth * np.exp(-((x - x0)**2 + (y - y0)**2) / (2 * width**2))
    return z
 
def grad(x, y, eps=1e-3):
    """Numerical gradient — works for any loss without manual derivation."""
    gx = (loss(x + eps, y) - loss(x - eps, y)) / (2 * eps)
    gy = (loss(x, y + eps) - loss(x, y - eps)) / (2 * eps)
    return gx, gy
 
# --- Four optimizers, same starting point ---
def run_gd(x0, y0, lr=LR, steps=N_STEPS):
    xs, ys = [x0], [y0]
    x, y = x0, y0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx; y -= lr * gy
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_gd_momentum(x0, y0, lr=LR, mom=MOMENTUM, steps=N_STEPS):
    xs, ys = [x0], [y0]
    x, y = x0, y0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = mom * vx - lr * gx
        vy = mom * vy - lr * gy
        x += vx; y += vy
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_sa(x0, y0, T0=SA_T0, cool=SA_COOL, steps=N_STEPS, step_size=0.7):
    """Bigger step size + slower cooling so SA can traverse the bowl."""
    random.seed(SEED)
    xs, ys = [x0], [y0]
    x, y = x0, y0; T = T0
    for _ in range(steps):
        nx = x + random.gauss(0, step_size)
        ny = y + random.gauss(0, step_size)
        dE = loss(nx, ny) - loss(x, y)
        if dE < 0 or random.random() < np.exp(-dE / max(T, 1e-9)):
            x, y = nx, ny
        T *= cool
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_random_restart_gd(x0, y0, n_restarts=5, steps_per=N_STEPS // 5):
    """5 random restarts of GD, sampling uniformly across the WHOLE landscape (not just near start)."""
    random.seed(SEED)
    all_xs, all_ys = [], []
    best_loss = np.inf; best_xy = (x0, y0)
    for r in range(n_restarts):
        if r == 0:
            sx, sy = x0, y0                          # 1st run from given start
        else:
            sx = random.uniform(-4.5, 4.5)           # subsequent runs: sample globally
            sy = random.uniform(-4.5, 4.5)
        xs, ys = run_gd(sx, sy, steps=steps_per)
        all_xs.extend(xs); all_ys.extend(ys)
        final_loss = loss(xs[-1], ys[-1])
        if final_loss < best_loss:
            best_loss = final_loss; best_xy = (xs[-1], ys[-1])
    return np.array(all_xs), np.array(all_ys), best_xy
 
# Run all four
gd_x,  gd_y  = run_gd(*START)
gdm_x, gdm_y = run_gd_momentum(*START)
sa_x,  sa_y  = run_sa(*START)
rr_x,  rr_y, rr_best = run_random_restart_gd(*START)
 
# --- Plot: 3D surface + contour with trajectories ---
xx = np.linspace(-5, 5, 200); yy = np.linspace(-5, 5, 200)
XX, YY = np.meshgrid(xx, yy)
ZZ = loss(XX, YY)
 
fig = plt.figure(figsize=(15, 6.5))
 
# Left: 3D surface — neutral 'bone_r' so colored trajectories will pop
ax3d = fig.add_subplot(1, 2, 1, projection='3d')
ax3d.plot_surface(XX, YY, ZZ, cmap='bone_r', alpha=0.85, edgecolor='none', rstride=4, cstride=4)
for x0, y0, depth, _ in WELLS:
    is_global = (x0, y0) == GLOBAL_XY
    ax3d.scatter(x0, y0, loss(x0, y0),
                 color='red' if is_global else 'orange',
                 s=120 if is_global else 70,
                 marker='*' if is_global else 'o',
                 edgecolor='black', linewidth=1.5, zorder=10)
ax3d.set_xlabel('θ₁'); ax3d.set_ylabel('θ₂'); ax3d.set_zlabel('loss')
ax3d.set_title('3D loss surface — 4 wells of varying depth')
ax3d.view_init(elev=35, azim=-55)
 
# Right: contour with all four optimizer trajectories
ax = fig.add_subplot(1, 2, 2)
# Background: light blue-to-yellow cmap with reduced contrast so trajectories dominate
contour = ax.contourf(XX, YY, ZZ, levels=30, cmap='YlGnBu_r', alpha=0.55)
ax.contour(XX, YY, ZZ, levels=20, colors='dimgray', linewidths=0.4, alpha=0.5)
 
# Draw order matters: SA + restarts go down first, then momentum, then vanilla GD ON TOP
# (vanilla GD overlaps with momentum at the start — dashed line makes both visible)
ax.plot(rr_x, rr_y, 'o', color='#000000', ms=3.2,                        # black dots for restarts
        label='Random-Restart GD (5 runs)', alpha=0.85, markeredgecolor='white',
        markeredgewidth=0.6, zorder=3)
ax.plot(sa_x, sa_y, '-', color='#ff00ff', lw=1.4,                        # magenta SA
        label='Simulated Annealing', alpha=0.85, zorder=4)
ax.plot(gdm_x, gdm_y, '-', color='#1f77b4', lw=3.2,                      # solid blue: momentum
        label='GD + momentum', alpha=0.95, zorder=5,
        solid_capstyle='round')
ax.plot(gd_x, gd_y, '--', color='#d62728', lw=2.8,                       # DASHED red on top: vanilla GD
        label='vanilla GD', alpha=1.0, zorder=6,
        dashes=(4, 3))
 
# Mark endpoints of each optimizer with a big bordered marker
def mark_end(ax, x, y, color, size=240):
    ax.scatter(x, y, color=color, s=size, marker='D', edgecolor='white',
               linewidth=2.2, zorder=8)
mark_end(ax, gd_x[-1],  gd_y[-1],  '#d62728')
mark_end(ax, gdm_x[-1], gdm_y[-1], '#1f77b4')
mark_end(ax, sa_x[-1],  sa_y[-1],  '#ff00ff')
mark_end(ax, rr_best[0], rr_best[1], '#000000')
 
# Mark wells: gold star = global min, hollow circles = local minima
for x0, y0, depth, _ in WELLS:
    if (x0, y0) == GLOBAL_XY:
        ax.scatter(x0, y0, color='gold', s=550, marker='*', edgecolor='black',
                   linewidth=2.0, zorder=10, label='GLOBAL min')
    else:
        ax.scatter(x0, y0, facecolor='none', s=260, marker='o', edgecolor='black',
                   linewidth=1.8, zorder=9)
ax.scatter(*START, color='lime', s=260, marker='X', edgecolor='black',
           linewidth=2.2, zorder=10, label='start')
 
ax.set_xlabel('θ₁'); ax.set_ylabel('θ₂')
ax.set_title('Same surface (top view) — 4 optimizers compared\n'
             '◇ diamond = where each ended up | ★ gold = global min | ○ = local minima')
ax.legend(loc='upper left', fontsize=8.5, framealpha=0.92)
plt.colorbar(contour, ax=ax, label='loss')
 
plt.tight_layout(); plt.show()
 
# --- Report final results ---
print("\n📊 FINAL LOSSES (from the same starting point)")
print(f"  Vanilla GD          : final = ({gd_x[-1]:+.2f}, {gd_y[-1]:+.2f})   loss = {loss(gd_x[-1], gd_y[-1]):+.3f}")
print(f"  GD + momentum       : final = ({gdm_x[-1]:+.2f}, {gdm_y[-1]:+.2f})   loss = {loss(gdm_x[-1], gdm_y[-1]):+.3f}")
print(f"  Simulated Annealing : final = ({sa_x[-1]:+.2f}, {sa_y[-1]:+.2f})   loss = {loss(sa_x[-1], sa_y[-1]):+.3f}")
print(f"  Random-Restart GD   : best  = ({rr_best[0]:+.2f}, {rr_best[1]:+.2f})   loss = {loss(*rr_best):+.3f}")
print(f"\n  Global minimum is at {GLOBAL_XY}, loss ≈ {loss(*GLOBAL_XY):.3f}")

What you’ll typically see

Optimizer	Where it ends up	Why
Vanilla GD 🔴	Trapped in local A at (−2.5, −2.5)	Starts in local A’s basin, follows the steepest gradient straight down. No mechanism to escape — once it’s in a basin, it converges to that basin’s minimum no matter how shallow.
GD + momentum 🔵	Escapes local A and usually reaches the global min at (3.0, −1.0)	Accumulated velocity (`v ← 0.92·v − η·∇L`) carries it over the small ridge separating local A from the global basin. Without momentum, the optimizer settles at A’s minimum, where the gradient vanishes.
Simulated Annealing 🟣	Wanders across the landscape, usually settles in the global basin	High initial T=4.0 → accepts uphill moves with high probability → climbs out of local A early; slow cooling (0.992) keeps exploration alive long enough to find the global basin, then refines inside it.
Random-Restart GD ⚫	Almost always finds global — at least one of the 5 restarts lands in the global basin	The 1st run starts from (−3.8, −3.5) → falls into local A. Runs 2-5 are sampled uniformly from the whole [-4.5, 4.5]² square → high chance one lands near (3, −1) → that run converges to the global min and beats the others.

Why this landscape exposes each algorithm’s weakness

GD’s blindness: starting in local A’s basin, the gradient points downhill INTO local A. GD has no concept of “global” — it only knows local slopes.
Momentum’s saving grace: the local-A minimum is shallow enough that velocity built up while descending into it carries the optimizer back UP and over the ridge toward the global basin. This is why momentum often works as a “free upgrade” over vanilla GD.
SA’s wandering: the trajectory looks chaotic in the contour plot — that’s the point. The randomness is what lets it explore globally, but it costs precision and lots of evaluations.
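Random-restart’s cost: k restarts that each run to convergence cost about k× the compute of a single run. (The demo instead splits one 400-step budget into 5 runs of 80 steps each.) If evaluating L is expensive (training a real network), this is prohibitive, although the restarts are embarrassingly parallel.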
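
How far can built-up velocity carry an optimizer once the gradient stops helping? A minimal sketch, assuming an idealized perfectly flat region (gradient exactly zero), shows the coasting distance implied by the momentum update. The helper name `coast_distance` is hypothetical.

```python
def coast_distance(initial_velocity, momentum, steps=400):
    """Distance travelled on a perfectly flat region (gradient = 0 everywhere)."""
    x, v = 0.0, initial_velocity
    for _ in range(steps):
        v = momentum * v - 0.0      # v <- mom * v - lr * grad, with grad = 0
        x += v
    return x

print(coast_distance(1.0, 0.92))    # approaches 0.92 / (1 - 0.92) = 11.5
print(coast_distance(1.0, 0.0))     # no momentum: the optimizer stops at once
```

With momentum 0.92, a unit of velocity is worth about 11.5 units of further travel, which is the geometric series μ + μ² + μ³ + … = μ / (1 − μ). On a real surface the uphill gradient eats into this budget, so whether a ridge is cleared depends on its height and width.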

The deeper lesson

No single algorithm is best on this landscape. The “right” choice depends on what’s available:

Gradient available + convex-ish → GD/SGD (fast, but myopic)

Gradient available + many local minima → GD with restart or momentum + warm starts

No gradient + rugged → SA, GA, or Bayesian Optimization

Expensive evaluations → BayesOpt (builds a model of the surface from few samples)

This is exactly what Algorithm Decision Tree — MoAI codifies: which algorithm for which surface shape.

🚀 State-of-the-art optimizers — Adam, RMSprop, Nesterov, AdamW

The “vanilla GD + momentum” picture is the 1980s view. SGD with momentum is still in use (many CV training recipes rely on it), but since ~2015 most deep learning training runs on adaptive optimizers that scale the step size per parameter from running gradient statistics. This reduces, but does not remove, the need to tune learning rates by hand, which matters when a model has billions of weights.

The 4 modern workhorses

Optimizer	Year	The one-line idea	Where it dominates
Nesterov Accelerated Gradient (NAG)	1983	Look one step ahead before computing the gradient — gives momentum foresight	Convex problems with momentum (still used in many CV training recipes)
RMSprop	2012 (Hinton’s Coursera lectures)	Divide gradient by a running RMS of recent gradients → per-parameter LR	RNNs, early deep nets (largely superseded by Adam)
Adam	2014 (Kingma & Ba)	RMSprop + momentum + bias correction → the universal default	The most common default in deep learning training since ~2015
AdamW	2017 (Loshchilov & Hutter)	Adam with decoupled weight decay (not adding decay to the gradient but applying it directly to the weights)	Common default for transformer training (e.g. LLaMA)
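
The difference between Adam with L2 regularization and AdamW's decoupled decay is easiest to see on a single scalar weight at the first step. The sketch below is an illustration with assumed values (lr = 1e-3, wd = 0.01, zero loss gradient); `adam_direction` is a hypothetical helper, not a library function.

```python
import math

def adam_direction(grad, t=1, m=0.0, v=0.0, b1=0.9, b2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) for one scalar."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps)

lr, wd = 1e-3, 0.01
theta, loss_grad = 1.0, 0.0        # zero loss gradient isolates the decay term

# Adam with L2 regularization: decay is added to the gradient, then normalized.
theta_l2 = theta - lr * adam_direction(loss_grad + wd * theta)

# AdamW: the adaptive step sees only the loss gradient; decay is applied directly.
theta_adamw = theta - lr * adam_direction(loss_grad) - lr * wd * theta

print(theta_l2, theta_adamw)
```

With L2 in the gradient, Adam's normalization turns the tiny decay term into a full step of size lr, so the weight shrinks by about 1e-3. With decoupled decay the weight shrinks by lr·wd = 1e-5, which is the amount the weight-decay coefficient actually asks for.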

Code — gradient methods vs stochastic methods on a deep multi-well landscape

The landscape has 3 deep local minima + 1 even deeper global minimum (depths 3.0, 3.5, 4.0, 6.5; all genuine traps), plus a gentle background bowl. From the SW start, the global minimum sits in the SE quadrant, while local A is the nearest well. We run 5 gradient-based optimizers (vanilla GD, Nesterov, RMSprop, Adam, AdamW) + 2 stochastic methods (Simulated Annealing, Random-Restart GD) and watch which ones get tricked.

# Pyodide / Obsidian Execute Code: install matplotlib first.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa
 
# ════════════════════════════════════════════════════════════════
# 🎛️  Try changing these
# ════════════════════════════════════════════════════════════════
# 4 DEEP wells — every one is a real trap, not a speed-bump.
WELLS = [
    (-2.5, -2.5, 3.5, 1.0),     # local A — deep trap SW (same quadrant as start)
    ( 2.5,  2.5, 4.0, 1.1),     # local B — deep distractor NE
    (-2.5,  2.5, 3.0, 1.0),     # local C — deep distractor NW
    ( 2.5, -2.5, 6.5, 1.0),     # GLOBAL: deepest, in the SE quadrant (start is SW)
]
GLOBAL_XY = (2.5, -2.5)
 
START      = (-4.0, -4.0)        # SW corner — falls into local A's basin
N_STEPS    = 400
LR         = 0.20
B1         = 0.95                # high momentum — won't help against deep wells
SA_T0      = 4.0                 # SA: initial temperature
SA_COOL    = 0.992
SA_SEED    = 1                   # SA seed that reliably reaches global
RR_RESTART = 5                   # number of random-restart runs
RR_SEED    = 0
# ════════════════════════════════════════════════════════════════
 
def loss(x, y):
    z = 0.04 * (x**2 + y**2)                                # gentle background bowl
    for x0, y0, depth, w in WELLS:
        z -= depth * np.exp(-((x - x0)**2 + (y - y0)**2) / (2 * w**2))
    return z
 
def grad(x, y, eps=1e-3):
    gx = (loss(x+eps, y) - loss(x-eps, y)) / (2*eps)
    gy = (loss(x, y+eps) - loss(x, y-eps)) / (2*eps)
    return gx, gy
 
# ---- Gradient-based optimizers (all expected to get tricked) ----
def run_gd(lr=LR, steps=N_STEPS):
    x, y = START; xs, ys = [x], [y]
    for _ in range(steps):
        gx, gy = grad(x, y); x -= lr*gx; y -= lr*gy
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_nesterov(lr=0.15, mom=B1, steps=N_STEPS):
    x, y = START; xs, ys = [x], [y]; vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x + mom*vx, y + mom*vy)
        vx = mom*vx - lr*gx; vy = mom*vy - lr*gy
        x += vx; y += vy
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_rmsprop(lr=LR, beta=0.9, eps=1e-8, steps=N_STEPS):
    x, y = START; xs, ys = [x], [y]; s_gx, s_gy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        s_gx = beta*s_gx + (1-beta)*gx**2
        s_gy = beta*s_gy + (1-beta)*gy**2
        x -= lr * gx / (np.sqrt(s_gx) + eps)
        y -= lr * gy / (np.sqrt(s_gy) + eps)
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_adam(lr=LR, b1=B1, b2=0.999, eps=1e-8, steps=N_STEPS):
    x, y = START; xs, ys = [x], [y]
    mx, my, vx, vy = 0.0, 0.0, 0.0, 0.0
    for t in range(1, steps+1):
        gx, gy = grad(x, y)
        mx = b1*mx + (1-b1)*gx;     my = b1*my + (1-b1)*gy
        vx = b2*vx + (1-b2)*gx**2;  vy = b2*vy + (1-b2)*gy**2
        mx_h = mx / (1 - b1**t);     my_h = my / (1 - b1**t)
        vx_h = vx / (1 - b2**t);     vy_h = vy / (1 - b2**t)
        x -= lr * mx_h / (np.sqrt(vx_h) + eps)
        y -= lr * my_h / (np.sqrt(vy_h) + eps)
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_adamw(lr=LR, b1=B1, b2=0.999, eps=1e-8, wd=0.003, steps=N_STEPS):
    x, y = START; xs, ys = [x], [y]
    mx, my, vx, vy = 0.0, 0.0, 0.0, 0.0
    for t in range(1, steps+1):
        gx, gy = grad(x, y)
        mx = b1*mx + (1-b1)*gx;     my = b1*my + (1-b1)*gy
        vx = b2*vx + (1-b2)*gx**2;  vy = b2*vy + (1-b2)*gy**2
        mx_h = mx / (1 - b1**t);     my_h = my / (1 - b1**t)
        vx_h = vx / (1 - b2**t);     vy_h = vy / (1 - b2**t)
        x = x - lr * mx_h / (np.sqrt(vx_h) + eps) - lr * wd * x
        y = y - lr * my_h / (np.sqrt(vy_h) + eps) - lr * wd * y
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
# ---- Stochastic methods (these CAN escape deep wells) ----
def run_sa(T0=SA_T0, cool=SA_COOL, steps=N_STEPS, step_size=0.7, seed=SA_SEED):
    random.seed(seed)
    x, y = START; xs, ys = [x], [y]; T = T0
    for _ in range(steps):
        nx = x + random.gauss(0, step_size); ny = y + random.gauss(0, step_size)
        dE = loss(nx, ny) - loss(x, y)
        if dE < 0 or random.random() < np.exp(-dE/max(T, 1e-9)):
            x, y = nx, ny
        T *= cool
        xs.append(x); ys.append(y)
    return np.array(xs), np.array(ys)
 
def run_random_restart(n_restarts=RR_RESTART, steps_per=N_STEPS // RR_RESTART,
                       lr=LR, seed=RR_SEED):
    random.seed(seed)
    all_xs, all_ys = [START[0]], [START[1]]
    best_loss = np.inf; best_xy = START
    for r in range(n_restarts):
        sx, sy = (START if r == 0
                  else (random.uniform(-4.5, 4.5), random.uniform(-4.5, 4.5)))
        x, y = sx, sy
        for _ in range(steps_per):
            gx, gy = grad(x, y); x -= lr*gx; y -= lr*gy
            all_xs.append(x); all_ys.append(y)
        if loss(x, y) < best_loss:
            best_loss = loss(x, y); best_xy = (x, y)
    return np.array(all_xs), np.array(all_ys), best_xy
 
# Run all 7
gd_x,  gd_y  = run_gd()
rms_x, rms_y = run_rmsprop()
nes_x, nes_y = run_nesterov()
adm_x, adm_y = run_adam()
adw_x, adw_y = run_adamw()
sa_x,  sa_y  = run_sa()
rr_x,  rr_y, rr_best = run_random_restart()
 
# --- Plot: 3D surface + 2D contour with all trajectories ---
xx = np.linspace(-5, 5, 220); yy = np.linspace(-5, 5, 220)
XX, YY = np.meshgrid(xx, yy); ZZ = loss(XX, YY)
 
fig = plt.figure(figsize=(15.5, 7))
 
# Left: 3D surface
ax3d = fig.add_subplot(1, 2, 1, projection='3d')
ax3d.plot_surface(XX, YY, ZZ, cmap='bone_r', alpha=0.85, edgecolor='none', rstride=4, cstride=4)
for x0, y0, depth, _ in WELLS:
    is_g = (x0, y0) == GLOBAL_XY
    ax3d.scatter(x0, y0, loss(x0, y0),
                 color='red' if is_g else 'orange',
                 s=140 if is_g else 80,
                 marker='*' if is_g else 'o',
                 edgecolor='black', linewidth=1.5, zorder=10)
ax3d.set_xlabel('θ₁'); ax3d.set_ylabel('θ₂'); ax3d.set_zlabel('loss')
ax3d.set_title('3D loss surface: 4 deep wells, GLOBAL in the SE, start in the SW\n'
               '(★ red = global at (2.5, -2.5))')
ax3d.view_init(elev=38, azim=-58)
 
# Right: contour + trajectories
ax = fig.add_subplot(1, 2, 2)
contour = ax.contourf(XX, YY, ZZ, levels=28, cmap='YlGnBu_r', alpha=0.55)
ax.contour(XX, YY, ZZ, levels=18, colors='dimgray', linewidths=0.4, alpha=0.5)
 
# All 5 gradient methods cluster in local A → use DISTINCT linestyles so each
# is recognizable even when paths overlap. End-diamonds get a small offset
# arranged in a cross pattern around the actual endpoint.
grad_methods = [
    # (name,       xs,    ys,    color,     linestyle,           linewidth, diamond-offset)
    ('Vanilla GD', gd_x,  gd_y,  '#404040', '-',                  3.2,      ( 0.00,  0.00)),
    ('RMSprop',    rms_x, rms_y, '#9467bd', (0, (6, 3)),          2.4,      ( 0.30,  0.00)),
    ('Nesterov',   nes_x, nes_y, '#ff7f00', (0, (1, 1.5)),        3.0,      (-0.30,  0.00)),
    ('Adam',       adm_x, adm_y, '#d62728', (0, (3, 1, 1, 1)),    2.4,      ( 0.00,  0.30)),
    ('AdamW',      adw_x, adw_y, '#2ca02c', (0, (5, 2, 1, 2)),    2.4,      ( 0.00, -0.30)),
]
for i, (name, xs, ys, color, style, lw, _) in enumerate(grad_methods):
    ax.plot(xs, ys, color=color, lw=lw, alpha=0.80, linestyle=style,
            label=f'{name} → local A', zorder=4 + i*0.1)
# Draw endpoint diamonds slightly offset so all 5 are individually visible
for name, xs, ys, color, _, _, (dx, dy) in grad_methods:
    ax.scatter(xs[-1] + dx, ys[-1] + dy, color=color, s=200, marker='D',
               edgecolor='white', linewidth=1.8, zorder=9)
 
# Stochastic methods → escape to global (drawn ON TOP, distinct styles)
ax.plot(sa_x, sa_y, '-', color='magenta', lw=1.8, alpha=0.80,
        label='Simulated Annealing → GLOBAL', zorder=6)
ax.scatter(sa_x[-1], sa_y[-1], color='magenta', s=260, marker='D',
           edgecolor='white', linewidth=2.2, zorder=10)
 
ax.plot(rr_x, rr_y, 'o', color='black', ms=2.8, alpha=0.85,
        label='Random-Restart GD → GLOBAL', zorder=5, markeredgecolor='white', markeredgewidth=0.5)
ax.scatter(rr_best[0], rr_best[1], color='black', s=280, marker='D',
           edgecolor='white', linewidth=2.4, zorder=10)
 
# Wells: gold star for global, hollow black rings + labels for locals
for (x0, y0, depth, _), name in zip(WELLS, ['A', 'B', 'C', 'GLOBAL']):
    if name == 'GLOBAL':
        ax.scatter(x0, y0, color='gold', s=580, marker='*', edgecolor='black',
                   linewidth=2.0, zorder=11, label='GLOBAL min')
    else:
        ax.scatter(x0, y0, facecolor='none', s=280, marker='o', edgecolor='black',
                   linewidth=2.0, zorder=10)
        ax.annotate(name, (x0, y0), textcoords='offset points', xytext=(12, 10),
                    fontsize=11, fontweight='bold', color='black')
ax.scatter(*START, color='lime', s=280, marker='X', edgecolor='black',
           linewidth=2.4, zorder=11, label='start')
 
ax.set_xlabel('θ₁'); ax.set_ylabel('θ₂')
ax.set_title('Same surface (top view) — 5 gradient methods trapped, 2 stochastic methods escape\n'
             '◇ diamond = endpoint | ★ gold = global | ○ = local minima')
ax.set_xlim(-5, 5); ax.set_ylim(-5, 5)             # LOCK axes to landscape extent
ax.set_aspect('equal')                              # square plot
ax.legend(loc='upper left', fontsize=7.5, framealpha=0.95, ncol=1)   # upper-left = empty quadrant
plt.colorbar(contour, ax=ax, label='loss')
 
plt.tight_layout(); plt.show()
 
# --- Numeric report ---
well_names = ['A', 'B', 'C', 'GLOBAL']
def nearest_well(end):
    dists = [np.hypot(end[0]-w[0], end[1]-w[1]) for w in WELLS]
    return well_names[int(np.argmin(dists))]
 
print("\n📊 FINAL POSITIONS — gradient methods vs stochastic methods")
print(f"{'method':22s} {'endpoint':>18s} {'loss':>10s} {'→ well':>10s}")
print("-" * 65)
for name, xs, ys, *_ in grad_methods:   # each tuple has 7 fields; ignore the styling ones
    end = (xs[-1], ys[-1])
    print(f"  {name:20s}: ({end[0]:+.2f}, {end[1]:+.2f})   loss = {loss(*end):+.3f}   → {nearest_well(end)}")
print("-" * 65)
print(f"  {'Simulated Annealing':20s}: ({sa_x[-1]:+.2f}, {sa_y[-1]:+.2f})   loss = {loss(sa_x[-1], sa_y[-1]):+.3f}   → {nearest_well((sa_x[-1], sa_y[-1]))}")
print(f"  {'Random-Restart GD':20s}: ({rr_best[0]:+.2f}, {rr_best[1]:+.2f})   loss = {loss(*rr_best):+.3f}   → {nearest_well(rr_best)}")
print(f"\n  Global at {GLOBAL_XY}, loss = {loss(*GLOBAL_XY):.3f}")
print(f"  Local A at (-2.5, -2.5), loss = {loss(-2.5, -2.5):.3f}")

What this shows — the honest verdict on adaptive optimizers

Method	Outcome	Why
Vanilla GD ⚫	Trapped in local A	Follows gradient straight down into A. No escape mechanism.
RMSprop 🟣	Trapped in local A	Per-parameter LR, no momentum. Once in A’s basin, all gradients point inward → stuck.
Nesterov 🟠	Trapped in local A	Momentum builds toward A, dies at A’s bottom. Not enough to clear a depth-3.5 basin.
Adam 🔴	Trapped in local A	Adam’s momentum + per-param LR aren’t magic — once the gradient is zero at A’s bottom and recent gradients all point back to A, Adam stops.
AdamW 🟢	Trapped in local A	Same as Adam. Weight decay nudges it slightly toward origin but doesn’t help escape.
Simulated Annealing 🟪	Escapes → GLOBAL	Accepts uphill moves probabilistically (`exp(-ΔE/T)`) → climbs out of A early while T is high, eventually settles in global basin.
Random-Restart GD ⬛	Escapes → GLOBAL	1st run trapped in A; runs 2–5 sample new random points across the whole landscape → at least one starts in global’s basin → wins.
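
The acceptance rule `exp(-ΔE/T)` from the table can be evaluated directly for the schedule used above (T0 = 4.0, cooling factor 0.992, 400 steps). The helper name and the uphill step of ΔE = 1.0 are assumptions picked for illustration.

```python
import math

def acceptance_probability(delta_e, temperature):
    """Metropolis rule: downhill moves always accepted, uphill with exp(-dE/T)."""
    return 1.0 if delta_e < 0 else math.exp(-delta_e / temperature)

T0, cool, steps = 4.0, 0.992, 400
T_end = T0 * cool ** steps

p_start = acceptance_probability(1.0, T0)
p_end = acceptance_probability(1.0, T_end)
print(f"T = {T0:.3f}: accept an uphill step of 1.0 with p = {p_start:.3f}")
print(f"T = {T_end:.3f}: accept an uphill step of 1.0 with p = {p_end:.3f}")
```

Early on, most uphill moves of this size are accepted, which is what lets the search leave local A. By the end of the schedule almost none are, so the search behaves like a noisy descent inside whichever basin it is in.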

The brutal lesson: adaptive optimizers do NOT solve the local-minima problem

Look at the contour plot: all five colored diamonds cluster on top of each other inside local A. Adam, AdamW, Nesterov, RMSprop, and vanilla GD all end up at essentially the same point. Despite decades of optimizer research, none of them can escape a genuinely deep local minimum on its own. Their advantage over vanilla GD is speed on ill-conditioned, roughly convex problems, not escape from the wrong basin of a multi-modal landscape.
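
The same effect shows up in one dimension. Below is a small sketch on a hypothetical double-well function (the function, start point, learning rates, and step count are all assumptions made for this example, not the landscape used above). Both plain GD and a hand-written Adam start on the side of the shallower well and should settle there, even though a lower minimum exists on the other side of the barrier.

```python
import math

def f(x):
    # Double well: shallower minimum near x = +0.96, deeper one near x = -1.04
    return (x**2 - 1.0) ** 2 + 0.3 * x

def grad_f(x):
    return 4.0 * x * (x**2 - 1.0) + 0.3

def run_gd(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

def run_adam(x, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1**t)               # bias correction
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

start = 2.0                                      # inside the shallower well's basin
x_gd, x_adam = run_gd(start), run_adam(start)
print(f"GD   ends at x = {x_gd:+.3f}, loss = {f(x_gd):+.3f}")
print(f"Adam ends at x = {x_adam:+.3f}, loss = {f(x_adam):+.3f}")
print(f"Deeper well at x = -1.04 has loss = {f(-1.04):+.3f}")
```

Adam's per-step movement is bounded by roughly the learning rate, so its momentum cannot carry it over a barrier that is a full unit wide; it only reaches the bottom of the nearest well in fewer or better-scaled steps.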

The only things that escape deep locals:

Stochasticity in the gradient (mini-batch SGD’s noise, which we don’t simulate here)

Stochasticity in the search (Simulated Annealing’s random uphill moves)

Stochasticity in the initialization (Random-Restart, the canonical fix)

Population diversity (Genetic Algorithms — different chromosomes start in different basins)

This is why real LLM training combines Adam (for ill-conditioning) + mini-batch SGD noise (for escaping locals) + random initialization (for landing in different basins). No single mechanism solves both problems.

The contour plot's 3 stories in one image

Tight cluster of 5 colored diamonds in local A = every gradient method, regardless of momentum/adaptivity, gets fooled identically.

Magenta SA trajectory wanders chaotically across the whole map then settles in global = randomness as the escape mechanism.

Black scattered dots from Random-Restart sampling the whole space = brute-force diversity beats clever gradient tricks on multi-modal problems.

Newer & experimental (2023–2025)

These mostly haven’t displaced AdamW yet, but appear in recent papers:

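Lion (Google, 2023) — updates each weight by the sign of an interpolation between the momentum and the current gradient. It keeps one state buffer per parameter instead of Adam's two, so optimizer-state memory is roughly halved. Competitive on vision; mixed results on LLMs.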
Sophia (Stanford, 2023) — second-order optimizer using a Hessian diagonal estimate. ~2× speedup claimed on GPT-2 scale; not yet adopted at production scale.
Shampoo / Distributed Shampoo (Google) — second-order-style preconditioning that approximates a full-matrix preconditioner per layer with Kronecker-factored statistics. Used internally at Google for some training runs.
Muon (Keller Jordan, 2024) — orthogonalized momentum via Newton-Schulz iteration. Held a brief speed-of-training record on the nanoGPT benchmark in late 2024.
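
To make the first entry concrete, here is a minimal NumPy sketch of a single Lion step, following the update rule published with the optimizer (Chen et al., 2023). The function name `lion_step` and the default hyperparameter values are illustrative assumptions for this sketch; it is not a drop-in replacement for any library implementation.

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update. Returns (new_theta, new_m)."""
    # Direction: sign of an interpolation between momentum and the current gradient
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # Decoupled weight decay, applied in the same style as AdamW
    new_theta = theta - lr * (update + weight_decay * theta)
    # Momentum is refreshed with a second, slower interpolation factor
    new_m = beta2 * m + (1.0 - beta2) * grad
    return new_theta, new_m

theta = np.zeros(2)
m = np.zeros(2)
grad = np.array([1.0, -2.0])

theta, m = lion_step(theta, m, grad, lr=0.1)
print("theta:", theta)   # every coordinate moved by exactly lr, regardless of |grad|
print("m:    ", m)
```

The only state carried between steps is `m`, and every coordinate moves by the same magnitude `lr`, which is why Lion is usually run with a smaller learning rate than AdamW.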

The pattern is the same as every other corner of MoAI: AdamW is the unkillable default, occasionally challenged but never displaced. Backprop computes the gradient; AdamW (or a recent variant) updates the weights. That’s modern deep learning in one sentence.

🌪️ Why real loss surfaces are weird (and high-D matters)

Real neural net loss surfaces have properties our 3D intuition gets wrong:

Property	Intuition (from 3D)	Reality (high-D)
Local minima	The big problem	Surprisingly rare — most “stuck” points are saddle points, not minima
Saddle points	Curiosity	Dominant feature — exponentially more saddles than minima in high-D
Flat regions	Annoying	Common; gradients tiny → optimizer crawls (motivates Adam, momentum)
Sharp vs. flat minima	Doesn’t matter	Flat minima tend to generalize better than sharp ones (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017)
Connected minima	Each min isolated	“Mode connectivity”: minima of large nets can often be connected by low-loss paths (Garipov et al., 2018)
Symmetry	Each setting is unique	Many parameter settings give identical loss (permute hidden units → same network)

This is why gradient descent on million-parameter networks actually works: in high dimensions the typical obstacle is a saddle point, which gradient descent can usually escape (any small perturbation along a downhill direction grows), rather than a poor local minimum, which it cannot.
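
A toy saddle makes the escape mechanism visible. The sketch below uses the textbook function f(x, y) = x² − y², which is an assumption chosen for illustration (it is unbounded below, so "escaping" here just means moving away from the origin). Started exactly on the x-axis, gradient descent slides into the saddle and stays there; with a tiny offset in y it leaves.

```python
import numpy as np

def grad(p):
    # f(x, y) = x**2 - y**2: curves up along x, down along y, saddle at the origin
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def descend(p0, lr=0.1, steps=50):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p -= lr * grad(p)
    return p

on_axis = descend([1.0, 0.0])    # exactly on the attracting axis
nudged = descend([1.0, 1e-3])    # same start plus a tiny perturbation in y

# Mixed-sign Hessian eigenvalues are what identify the origin as a saddle
hessian_eigs = np.linalg.eigvalsh(np.array([[2.0, 0.0], [0.0, -2.0]]))

print("start on axis ->", on_axis)
print("start nudged  ->", nudged)
print("Hessian eigenvalues:", hessian_eigs)
```

The y-coordinate is multiplied by 1.2 on every step, so even a perturbation of 0.001 grows past 1 within 50 steps. Mini-batch noise supplies exactly this kind of perturbation during real training.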

🔭 How real-world loss landscapes are visualized

For a real neural network you cannot draw ℝ^(N+1) — but you can:

2D slice along 2 random directions (Li et al. 2018, Visualizing the Loss Landscape of Neural Nets):
- Pick two random direction vectors δ₁, δ₂ in parameter space
- Normalize them to match the scale of weights (“filter normalization”)
- Plot L(θ* + α·δ₁ + β·δ₂) over a grid
- Beautiful pictures showing how skip connections (ResNet) flatten the landscape compared to plain CNNs
PCA over training trajectory: Save θ every epoch, do PCA on the trajectory, plot loss in the top-2 PC plane. Shows how training actually moves through the landscape.
Linear interpolation between solutions: Train two networks → linearly interpolate their weights → plot loss along the interpolation. Used to study mode connectivity.
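
The random-direction slice can be sketched in a few lines. Everything below is a stand-in: the "model" is a hypothetical 50-parameter quadratic bowl rather than a trained network, and the rescaling matches the norm of the whole parameter vector, which is a simplification of the per-filter normalization described by Li et al. The names `theta_star`, `curvature`, and `random_direction` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params = 50

# Hypothetical stand-in for a trained model: a quadratic bowl with a known minimum
theta_star = rng.normal(size=n_params)
curvature = rng.uniform(0.1, 5.0, size=n_params)

def loss(theta):
    return 0.5 * np.sum(curvature * (theta - theta_star) ** 2)

def random_direction(like):
    """Random direction rescaled to the norm of `like` (whole-vector, not per-filter)."""
    d = rng.normal(size=like.shape)
    return d * (np.linalg.norm(like) / np.linalg.norm(d))

d1 = random_direction(theta_star)
d2 = random_direction(theta_star)

alphas = np.linspace(-1.0, 1.0, 21)
betas = np.linspace(-1.0, 1.0, 21)

# grid[i, j] = L(theta* + alpha_i * d1 + beta_j * d2)
grid = np.array([[loss(theta_star + a * d1 + b * d2) for b in betas] for a in alphas])

print("grid shape:", grid.shape)
print("loss at centre (theta*):", grid[10, 10])
print("largest loss on the slice:", grid.max())
```

The resulting `grid` is what a contour or surface plot would display. For a real network, `loss` would be an evaluation pass over a fixed dataset, so a 21 × 21 slice costs 441 evaluations.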

These visualizations are always 2D slices of a fundamentally higher-dimensional object. Useful for intuition — never the full picture.

🎓 Connection to MoAI algorithms

Every search/optimization algorithm in MoAI operates on some notion of a loss/fitness/value surface:

Hill Climbing — myopic local probing; gets stuck at the first local optimum, ridge, or plateau it reaches
Simulated Annealing — accepts uphill moves with exp(−Δ/T) to escape local minima
Local Beam Search — k parallel probes; “fitness landscape” terminology
Genetic Algorithms — population samples spread across the surface; selection + crossover combine samples from different regions
Gradient Descent — uses the gradient (local slope) to descend fastest
Gradient Backpropagation — the computational technique that gives you ∇L(θ) cheaply for neural nets
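Q-Function / Reinforcement Learning (RL) — replace “loss surface” with a “value surface” (V(s) over states, or Q(s, a) over state-action pairs); the same local-search picture applies, except that value is maximized rather than minimized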

The exam meta-point

“Why does algorithm X get stuck?” almost always reduces to “X cannot see the global shape of the surface — only its local neighborhood.” The fixes (restart, temperature, beams, momentum, second-order info) are all ways of using slightly more global information without paying the full cost of mapping the entire surface.
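
A tiny discrete example shows both the failure and one of the fixes. This is a sketch under stated assumptions: the 7-cell `landscape` is made up, the climber maximizes (as hill climbing conventionally does), and the restart points are fixed rather than random so the result is reproducible.

```python
def hill_climb(values, start):
    """Greedy ascent on a 1D list: move to the better neighbour until none is better."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        best = max(neighbours, key=lambda j: values[j])
        if values[best] <= values[i]:
            return i          # no neighbour improves: a local (maybe global) peak
        i = best

landscape = [1, 3, 2, 0, 4, 9, 5]   # local peak at index 1, global peak at index 5

single = hill_climb(landscape, start=0)
restarts = [hill_climb(landscape, s) for s in (0, 3, 6)]
best = max(restarts, key=lambda i: landscape[i])

print("single run from 0 stops at index", single, "value", landscape[single])
print("restart endpoints:", restarts)
print("best over restarts: index", best, "value", landscape[best])
```

The climber only ever reads `values[i - 1]` and `values[i + 1]`, so from index 0 it cannot know that a higher peak exists four cells away. Restarts do not make any single run smarter; they just sample more basins.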

🪤 Common misconceptions

"SGD builds the loss surface from training samples"

No. SGD evaluates the loss at exactly one point (the current θ) on a mini-batch, computes the gradient at that point, and takes a step. It never accumulates a model of the surface. The training trajectory through the surface is what we sometimes visualize, but the surface itself is the mathematical object defined by the dataset and the model.
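
The separation between "the surface" and "what SGD touches" can be made explicit in code. The sketch below uses an assumed synthetic dataset (y = 2x + 1 plus small noise) and a two-parameter linear model; the names `full_loss` and `minibatch_grad`, the batch size, and the learning rate are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset (assumed for illustration): y = 2x + 1 plus small noise
X = rng.uniform(-1.0, 1.0, size=200)
Y = 2.0 * X + 1.0 + 0.05 * rng.normal(size=200)

def full_loss(theta):
    """The surface itself: fixed by (X, Y) and the model, no optimizer involved."""
    w, b = theta
    return np.mean((w * X + b - Y) ** 2)

def minibatch_grad(theta, idx):
    """MSE gradient on one mini-batch, evaluated only at the current theta."""
    w, b = theta
    err = w * X[idx] + b - Y[idx]
    return np.array([2.0 * np.mean(err * X[idx]), 2.0 * np.mean(err)])

theta = np.zeros(2)
for step in range(500):
    idx = rng.choice(len(X), size=16, replace=False)
    theta -= 0.1 * minibatch_grad(theta, idx)   # theta is the only state SGD keeps

print("final theta (w, b):", theta)
print("full-dataset loss there:", full_loss(theta))
```

`full_loss` is defined before the loop runs and is never called inside it. The loop only asks for a noisy local slope at the current `theta`, which is the sense in which SGD probes the surface rather than building it.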

"The loss is 3D because I see 3D pictures of it"

The pictures are 2D slices through high-D surfaces. Real neural net losses live in millions to billions of dimensions.

"Local minima are why deep learning is hard"

In high dimensions, saddle points dominate. Momentum and adaptive optimizers such as Adam help the optimizer move through saddles and flat regions faster; they are not mechanisms for escaping genuine local minima.

"More parameters → more local minima → harder to optimize"

Counterintuitively, wider networks often have easier loss landscapes — over-parameterization tends to flatten and connect basins (lottery ticket / mode connectivity literature).
