Simulated Annealing

Simulated annealing is an optimization technique used to find an approximate solution to the minimization or maximization problem of a given function. It is inspired by the process of annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects. The algorithm mimics this process by heating and then slowly cooling a system to adjust its parameters and minimize a cost function. The key components of simulated annealing include the temperature parameter, which gradually decreases according to a cooling schedule, and the probability of accepting worse solutions as the system cools, allowing it to escape local minima and potentially find a global minimum.

Simulated annealing is similar to random exploration for high temperatures and later similar to First-Choice-Hill-Climbing (see Hill Climbing for the underlying mechanism SA extends).

The acceptance probability (Boltzmann rule) ⭐

This is the heart of SA:

For a proposed move from the current state to a neighbour, let Δ = f(neighbour) − f(current).

In plain text: P(accept) = exp(−Δ / T) for a worsening move.

SymbolMeaning
Δchange in objective value, f(neighbour) − f(current)
Tcurrent temperature (controlled by the cooling schedule)
  • This note minimizes (f = cost), so a worse move has Δ > 0, and exp(−Δ/T) is the accept-probability — exactly the np.exp(-delta / T) in the code below.
  • High T → exponent near 0 → P ≈ 1: almost any worse move is accepted (exploration, escape local minima).
  • Low T → exponent very negative → P ≈ 0: only improvements survive (exploitation, behaves like Hill Climbing).
  • Improvements are always accepted (Δ ≤ 0 ⇒ exp(−Δ/T) ≥ 1, capped at probability 1).

⚠️ Sign convention — don’t get tripped up. If a source maximizes instead (e.g. the Methods-of-AI quiz), it writes P = exp(Δ/T) with Δ = f(s') − f(s) < 0 for a worse move. Same rule, flipped sign — because there a worse move lowers f. Minimization → exp(−Δ/T); maximization → exp(Δ/T). Always check which one the slide/exercise uses.

Minimal SA — finds global min where Hill Climbing gets stuck

The function f(x) = x² + 10·sin(x) has:

  • Global minimum at x ≈ −1.31, f ≈ −7.95
  • Local minimum at x ≈ 3.85, f ≈ +8.18
  • Local maximum in between at x ≈ 2, f ≈ +13 (the “barrier”)

Starting at x0 = 3.0 puts the algorithm in the basin of attraction of the local min. Hill Climbing from there slides right to x ≈ 3.85 and gets trapped — to escape it would need to climb a barrier of height ≈ 4.9 (from f = 8.18 up to f = 13). SA can do this via the Boltzmann acceptance rule; HC never will.

⚠️ Important: the trap only exists if you start in the wrong basin. If x0 = 0.0, the slope at x = 0 is positive (f′(0) = 10), so even plain Hill Climbing slides LEFT and finds the global min directly — no trap, no demonstration. The starting point matters.

What to notice in the plots:

Panel 1 — Trajectory over the landscape:

  • Dark points (early) are scattered everywhere — high T = pure exploration. The algorithm visits both basins (local min near x ≈ 2.6 and global min near x ≈ −1.31).
  • Bright (yellow) points (late) cluster tightly around the global minimum — low T = exploitation, the search has “frozen”.
  • The dashed red line marks the final best x.
  • Why subsample (step = 5): without it, 1000 points pile up on the same x positions and you only see whatever color was drawn last. With every 5th point + late-on-top z-order, the dense cluster appears yellow (late) and the sparse exploration appears purple (early) — both colors visible at the same time.

Panel 2 — x(t):

  • The clearest view of exploration → exploitation. Early iterations: x bounces wildly across the full range [−4, 4]. After ~iter 200–400, x snaps to the global min and stays there.
  • The gray dotted line at x = 3.85 shows the local minimum trap Hill Climbing would fall into. SA visits it early but escapes across the barrier at x ≈ 2 (where f ≈ 13).

Panel 3 — Convergence vs. temperature:

  • Blue noisy line = f(current). At high T it jumps wildly (worse moves accepted). As T drops, jumps shrink.
  • Red monotone line = f(best so far). Can only go down. The big drop typically happens early-middle, when T is still warm enough to escape the local min but cold enough to keep good finds.
  • Orange dashed line = T(t). Pure exponential decay — T ← T · 0.995.

Try these one-line changes and re-run (with x0 = 3.0 so the trap is real):

ChangeWhat you should seeWhy
T = 0.1 (low start temperature)Stuck at x ≈ 3.85. Best f stays at ~8.18, never approaches −7.95.exp(−4.9 / 0.1) ≈ 10⁻²¹ — the barrier crossing is essentially impossible. SA becomes pure Hill Climbing from step 1.
cooling = 0.9 (fast cooling)Often stuck at x ≈ 3.85. T drops to 1 by iter ~22 and to 0.1 by iter ~44 → too little time at warm T.Slow cooling is the whole point of annealing. Cool too fast and you freeze before escaping.
step_size = 5.0 (huge neighborhood)Surprisingly often finds the global min — but not because annealing works.With std = 5, one random jump can fly across the entire domain in a single step → SA degenerates into uniform random search. It “succeeds” by luck, not by the temperature mechanism.
x0 = 0.0 (good starting point)Finds global min trivially even with terrible params.x = 0 is already in the global min’s basin — plain HC slides left to it. No trap to escape.

The takeaway: all four “knobs” matter, but they fail in different ways. T and cooling break the escape mechanism; step_size breaks the locality assumption; x0 decides whether there’s even a problem to solve.

Why the mechanism works (read off the plots):

  • exp(-delta / T) is the whole trick: when T is high, even bad moves get accepted (escape local minima); when T is low, only improvements survive (exploitation).
  • cooling = 0.995 → temperature drops slowly. Faster cooling often gets stuck in the local min.
  • Replace random.gauss neighbor generation with a discrete neighbor function and you have SA for combinatorial problems (e.g. 8-Queens, TSP).

Run 1 — baseline trap. Settings: x0=3.0, T=10.0, cooling=0.995, steps=2000, step_size=0.5

Run 2 — clean success. Settings: x0=3.0, T=20.0, cooling=0.995, steps=2000, step_size=0.5

Run 3 — fast-cooling fail. Settings: x0=3.0, T=20.0, cooling=0.9, steps=2000, step_size=0.5

Run 4 — too-large step. Settings: x0=3.0, T=20.0, cooling=0.995, steps=2000, step_size=5.0

Evaluation — what these 4 runs prove

PlotT₀coolingstepResultVerdict
1100.9950.5x = 0.81, f ≈ 7.90Trapped (just barely)
2200.9950.5x = −1.31, f ≈ −7.95Real success — textbook annealing
3200.90.5x = 3.84, f ≈ 8.18❌ Frozen — classic “cool too fast” failure
4200.9955.0x = −1.29, f ≈ −7.95⚠️ Succeeds — but for the wrong reason

Plot 1 — the “almost made it” trap. T₀ = 10 isn’t enough: SA crosses the peak once (best_x = 0.81 is already past the peak at x ≈ 2 coming from the right!) but T drops faster than x can descend into the global basin. Panel 2 shows it clearly — all 1000 iterations stay between x = 1 and x = 5.

Plot 2 — the clean demo. Doubling the starting temperature (T₀ = 20) gives the algorithm enough “energy budget” to cross the barrier twice — once over, once back, until it settles in the global basin. The Panel 2 dark→bright transition shows the wandering from upper region (x > 0) down to x ≈ −1.31. This is the textbook trajectory.

Plot 3 — the most honest failure picture. Look at Panel 3: the orange T-curve drops to ~0 in ~20 iterations. After that, SA = Hill Climbing = stuck in the nearest local min. In Panel 2 you see only a few scattered early purple points between x = 1.5 and 4, then nothing but a flat plateau at x = 3.84. Exactly as predicted.

Plot 4 — “success by accident”. Note the outliers at x ≈ −8 (f ≈ 58!) and x ≈ 5.5 — those are single jumps that step = 5 makes possible. The algorithm finds the global min not through gradual climbing, but because a single random move from x = 3 to x ≈ −2 is possible in one step. Panel 2 confirms it: x is scattered across the entire [−8, 6] range, almost no local movement. This is Monte Carlo search, not annealing anymore.

Three takeaways for the exam

  1. T₀ matters. Plot 1 vs. Plot 2: same function, same starting point, same cooling rate — only doubled initial temperature. Success vs. failure.
  2. Cooling matters more than T₀. Plot 3: even with T₀ = 20, cooling = 0.9 is too aggressive. The heat evaporates before it can do work.
  3. Step size distorts the diagnosis. Plot 4 looks like “SA works” — but it doesn’t. If your tests are meant to prove that SA learns to escape, you have to keep step_size small. Otherwise the algorithm dances over the whole landscape and finds the min by random walk.

Bottom line: Plot 2 is the textbook version. The other three each demonstrate one failure mode — and those are exactly the ones to remember for the exam when asked “when does SA fail?”

Where SA is still used today

Despite its age (Kirkpatrick, Gelatt & Vecchi, 1983), SA is still the default tool in many industries where the search landscape is rugged and no gradients exist.

  • VLSI chip design — the killer application SA was originally invented for. EDA tools (Cadence, Synopsys) still use SA for placement & routing of chips with billions of transistors.
  • Protein structure prediction — Rosetta@home and similar projects use Monte Carlo with SA-like acceptance to sample protein conformations.
  • Quantum annealing (D-Wave) — D-Wave’s quantum computers are physical SA: qubit states evolve through a cooling schedule and find the ground state of an Ising Hamiltonian. Volkswagen demonstrated traffic-flow optimization in Beijing with D-Wave (2017); also portfolio optimization and protein folding.
  • Vehicle routing & logistics — TSP variants for delivery scheduling (UPS, DHL) use SA + restart hybrids.
  • Medical image registration — aligning CT/MRI scans across patients or time when gradient methods get stuck in local minima.
  • Compiler optimization — register allocation and instruction scheduling in some compiler backends use SA-style local search.
  • Hyperparameter tuning — scikit-optimize and some AutoML systems offer SA as an optimizer for high-dimensional search spaces.

Why SA survives: when (a) the objective function is non-differentiable or has many local minima, (b) you can only evaluate but not compute gradients, and (c) you need any reasonable solution quickly — SA is the “no-math-required” baseline that often beats more sophisticated methods.

Where SA was replaced — and by what

DomainWas SA, now …Why
Neural network training (early Boltzmann Machines, 1980s)Backpropagation + SGD (Rumelhart, Hinton 1986)Once gradients are available, any gradient-based method beats SA by orders of magnitude — the information in the gradient is invaluable
Hyperparameter tuning for ML modelsBayesian Optimization (GPs, TPE) → today often OptunaBayesOpt uses a probabilistic model of the objective and explores informedly rather than randomly — 10–100× more sample-efficient on expensive evaluations
Pure combinatorial optimization (ILP-shaped)Gurobi / CPLEX (Branch-and-Cut)When the problem can be formulated as an Integer Linear Program, commercial MILP solvers with cutting planes + branch-and-bound dominate
Bayesian inference (sampling)MCMC variants: HMC, NUTS (PyMC, Stan)SA was historically used as a sampler, but HMC/NUTS use the gradient of the log-posterior and mix far faster
TSP-style problems at small instancesConcorde TSP solver (Branch-and-Cut)Concorde solves TSPs up to ~85,000 cities optimally — SA only gives approximations

Where SA still stands alone: rugged, non-convex, no gradient, black-box evaluation. Exactly the corner where BayesOpt is too expensive and ILP formulation is impossible.

see also

Type:
Tags:
Status:
Location:
Created: 05-11-24 16:15
611 📠Machine Learning
Loss Surface · Hill Climbing · Temperature in neighbour selection · Genetic Algorithms · Search Algorithms
lernzettel_local-search_30-04-26

Source