Local Beam Search maintains k states simultaneously instead of one. At each step, it generates all successors of all k current states, then keeps the best k overall. This pools information across the k beams — promising regions get more search effort, dead-end beams die out.
It looks like “k parallel Hill Climbs” but is fundamentally different: states share information, so beams concentrate on promising regions. This is also its weakness — beams can collapse onto a single region, losing diversity.
🔦 What IS a “beam” — and why do we need k of them?
A "beam" = one candidate solution being tracked.
Think of a beam as a flashlight pointed at the search space. Hill Climbing has 1 flashlight (one current state). Local Beam Search has k flashlights at once, all moving in parallel.
Why k > 1? A single flashlight (Hill Climbing) gets stuck the moment it walks into a local optimum — there’s nothing better in its immediate neighborhood, so it stops. With k flashlights spread across the landscape, the chance that at least one lands in the global optimum’s basin is dramatically higher.
The concrete object: A beam is just a value — for our example f(x) = x² + 10·sin(x), each beam is a single number x. For k=4, we track 4 such numbers. For an 8-queens problem, each beam would be a full board configuration (a state). For LLM text generation, each beam is a partial sentence.
🎯 How are the next k beams chosen? — the central mechanism
This is the part that confuses people. Here is the exact procedure for one iteration, with k=4 and b=5 (each beam generates 5 successors):
Step 1 — Generate candidates.
Each of the 4 current beams produces b=5 successors (small random perturbations).
→ Total candidate pool: 4 × 5 = 20 candidates.
👉 HOW exactly is a child generated? (the neighborhood function)
A “child” is whatever the neighborhood function of the problem returns — LBS itself doesn’t care about the mechanism. Common choices:
56 possible children per state (8 cols × 7 alt. rows)
TSP
Swap two cities, or reverse a segment (2-opt)
up to n·(n−1)/2 children
Boolean SAT
Flip the value of one variable
exactly n children for n variables
LLM decoding
Append one possible next token
vocab_size children (~50k for GPT-2)
In our code:
def neighbors(x, step=STEP_SIZE, n=B): return [x + random.gauss(0, step) for _ in range(n)]
Each call returns n=B=5 children, each = parent + Gaussian noise with std step_size. So:
STEP_SIZE=0.1 → children very close to parent (refine locally)
STEP_SIZE=2.0 → children scattered widely (explore broadly)
This is the single knob that controls exploration vs. exploitation in our toy example. In real problems it might be the mutation rate, the temperature, or the structure of the move set — but the role is the same.
Step 2 — Pool them ALL together.
Throw all 20 candidates into a single bucket. Forget which parent produced which child. They’re all equal contenders now.
Step 3 — Rank by fitness.
Sort all 20 candidates by f-value (smallest = best, since we’re minimizing).
Step 4 — Keep the top k.
Take the 4 best out of all 20. These become the new beams for the next iteration. The other 16 are discarded.
The single most important consequence
Because all 20 candidates compete in one global ranking, a single “good” parent can have all 4 of its children chosen — wiping out the other 3 lineages entirely. This is the “rich-get-richer” effect and the source of LBS’s diversity-collapse problem.
Example: If Parent A is near the global min (f ≈ −7) and Parents B, C, D are higher up (f ≈ 10, 15, 20), Parent A’s 5 children will likely all have lower f-values than any of B/C/D’s children → all 4 survivors come from Parent A → B, C, D’s lineages are extinct.
Visual cue in the step-by-step plot below:
Each candidate dot is colored by its parent.
The 4 survivors get a black ring.
Count the colors inside the rings — if all 4 are the same color, that parent monopolized the next generation. That’s the collapse mechanism in action.
The algorithm
1. INITIALIZE — start with k random states
2. EXPAND — generate ALL successors of ALL k current states
3. SELECT — pick the best k from the combined successor pool
4. CHECK — if any state is a goal → return it
if no improvement possible → terminate
5. REPEAT
Two key differences from k independent Hill Climbs:
Step 3 pools successors across all beams — winning states from one beam can “infect” another
A high-fitness region naturally attracts more beams over time → rich get richer
⚠️ Beam Search vs. Parallel Hill Climbing — the classic exam trap
This is one of the most-asked exam comparisons:
Parallel Hill Climbing
Local Beam Search
k states
✓
✓
Each step
Each state picks ITS OWN best successor
All k·b successors pooled, top k kept
Information sharing between beams
❌ — independent searches
✓ — globally best states win
Failure mode
Each beam stuck in its own local optimum
All beams collapse into one region (diversity collapse)
⚠️ Pure Local Beam Search often degenerates into “k copies of one state” because as soon as one beam finds a hot region, all k beams flock there. The fix is Stochastic Beam Search.
Stochastic Beam Search
Instead of deterministically picking the top k successors, sample k of them with probability proportional to fitness. This keeps diversity — weaker beams sometimes survive, preserving exploration.
It’s essentially the same trick as fitness-proportional selection in Genetic Algorithms — population-based search relies on stochastic selection to avoid premature convergence.
Step-by-step visualization — the selection mechanism made visible
The single thing that confuses people about LBS is how the next k beams get chosen. Here’s a 6-panel walkthrough on f(x) = x² + 10·sin(x). Each panel shows ONE iteration:
Big colored circles = the k current beams
Small gray dots = ALL k·b successors generated this step (the candidate pool)
Highlighted with black ring = the top-k selected for the next iteration
Watch how the pool of small dots gets filtered down to k big circles — and how those k often all come from one or two parents (rich-get-richer effect).
🐍 Code anzeigen / ausblenden
# Pyodide / Obsidian Execute Code: install matplotlib first.# In normal Python (terminal / Jupyter), delete the next 2 lines.import micropipawait micropip.install("matplotlib")import numpy as npimport randomimport matplotlib.pyplot as plt# ════════════════════════════════════════════════════════════════# 🎛️ Change these → re-run to see how the mechanism reacts# ════════════════════════════════════════════════════════════════K = 4 # number of beams (try 2, 4, 8) — keep ≤ 10 for the color paletteB = 5 # successors per beam (try 3, 5, 10) — pool size = K·BSTEP_SIZE = 0.5 # neighborhood radius (try 0.2 vs 1.5)N_PANELS = 6 # how many iterations to plot (snapshots = N_PANELS)SEED = 0 # change for a different random start# ════════════════════════════════════════════════════════════════def f(x): return x**2 + 10 * np.sin(x)def neighbors(x, step=STEP_SIZE, n=B): """Each beam generates n random neighbors.""" return [x + random.gauss(0, step) for _ in range(n)]# --- Run LBS and record the selection at every step ---random.seed(SEED)beams = [random.uniform(-5, 5) for _ in range(K)]# Each snapshot: (parents_before_step, list_of_(parent_idx, child_x), selected_top_k)snapshots = [(list(beams), [], [])]for _ in range(N_PANELS - 1): pairs = [(p_idx, child) for p_idx, parent in enumerate(beams) for child in neighbors(parent)] succ_xs = [c for _, c in pairs] sorted_succs = sorted(succ_xs, key=f) selected = sorted_succs[:K] snapshots.append((list(beams), pairs, selected)) beams = selected# --- Plot N_PANELS panels in a 2-row grid (generous spacing) ---n_cols = (N_PANELS + 1) // 2fig, axes = plt.subplots(2, n_cols, figsize=(4.2 * n_cols, 9))axes = axes.flatten()x_grid = np.linspace(-5, 5, 400)palette = plt.cm.tab10(np.linspace(0, 1, max(K, 4)))for i, (current_beams, pairs, selected) in enumerate(snapshots[:N_PANELS]): ax = axes[i] ax.plot(x_grid, [f(x) for x in x_grid], color='lightgray', lw=2.2, zorder=1) # Parent→child lines (color = parent's color) — both ends ON the curve for p_idx, child_x in pairs: parent_x = current_beams[p_idx] ax.plot([parent_x, child_x], [f(parent_x), f(child_x)], color=palette[p_idx], alpha=0.35, lw=1.0, zorder=2) # Candidates ON the curve — colored by parent (medium dots) for p_idx, child_x in pairs: ax.scatter(child_x, f(child_x), s=70, color=palette[p_idx], edgecolor='black', linewidth=0.6, alpha=0.85, zorder=4) # Parents (colored squares) ON the curve for b, c in zip(current_beams, palette): ax.scatter(b, f(b), s=230, color=c, marker='s', edgecolor='black', linewidth=1.8, zorder=6, alpha=0.95) # Survivors → BIG hollow ring around them (visually encloses, doesn't obscure) for s in selected: ax.scatter(s, f(s), s=520, facecolor='none', edgecolor='black', linewidth=2.6, zorder=5) ax.axvline(-1.31, color='green', ls=':', alpha=0.5) ax.axvline(3.85, color='red', ls=':', alpha=0.5) n_cand = len(pairs) title = f'iter {i} — initial {K} beams' if not pairs else \ f'iter {i} — {n_cand} candidates → top {K} survive' ax.set_title(title, fontsize=11) ax.set_ylim(-14, 32); ax.set_xlim(-5.5, 5.5) ax.set_xticks([]); ax.set_yticks([]) ax.set_xlabel('x', fontsize=8); ax.set_ylabel('f(x)', fontsize=8)# Legend on first panelaxes[0].text(-5.2, 30, f'■ square = parent beam ({K} total)\n' f'● dot = candidate (colored by parent, {K*B} total)\n' f'○ ring = selected survivor (top {K})\n' f'lines = parent → its children', fontsize=8, va='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.92))plt.suptitle(f'Local Beam Search — step-by-step (K={K} beams, B={B} successors → {K*B} candidates per iter)\n' f'Count colors inside the rings: if all {K} survivors share 1 color, that parent monopolized the next gen.', fontsize=12)plt.tight_layout(); plt.subplots_adjust(top=0.88, hspace=0.30, wspace=0.18); plt.show()
What to watch for:
iter 0: 4 colored beams scattered randomly
iter 1: 20 small gray dots appear (each beam shot 5 random neighbors). The 4 with black rings are the survivors — they go on as the colored beams in iter 2.
iter 2-3: Notice that often ALL 4 survivors come from just 1-2 parents — the parents in the good region produced the best children, so their lineage dominates.
iter 4-5: All 4 beams clustered tightly in one region. Diversity has collapsed.
This is the central mechanism: at each step, all parents’ successors compete globally, not within-parent. So a beam in a bad region whose children are bad → that beam’s lineage dies out. A beam in a great region whose children are great → that beam’s lineage gets multiplied.
Direct comparison: Local Beam Search vs. Parallel Hill Climbing
Same starting points, same neighborhood function, only the selection rule differs:
🐍 Code anzeigen / ausblenden
import randomimport matplotlib.pyplot as pltimport numpy as npdef f(x): return x**2 + 10 * np.sin(x)def neighbors(x, step=0.5, n=5): return [x + random.gauss(0, step) for _ in range(n)]def run_parallel_hc(beams, steps=15): """Each beam picks its OWN best successor — no sharing.""" history = [list(beams)] for _ in range(steps): next_beams = [min(neighbors(b), key=f) for b in beams] # Only move if it's an improvement (HC rule) next_beams = [n if f(n) < f(b) else b for n, b in zip(next_beams, beams)] beams = next_beams history.append(list(beams)) return historydef run_local_beam_search(beams, k, steps=15): """All successors pooled — top-k overall survive.""" history = [list(beams)] for _ in range(steps): all_succs = [child for parent in beams for child in neighbors(parent)] beams = sorted(all_succs, key=f)[:k] history.append(list(beams)) return historyrandom.seed(42)initial = [-3.5, -1.0, 2.5, 4.5] # 4 starting pointsh_par = run_parallel_hc(initial.copy())random.seed(42) # same RNG for fair compareh_lbs = run_local_beam_search(initial.copy(), k=4)# --- Plot side by side ---fig, axes = plt.subplots(1, 2, figsize=(14, 5))x_grid = np.linspace(-5, 5, 400)for ax, history, title in [(axes[0], h_par, 'Parallel Hill Climbing\n(each beam independent)'), (axes[1], h_lbs, 'Local Beam Search\n(all successors pooled, top-k survives)')]: ax.plot(x_grid, [f(x) for x in x_grid], color='lightgray', lw=2.5) # Plot each beam's trajectory as a line of points colored by iteration history_arr = np.array(history) iters = np.arange(len(history_arr)) for beam_idx in range(history_arr.shape[1]): xs = history_arr[:, beam_idx] ax.scatter(xs, [f(x) for x in xs], c=iters, cmap='viridis', s=60, alpha=0.7, edgecolor='none') ax.axvline(-1.31, color='green', ls=':', alpha=0.5, label='global min') ax.axvline(3.85, color='red', ls=':', alpha=0.5, label='local min') ax.set_title(title); ax.legend() ax.set_xlabel('x'); ax.set_ylabel('f(x)')plt.tight_layout(); plt.show()
The decisive contrast:
Parallel HC (left): 4 trajectories stay independent. Each beam descends into whichever basin it started in — 2 beams end at the global min, 2 end at the local min. No information transfer.
LBS (right): 4 trajectories collapse onto one region — typically the one that produced the best successors earliest. You might end up with all 4 beams in the global min (good!) — or all 4 stuck in the local min (catastrophic!).
This is exactly why Stochastic Beam Search exists — without stochasticity in selection, LBS gambles its diversity away.
See it in code (original convergence-tracking view)
🐍 Code anzeigen / ausblenden
# Pyodide / Obsidian Execute Code: install matplotlib first.# In normal Python (terminal / Jupyter), delete the next 2 lines.import micropipawait micropip.install("matplotlib")import numpy as npimport randomimport matplotlib.pyplot as plt# The same f(x) we used for HC + SAdef f(x): return x**2 + 10 * np.sin(x)def neighbors(x, step=0.3, n=20): return [x + random.gauss(0, step) for _ in range(n)]def local_beam_search(f, k=5, max_iter=100, stochastic=False): states = [random.uniform(-5, 5) for _ in range(k)] history = [list(states)] for _ in range(max_iter): # Expand: all successors of all states succs = [(s, f(s)) for st in states for s in neighbors(st)] if stochastic: # Stochastic: sample k with prob ∝ 1/f (we're minimizing → smaller is better) weights = [1 / (s[1] + 100) for s in succs] # +100 to keep positive picks = random.choices(succs, weights=weights, k=k) states = [p[0] for p in picks] else: # Deterministic: top k by fitness (minimization → smallest f) states = [s for s, _ in sorted(succs, key=lambda x: x[1])[:k]] history.append(list(states)) return history# --- Compare: deterministic vs stochastic beam search ---random.seed(42)h_det = local_beam_search(f, k=5, stochastic=False)random.seed(42)h_stoc = local_beam_search(f, k=5, stochastic=True)# --- Diversity metric: spread of the k states ---def spread(states): return max(states) - min(states)div_det = [spread(s) for s in h_det]div_stoc = [spread(s) for s in h_stoc]# --- Plot ---fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))# Left: positions of the k beams over time (deterministic — they all clump)for i in range(5): axes[0].plot([h[i] for h in h_det], color='steelblue', alpha=0.4, lw=1)for i in range(5): axes[0].plot([h[i] for h in h_stoc], color='red', alpha=0.4, lw=1)axes[0].set(xlabel='iteration', ylabel='x position of k beams', title='Beam positions over time\n(blue=deterministic, red=stochastic)')axes[0].axhline(-1.31, color='green', ls=':', alpha=0.5, label='global min')axes[0].axhline(3.85, color='gray', ls=':', alpha=0.5, label='local min (trap)')axes[0].legend()# Right: diversity (spread) over timeaxes[1].plot(div_det, color='steelblue', lw=2, label='deterministic beam')axes[1].plot(div_stoc, color='red', lw=2, label='stochastic beam')axes[1].set(xlabel='iteration', ylabel='spread (max x − min x) of k beams', title='Diversity collapse — deterministic vs. stochastic beam')axes[1].legend(); axes[1].grid(alpha=0.3)plt.tight_layout(); plt.show()
What to see:
Left panel: deterministic beams (blue) quickly collapse onto a single region — within 10 iterations all 5 beams are within 0.1 of each other. Stochastic beams (red) maintain spread for much longer.
Right panel: spread of deterministic beam drops to ~0 within 5–10 iterations (diversity collapse). Stochastic beam keeps spread > 1 for much of the run.
Trade-off: deterministic converges faster on the best basin it finds first; stochastic explores longer but is more likely to find the global optimum.
Variants
Variant
Twist
When to use
Local Beam Search
Top k chosen deterministically
Convex-ish landscapes; quick convergence
Stochastic Beam Search
Top k chosen probabilistically
Rugged landscapes; preserve exploration
Beam Search (classical, with paths)
Same idea but keeps full paths, not just states
Sequence generation (decoding)
Diverse Beam Search
Penalize beams that are too similar
LLM decoding when you want variety
Where (variants of) Beam Search are used today
The “states + top-k” pattern is everywhere in modern AI:
LLM decoding — when GPT/Claude/Gemini generates text, they use beam search (or its successors) to find probable token sequences. Output diversity comes from temperature sampling, top-p, top-k, etc.
Speech recognition — Whisper and other ASR systems use beam search to decode token sequences from audio
Machine translation — neural translators (early Transformer NMT) use beam search to pick output sentences
Image captioning — beam search over caption sequences
Code generation (Copilot, CodeLLaMA) — beam search variants when greedy decoding is too brittle
Sequence labeling — chunking, POS tagging in NLP
Robotic motion planning — variants of beam search to maintain k candidate trajectories
Where Beam Search was challenged — and by what
Domain
Was beam search, now …
Why
LLM text generation
Nucleus sampling (top-p), temperature
Beam search produces repetitive, “safe” text; sampling is more creative
Continuous optimization
Simulated Annealing, Bayesian Optimization
When the landscape is smooth, gradient-based or model-based methods beat beam expansion
Combinatorial NP-hard
MILP solvers, branch-and-bound
Exhaustive techniques with bounds beat fixed-width beams
Game tree search
Monte Carlo Tree Search (MCTS)
MCTS adaptively allocates rollouts; beam search wastes budget on bad lines
Where beam search still shines: sequence decoding when you want diverse but high-probability outputs, and as a baseline for any “best-k of many candidates” problem.
See also
Loss Surface — the landscape LBS’s k beams traverse; how diversity collapse looks in surface terms