Local Beam Search

methods-of-ai

Local Beam Search maintains k states simultaneously instead of one. At each step, it generates all successors of all k current states, then keeps the best k overall. This pools information across the k beams — promising regions get more search effort, dead-end beams die out.

It looks like “k parallel Hill Climbs” but is fundamentally different: states share information, so beams concentrate on promising regions. This is also its weakness — beams can collapse onto a single region, losing diversity.

🔦 What IS a “beam” — and why do we need k of them?

A "beam" = one candidate solution being tracked.

Think of a beam as a flashlight pointed at the search space. Hill Climbing has 1 flashlight (one current state). Local Beam Search has k flashlights at once, all moving in parallel.

Why k > 1? A single flashlight (Hill Climbing) gets stuck the moment it walks into a local optimum — there’s nothing better in its immediate neighborhood, so it stops. With k flashlights spread across the landscape, the chance that at least one lands in the global optimum’s basin is dramatically higher.

The concrete object: A beam is just a value — for our example f(x) = x² + 10·sin(x), each beam is a single number x. For k=4, we track 4 such numbers. For an 8-queens problem, each beam would be a full board configuration (a state). For LLM text generation, each beam is a partial sentence.

🎯 How are the next k beams chosen? — the central mechanism

This is the part that confuses people. Here is the exact procedure for one iteration, with k=4 and b=5 (each beam generates 5 successors):

Step 1 — Generate candidates.
Each of the 4 current beams produces b=5 successors (small random perturbations).
→ Total candidate pool: 4 × 5 = 20 candidates.

👉 HOW exactly is a child generated? (the neighborhood function)
A “child” is whatever the neighborhood function of the problem returns — LBS itself doesn’t care about the mechanism. Common choices:

Problem type How children are generated Example
Continuous (our f(x)) child = parent + N(0, step_size) (Gaussian noise) parent=2.0, step=0.5 → child might be 2.3, 1.7, 2.8, 1.5, 2.1
Continuous (alt.) child = parent + uniform(−step, +step) uniform jitter instead of Gaussian
8-Queens Move one queen to a different row in its column 56 possible children per state (8 cols × 7 alt. rows)
TSP Swap two cities, or reverse a segment (2-opt) up to n·(n−1)/2 children
Boolean SAT Flip the value of one variable exactly n children for n variables
LLM decoding Append one possible next token vocab_size children (~50k for GPT-2)

In our code:
def neighbors(x, step=STEP_SIZE, n=B):
    return [x + random.gauss(0, step) for _ in range(n)]
Each call returns n=B=5 children, each = parent + Gaussian noise with std step_size. So:

STEP_SIZE=0.1 → children very close to parent (refine locally)

STEP_SIZE=2.0 → children scattered widely (explore broadly)

This is the single knob that controls exploration vs. exploitation in our toy example. In real problems it might be the mutation rate, the temperature, or the structure of the move set — but the role is the same.

Problem type	How children are generated	Example
Continuous (our `f(x)`)	`child = parent + N(0, step_size)` (Gaussian noise)	`parent=2.0, step=0.5` → child might be `2.3, 1.7, 2.8, 1.5, 2.1`
Continuous (alt.)	`child = parent + uniform(−step, +step)`	uniform jitter instead of Gaussian
8-Queens	Move one queen to a different row in its column	56 possible children per state (8 cols × 7 alt. rows)
TSP	Swap two cities, or reverse a segment (2-opt)	up to `n·(n−1)/2` children
Boolean SAT	Flip the value of one variable	exactly `n` children for n variables
LLM decoding	Append one possible next token	`vocab_size` children (~50k for GPT-2)

Step 2 — Pool them ALL together.
Throw all 20 candidates into a single bucket. Forget which parent produced which child. They’re all equal contenders now.

Step 3 — Rank by fitness.
Sort all 20 candidates by f-value (smallest = best, since we’re minimizing).

Step 4 — Keep the top k.
Take the 4 best out of all 20. These become the new beams for the next iteration. The other 16 are discarded.

The single most important consequence

Because all 20 candidates compete in one global ranking, a single “good” parent can have all 4 of its children chosen — wiping out the other 3 lineages entirely. This is the “rich-get-richer” effect and the source of LBS’s diversity-collapse problem.

Example: If Parent A is near the global min (f ≈ −7) and Parents B, C, D are higher up (f ≈ 10, 15, 20), Parent A’s 5 children will likely all have lower f-values than any of B/C/D’s children → all 4 survivors come from Parent A → B, C, D’s lineages are extinct.

Visual cue in the step-by-step plot below:

Each candidate dot is colored by its parent.
The 4 survivors get a black ring.
Count the colors inside the rings — if all 4 are the same color, that parent monopolized the next generation. That’s the collapse mechanism in action.

The algorithm

1. INITIALIZE  — start with k random states
2. EXPAND      — generate ALL successors of ALL k current states
3. SELECT      — pick the best k from the combined successor pool
4. CHECK       — if any state is a goal → return it
                  if no improvement possible → terminate
5. REPEAT

Two key differences from k independent Hill Climbs:

Step 3 pools successors across all beams — winning states from one beam can “infect” another
A high-fitness region naturally attracts more beams over time → rich get richer

⚠️ Beam Search vs. Parallel Hill Climbing — the classic exam trap

This is one of the most-asked exam comparisons:

	Parallel Hill Climbing	Local Beam Search
k states	✓	✓
Each step	Each state picks ITS OWN best successor	All k·b successors pooled, top k kept
Information sharing between beams	❌ — independent searches	✓ — globally best states win
Failure mode	Each beam stuck in its own local optimum	All beams collapse into one region (diversity collapse)

⚠️ Pure Local Beam Search often degenerates into “k copies of one state” because as soon as one beam finds a hot region, all k beams flock there. The fix is Stochastic Beam Search.

Stochastic Beam Search

Instead of deterministically picking the top k successors, sample k of them with probability proportional to fitness. This keeps diversity — weaker beams sometimes survive, preserving exploration.

It’s essentially the same trick as fitness-proportional selection in Genetic Algorithms — population-based search relies on stochastic selection to avoid premature convergence.

Step-by-step visualization — the selection mechanism made visible

The single thing that confuses people about LBS is how the next k beams get chosen. Here’s a 6-panel walkthrough on f(x) = x² + 10·sin(x). Each panel shows ONE iteration:

Big colored circles = the k current beams
Small gray dots = ALL k·b successors generated this step (the candidate pool)
Highlighted with black ring = the top-k selected for the next iteration

Watch how the pool of small dots gets filtered down to k big circles — and how those k often all come from one or two parents (rich-get-richer effect).

🐍 Code anzeigen / ausblenden

# Pyodide / Obsidian Execute Code: install matplotlib first.
# In normal Python (terminal / Jupyter), delete the next 2 lines.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import random
import matplotlib.pyplot as plt
 
# ════════════════════════════════════════════════════════════════
# 🎛️  Change these → re-run to see how the mechanism reacts
# ════════════════════════════════════════════════════════════════
K          = 4      # number of beams (try 2, 4, 8) — keep ≤ 10 for the color palette
B          = 5      # successors per beam (try 3, 5, 10) — pool size = K·B
STEP_SIZE  = 0.5    # neighborhood radius (try 0.2 vs 1.5)
N_PANELS   = 6      # how many iterations to plot (snapshots = N_PANELS)
SEED       = 0      # change for a different random start
# ════════════════════════════════════════════════════════════════
 
def f(x):
    return x**2 + 10 * np.sin(x)
 
def neighbors(x, step=STEP_SIZE, n=B):
    """Each beam generates n random neighbors."""
    return [x + random.gauss(0, step) for _ in range(n)]
 
# --- Run LBS and record the selection at every step ---
random.seed(SEED)
beams = [random.uniform(-5, 5) for _ in range(K)]
# Each snapshot: (parents_before_step, list_of_(parent_idx, child_x), selected_top_k)
snapshots = [(list(beams), [], [])]
 
for _ in range(N_PANELS - 1):
    pairs = [(p_idx, child) for p_idx, parent in enumerate(beams) for child in neighbors(parent)]
    succ_xs = [c for _, c in pairs]
    sorted_succs = sorted(succ_xs, key=f)
    selected = sorted_succs[:K]
    snapshots.append((list(beams), pairs, selected))
    beams = selected
 
# --- Plot N_PANELS panels in a 2-row grid (generous spacing) ---
n_cols = (N_PANELS + 1) // 2
fig, axes = plt.subplots(2, n_cols, figsize=(4.2 * n_cols, 9))
axes = axes.flatten()
x_grid = np.linspace(-5, 5, 400)
palette = plt.cm.tab10(np.linspace(0, 1, max(K, 4)))
 
for i, (current_beams, pairs, selected) in enumerate(snapshots[:N_PANELS]):
    ax = axes[i]
    ax.plot(x_grid, [f(x) for x in x_grid], color='lightgray', lw=2.2, zorder=1)
 
    # Parent→child lines (color = parent's color) — both ends ON the curve
    for p_idx, child_x in pairs:
        parent_x = current_beams[p_idx]
        ax.plot([parent_x, child_x], [f(parent_x), f(child_x)],
                color=palette[p_idx], alpha=0.35, lw=1.0, zorder=2)
 
    # Candidates ON the curve — colored by parent (medium dots)
    for p_idx, child_x in pairs:
        ax.scatter(child_x, f(child_x), s=70, color=palette[p_idx],
                   edgecolor='black', linewidth=0.6, alpha=0.85, zorder=4)
 
    # Parents (colored squares) ON the curve
    for b, c in zip(current_beams, palette):
        ax.scatter(b, f(b), s=230, color=c, marker='s', edgecolor='black',
                   linewidth=1.8, zorder=6, alpha=0.95)
 
    # Survivors → BIG hollow ring around them (visually encloses, doesn't obscure)
    for s in selected:
        ax.scatter(s, f(s), s=520, facecolor='none', edgecolor='black',
                   linewidth=2.6, zorder=5)
 
    ax.axvline(-1.31, color='green', ls=':', alpha=0.5)
    ax.axvline(3.85,  color='red',   ls=':', alpha=0.5)
    n_cand = len(pairs)
    title = f'iter {i} — initial {K} beams' if not pairs else \
            f'iter {i} — {n_cand} candidates → top {K} survive'
    ax.set_title(title, fontsize=11)
    ax.set_ylim(-14, 32); ax.set_xlim(-5.5, 5.5)
    ax.set_xticks([]); ax.set_yticks([])
    ax.set_xlabel('x', fontsize=8); ax.set_ylabel('f(x)', fontsize=8)
 
# Legend on first panel
axes[0].text(-5.2, 30,
             f'■ square = parent beam ({K} total)\n'
             f'● dot    = candidate (colored by parent, {K*B} total)\n'
             f'○ ring   = selected survivor (top {K})\n'
             f'lines    = parent → its children',
             fontsize=8, va='top',
             bbox=dict(boxstyle='round', facecolor='white', alpha=0.92))
plt.suptitle(f'Local Beam Search — step-by-step (K={K} beams, B={B} successors → {K*B} candidates per iter)\n'
             f'Count colors inside the rings: if all {K} survivors share 1 color, that parent monopolized the next gen.',
             fontsize=12)
plt.tight_layout(); plt.subplots_adjust(top=0.88, hspace=0.30, wspace=0.18); plt.show()

What to watch for:

iter 0: 4 colored beams scattered randomly
iter 1: 20 small gray dots appear (each beam shot 5 random neighbors). The 4 with black rings are the survivors — they go on as the colored beams in iter 2.
iter 2-3: Notice that often ALL 4 survivors come from just 1-2 parents — the parents in the good region produced the best children, so their lineage dominates.
iter 4-5: All 4 beams clustered tightly in one region. Diversity has collapsed.

This is the central mechanism: at each step, all parents’ successors compete globally, not within-parent. So a beam in a bad region whose children are bad → that beam’s lineage dies out. A beam in a great region whose children are great → that beam’s lineage gets multiplied.

Direct comparison: Local Beam Search vs. Parallel Hill Climbing

Same starting points, same neighborhood function, only the selection rule differs:

🐍 Code anzeigen / ausblenden

import random
import matplotlib.pyplot as plt
import numpy as np
 
def f(x):
    return x**2 + 10 * np.sin(x)
 
def neighbors(x, step=0.5, n=5):
    return [x + random.gauss(0, step) for _ in range(n)]
 
def run_parallel_hc(beams, steps=15):
    """Each beam picks its OWN best successor — no sharing."""
    history = [list(beams)]
    for _ in range(steps):
        next_beams = [min(neighbors(b), key=f) for b in beams]
        # Only move if it's an improvement (HC rule)
        next_beams = [n if f(n) < f(b) else b for n, b in zip(next_beams, beams)]
        beams = next_beams
        history.append(list(beams))
    return history
 
def run_local_beam_search(beams, k, steps=15):
    """All successors pooled — top-k overall survive."""
    history = [list(beams)]
    for _ in range(steps):
        all_succs = [child for parent in beams for child in neighbors(parent)]
        beams = sorted(all_succs, key=f)[:k]
        history.append(list(beams))
    return history
 
random.seed(42)
initial = [-3.5, -1.0, 2.5, 4.5]                              # 4 starting points
 
h_par  = run_parallel_hc(initial.copy())
random.seed(42)                                               # same RNG for fair compare
h_lbs  = run_local_beam_search(initial.copy(), k=4)
 
# --- Plot side by side ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
x_grid = np.linspace(-5, 5, 400)
 
for ax, history, title in [(axes[0], h_par, 'Parallel Hill Climbing\n(each beam independent)'),
                            (axes[1], h_lbs, 'Local Beam Search\n(all successors pooled, top-k survives)')]:
    ax.plot(x_grid, [f(x) for x in x_grid], color='lightgray', lw=2.5)
    # Plot each beam's trajectory as a line of points colored by iteration
    history_arr = np.array(history)
    iters = np.arange(len(history_arr))
    for beam_idx in range(history_arr.shape[1]):
        xs = history_arr[:, beam_idx]
        ax.scatter(xs, [f(x) for x in xs], c=iters, cmap='viridis',
                   s=60, alpha=0.7, edgecolor='none')
    ax.axvline(-1.31, color='green', ls=':', alpha=0.5, label='global min')
    ax.axvline(3.85,  color='red',   ls=':', alpha=0.5, label='local min')
    ax.set_title(title); ax.legend()
    ax.set_xlabel('x'); ax.set_ylabel('f(x)')
 
plt.tight_layout(); plt.show()

The decisive contrast:

Parallel HC (left): 4 trajectories stay independent. Each beam descends into whichever basin it started in — 2 beams end at the global min, 2 end at the local min. No information transfer.
LBS (right): 4 trajectories collapse onto one region — typically the one that produced the best successors earliest. You might end up with all 4 beams in the global min (good!) — or all 4 stuck in the local min (catastrophic!).

This is exactly why Stochastic Beam Search exists — without stochasticity in selection, LBS gambles its diversity away.

See it in code (original convergence-tracking view)

🐍 Code anzeigen / ausblenden

# Pyodide / Obsidian Execute Code: install matplotlib first.
# In normal Python (terminal / Jupyter), delete the next 2 lines.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import random
import matplotlib.pyplot as plt
 
# The same f(x) we used for HC + SA
def f(x):
    return x**2 + 10 * np.sin(x)
 
def neighbors(x, step=0.3, n=20):
    return [x + random.gauss(0, step) for _ in range(n)]
 
def local_beam_search(f, k=5, max_iter=100, stochastic=False):
    states = [random.uniform(-5, 5) for _ in range(k)]
    history = [list(states)]
    for _ in range(max_iter):
        # Expand: all successors of all states
        succs = [(s, f(s)) for st in states for s in neighbors(st)]
        if stochastic:
            # Stochastic: sample k with prob ∝ 1/f (we're minimizing → smaller is better)
            weights = [1 / (s[1] + 100) for s in succs]      # +100 to keep positive
            picks = random.choices(succs, weights=weights, k=k)
            states = [p[0] for p in picks]
        else:
            # Deterministic: top k by fitness (minimization → smallest f)
            states = [s for s, _ in sorted(succs, key=lambda x: x[1])[:k]]
        history.append(list(states))
    return history
 
# --- Compare: deterministic vs stochastic beam search ---
random.seed(42)
h_det  = local_beam_search(f, k=5, stochastic=False)
random.seed(42)
h_stoc = local_beam_search(f, k=5, stochastic=True)
 
# --- Diversity metric: spread of the k states ---
def spread(states):
    return max(states) - min(states)
 
div_det  = [spread(s) for s in h_det]
div_stoc = [spread(s) for s in h_stoc]
 
# --- Plot ---
fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))
 
# Left: positions of the k beams over time (deterministic — they all clump)
for i in range(5):
    axes[0].plot([h[i] for h in h_det], color='steelblue', alpha=0.4, lw=1)
for i in range(5):
    axes[0].plot([h[i] for h in h_stoc], color='red', alpha=0.4, lw=1)
axes[0].set(xlabel='iteration', ylabel='x position of k beams',
            title='Beam positions over time\n(blue=deterministic, red=stochastic)')
axes[0].axhline(-1.31, color='green', ls=':', alpha=0.5, label='global min')
axes[0].axhline(3.85,  color='gray',  ls=':', alpha=0.5, label='local min (trap)')
axes[0].legend()
 
# Right: diversity (spread) over time
axes[1].plot(div_det,  color='steelblue', lw=2, label='deterministic beam')
axes[1].plot(div_stoc, color='red',       lw=2, label='stochastic beam')
axes[1].set(xlabel='iteration', ylabel='spread (max x − min x) of k beams',
            title='Diversity collapse — deterministic vs. stochastic beam')
axes[1].legend(); axes[1].grid(alpha=0.3)
 
plt.tight_layout(); plt.show()

What to see:

Left panel: deterministic beams (blue) quickly collapse onto a single region — within 10 iterations all 5 beams are within 0.1 of each other. Stochastic beams (red) maintain spread for much longer.
Right panel: spread of deterministic beam drops to ~0 within 5–10 iterations (diversity collapse). Stochastic beam keeps spread > 1 for much of the run.
Trade-off: deterministic converges faster on the best basin it finds first; stochastic explores longer but is more likely to find the global optimum.

Variants

Variant	Twist	When to use
Local Beam Search	Top k chosen deterministically	Convex-ish landscapes; quick convergence
Stochastic Beam Search	Top k chosen probabilistically	Rugged landscapes; preserve exploration
Beam Search (classical, with paths)	Same idea but keeps full paths, not just states	Sequence generation (decoding)
Diverse Beam Search	Penalize beams that are too similar	LLM decoding when you want variety

Where (variants of) Beam Search are used today

The “states + top-k” pattern is everywhere in modern AI:

LLM decoding — when GPT/Claude/Gemini generates text, they use beam search (or its successors) to find probable token sequences. Output diversity comes from temperature sampling, top-p, top-k, etc.
Speech recognition — Whisper and other ASR systems use beam search to decode token sequences from audio
Machine translation — neural translators (early Transformer NMT) use beam search to pick output sentences
Image captioning — beam search over caption sequences
Code generation (Copilot, CodeLLaMA) — beam search variants when greedy decoding is too brittle
Sequence labeling — chunking, POS tagging in NLP
Robotic motion planning — variants of beam search to maintain k candidate trajectories

Where Beam Search was challenged — and by what

Domain	Was beam search, now …	Why
LLM text generation	Nucleus sampling (top-p), temperature	Beam search produces repetitive, “safe” text; sampling is more creative
Continuous optimization	Simulated Annealing, Bayesian Optimization	When the landscape is smooth, gradient-based or model-based methods beat beam expansion
Combinatorial NP-hard	MILP solvers, branch-and-bound	Exhaustive techniques with bounds beat fixed-width beams
Game tree search	Monte Carlo Tree Search (MCTS)	MCTS adaptively allocates rollouts; beam search wastes budget on bad lines

Where beam search still shines: sequence decoding when you want diverse but high-probability outputs, and as a baseline for any “best-k of many candidates” problem.

Brain Online

Explorer

Local Beam Search

Local Beam Search

🔦 What IS a “beam” — and why do we need k of them?

🎯 How are the next k beams chosen? — the central mechanism

The algorithm

⚠️ Beam Search vs. Parallel Hill Climbing — the classic exam trap

Stochastic Beam Search

Step-by-step visualization — the selection mechanism made visible

Direct comparison: Local Beam Search vs. Parallel Hill Climbing

See it in code (original convergence-tracking view)

Variants

Where (variants of) Beam Search are used today

Where Beam Search was challenged — and by what

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis