IINemo · smirnovlad · Mar 20, 2026 · Mar 24, 2026 · Mar 24, 2026
diff --git a/docs/tasks/beam_finalization_analysis.md b/docs/tasks/beam_finalization_analysis.md
@@ -0,0 +1,136 @@
+# Analysis: Beam Finalization Logic — `completed_beams or active` vs `completed_beams + active`
+
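+> **Key point:** This issue **only triggers when `max_steps` is reached** while a sample has both completed beams and beams that are still active. If all beams complete naturally before `max_steps` (the common case for math reasoning), a different code path handles finalization correctly and this logic is never executed. If no beam has completed by `max_steps`, `or` falls through to the active beams and behaves exactly like `+`.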
+
+## The Code in Question
+
+```python
+# Line 703 in strategy_beam_search.py
+# This runs AFTER the main loop ends (max_steps reached, sample still in active_samples)
+
+for sample_id in active_samples:
+    active = sample_beams[sample_id]
+    candidates = completed_beams_by_sample[sample_id] or active   # ← THIS LINE
+    best_beam = self._select_best_beam(candidates)
+```
+
+## How Beam Search Works (Two Exit Paths)
+
+### Exit Path 1: All beams complete inside the loop (line 643) — THE COMMON CASE
+A sample's beams are split every step into `completed` and `active` lists. Completed beams are stored in `completed_beams_by_sample` and removed from the active pool. When `active` becomes empty (all beams finished), `_select_best_beam(completed_beams_by_sample[sample_id])` is called — this only considers completed beams. **No issue here. This is how most math experiments work** — beams generate an answer and emit EOS well before `max_steps`.
+
+### Exit Path 2: `max_steps` reached with beams still active (line 699–704) — THE EDGE CASE
+The main `for step_num in range(max_steps)` loop ends. Some samples may still have active beams. **This only happens when `max_steps` is hit before all beams finish** — e.g., short `max_steps` setting, very hard problems, or tasks like Game of 24 where some beams find an answer quickly while others don't. This is where line 703 runs. At this point:
+- `completed_beams_by_sample[sample_id]` = beams that finished early (EOS, `is_trajectory_complete`, etc.)
+- `sample_beams[sample_id]` = beams still active when max_steps was reached (never produced EOS)
+
+## What `or` Does
+
+Python `or` returns its first truthy operand, and an empty list is falsy: `[B1, B2] or [A1, A2, A3]` → `[B1, B2]` (the first non-empty list wins).
+
+| `completed_beams` | `active` | `candidates` (with `or`) |
+|---|---|---|
+| `[]` | `[A1, A2, A3]` | `[A1, A2, A3]` — falls through to active ✅ |
+| `[B1, B2]` | `[A1, A2, A3]` | `[B1, B2]` — **active beams ignored entirely** |
+| `[B1]` | `[]` | `[B1]` ✅ |
+
+## What `+` Would Do
+
+`[B1, B2] + [A1, A2, A3]` → `[B1, B2, A1, A2, A3]` — all beams considered, `_select_best_beam` picks highest aggregated score.
+
+## Concrete Examples
+
+### Example 1: `or` causes suboptimal selection
+
+```
+beam_width=4, max_steps=30
+
+Step 12: Beam B completes early (model emits EOS), score=0.65
+         → moved to completed_beams
+Steps 13–30: Beams A, C, D continue expanding
+         A accumulates score=0.89, C=0.72, D=0.61
+
+At max_steps:
+  completed_beams = [B (0.65)]  ← non-empty
+  active = [A (0.89), C (0.72), D (0.61)]
+
+  With `or`:  candidates = [B (0.65)] → picks B (0.65) ❌
+  With `+`:   candidates = [B, A, C, D] → picks A (0.89) ✅
+```
+
+Here `or` loses a much better beam.
+
+### Example 2: `or` is actually correct
+
+```
+Game of 24, max_steps=5
+
+Step 3: Beam B finds "= 24", completes, score=0.70
+Steps 4–5: Beams A, C generate more steps but never reach "= 24"
+           A has score=0.85 but trajectory is incomplete (no final answer)
+
+At max_steps:
+  completed_beams = [B (0.70, has answer "= 24")]
+  active = [A (0.85, no answer), C (0.60, no answer)]
+
+  With `or`:  candidates = [B (0.70)] → picks B with valid answer ✅
+  With `+`:   candidates = [B, A, C] → picks A (0.85) but A has NO answer ❌
+```
+
+Here `or` correctly prefers the beam that actually solved the problem.
+
+### Example 3: Mixed — both have answers
+
+```
+Step 15: Beam B completes with answer, score=0.65
+Step 30 (max_steps): Beam A (score=0.85) is still "active" but its last step
+                      also contains an answer (answer_pattern detected but
+                      is_trajectory_complete=False due to how the detector works)
+
+  With `or`:  picks B (0.65), misses A, which has a higher-scored answer
+  With `+`:   picks A (0.85), gets the higher-scored answer
+```
+
+## The Core Question
+
+**What does "completed" mean semantically?**
+
+A beam is marked `completed` when `is_trajectory_complete=True` or `is_thinking_complete=True`. This is a **signal from the model** that it considers itself done (emitted EOS or stop token).
+
+An "active" beam at max_steps is one where **the model wanted to keep going but we cut it off**. Its trajectory may or may not contain a usable answer depending on the task.
+
+## My Assessment
+
+**Neither `or` nor `+` is universally correct.** They encode different assumptions:
+
+| Variant | Assumption | Good for | Bad for |
+|---|---|---|---|
+| `or` (current) | Completed beams are inherently better because they contain a full answer | Short tasks (Game of 24), tasks where answer completeness matters | Long math reasoning where one beam finishes early and higher-scoring beams are cut off at max_steps |
+| `+` (proposed) | Score is the best indicator of quality regardless of completion status | Long math reasoning where truncated beams still contain a usable answer | Tasks where active beams have no usable answer |
+
+## Recommendation
+
+The safest fix in principle is **`completed_beams + active` with a completion bonus or an answer check**, rather than plain `or` or plain `+`. The options:
+
+```python
+# Option A: simple — just combine and trust the scores
+candidates = completed_beams_by_sample[sample_id] + active
+
+# Option B: prefer completed but don't ignore active entirely
+# (add a small bonus to completed beams' scores in _select_best_beam)
+
+# Option C: task-aware — check if active beams have answers
+candidates = completed_beams_by_sample[sample_id] + [
+    b for b in active if self._has_answer_content(b["steps"][-1])
+]
+if not candidates:  # implies no completed beams, so fall back to all active beams
+    candidates = active
+```
+
+**For now, I recommend Option A (`+`)** because:
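+1. In practice, math benchmark runs rarely end with a mix of completed and active beams: either every beam completes before max_steps (Exit Path 1) or every beam is still active, and in both cases `or` and `+` select the same beam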
+2. When it does trigger, `+` gives the better result in the common case (Example 1)
+3. For Game of 24 / ToT verification, we should separately validate that completed beams actually have valid answers — that's an evaluator concern, not a beam selection concern
+4. The scorer's job is to assign scores that reflect answer quality — we should trust it
+
+But this should be tested on Game of 24 specifically to confirm.
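A minimal, self-contained sketch of the finalization variants discussed in `beam_finalization_analysis.md`, run on the numbers from Examples 2 and 3. The `Beam` dataclass, its `has_answer` flag, and `select_best_beam` are assumptions made for illustration; they stand in for the real beam dicts and `_select_best_beam`, whose internals are not shown in this diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Beam:
    # Hypothetical beam record; the real strategy stores beams differently.
    name: str
    score: float
    has_answer: bool


def select_best_beam(beams: List[Beam]) -> Beam:
    """Stand-in for _select_best_beam: pick the highest-scoring beam."""
    return max(beams, key=lambda b: b.score)


def finalize_or(completed: List[Beam], active: List[Beam]) -> Beam:
    """Current behavior: active beams are used only if nothing completed."""
    return select_best_beam(completed or active)


def finalize_plus(completed: List[Beam], active: List[Beam]) -> Beam:
    """Option A: pool all beams and trust the scores."""
    return select_best_beam(completed + active)


def finalize_answer_aware(completed: List[Beam], active: List[Beam]) -> Beam:
    """Option C: completed beams plus active beams that contain an answer."""
    candidates = completed + [b for b in active if b.has_answer]
    # An empty candidate list implies no completed beams, so fall back to active.
    return select_best_beam(candidates or active)


# Example 2: only the completed beam has an answer.
ex2_completed = [Beam("B", 0.70, True)]
ex2_active = [Beam("A", 0.85, False), Beam("C", 0.60, False)]

# Example 3: the truncated beam A also contains an answer.
ex3_completed = [Beam("B", 0.65, True)]
ex3_active = [Beam("A", 0.85, True)]

for label, completed, active in [
    ("Example 2", ex2_completed, ex2_active),
    ("Example 3", ex3_completed, ex3_active),
]:
    print(
        label,
        "or:", finalize_or(completed, active).name,
        "+:", finalize_plus(completed, active).name,
        "answer-aware:", finalize_answer_aware(completed, active).name,
    )
```

Under these assumptions, the answer-aware variant agrees with `or` in Example 2 and with `+` in Example 3, which is the behavior the analysis marks as correct in each case. Whether an answer check is reliable enough in practice depends on the detector, so this is a way to reason about the options rather than a recommendation.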
diff --git a/docs/tasks/tot_verification.md b/docs/tasks/tot_verification.md
@@ -0,0 +1,87 @@
+# Task: Tree of Thoughts — Implementation Verification
+
+## Goal
+
+Verify that our beam search implementation (used as ToT) correctly reproduces results from the original paper, then run experiments with Qwen2.5-Math-7B-Instruct.
+
+## Background
+
+- **Paper**: [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601) (Yao et al., 2023)
+- **Our implementation**: Beam search strategy (`llm_tts/strategies/strategy_beam_search.py`) with LLM-as-a-critic scorer (`llm_tts/scorers/step_scorer_llm_critic.py`), introduced in PR #161
+- **Original code**: https://github.com/princeton-nlp/tree-of-thought-llm
+- **Original prompts**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/src/tot/prompts
+- **Original trajectories (for comparison)**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/logs
+
+## Phase 1: Reproduce original paper results (Game of 24 + GPT-4)
+
+The paper reports **74% success rate** on Game of 24 (indices 900–999, 100 puzzles) using GPT-4 with ToT (b=5).
+
+### Steps
+
+1. **Compare prompts with original**
+   - Our prompts: `config/prompts/tree-of-thought/game24/` (propose_fewshot.txt, value_intermediate.txt, value_final.txt)
+   - Original prompts: https://github.com/princeton-nlp/tree-of-thought-llm/blob/master/src/tot/prompts/game24.py
+   - Ensure propose prompt, value prompt, and step format match exactly
+
+2. **Create experiment config**
+   - Dataset: `config/dataset/game24.yaml` (already exists, indices 900–1000; confirm the end index is exclusive so it covers the same 100 puzzles, 900–999, as the paper)
+   - Model: GPT-4 via OpenRouter (`openai/gpt-4` or `openai/gpt-4-turbo`)
+   - Strategy: beam search with `beam_width=5` (paper uses b=5)
+   - Scorer: LLM-as-a-critic (`config/scorer/llm_critic.yaml`)
+   - Create config at `config/experiments/beam_search/game24/beam_search_openrouter_gpt4_game24_llm_critic.yaml`
+
+3. **Implement Game of 24 evaluator**
+   - The task is NOT exact match — need to verify the expression equals 24
+   - Check if expression uses exactly the 4 given numbers
+   - Parse and evaluate arithmetic expression
+   - May need a custom evaluator in `llm_tts/evaluation/`
+
+4. **Run experiment**
+   ```bash
+   CUDA_VISIBLE_DEVICES="" python scripts/run_tts_eval.py \
+     --config-path ../config \
+     --config-name experiments/beam_search/game24/beam_search_openrouter_gpt4_game24_llm_critic \
+     dataset.subset=10  # start with 10, then full 100
+   ```
+
+5. **Compare results**
+   - Target: ~74% success rate (paper result with GPT-4)
+   - Compare trajectories with original logs: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/logs
+   - If significantly off, debug: check beam expansion, scoring, pruning behavior
+
+### Key differences to watch for
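+- Our beam search scores each step with the LLM critic; original ToT for Game of 24 samples the value prompt several times per state (sure/likely/impossible) and aggregates the results
+- Prompt format for "propose" and "value" steps must match paper exactly
+- Temperature and sampling parameters must match (the paper states a sampling temperature of 0.7 unless otherwise noted; confirm the settings for the propose and value prompts against the original run scripts)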
+
+## Phase 2: Run experiments with Qwen2.5-Math-7B-Instruct (4 math datasets)
+
+After Phase 1 confirms correctness, run beam search with LLM-as-a-critic on:
+
+1. **MATH-500** — `config/experiments/beam_search/math500/`
+2. **OlympiadBench** — `config/experiments/beam_search/olympiadbench/`
+3. **GaoKao 2023 En** — `config/experiments/beam_search/gaokao2023en/`
+4. **Minerva Math** — `config/experiments/beam_search/minerva_math/`
+
+### For each dataset
+- Model: Qwen2.5-Math-7B-Instruct (vLLM backend, 2 GPUs)
+- Scorer: LLM-as-a-critic
+- Beam width: 4 (our standard)
+- Seeds: 42, 43, 44 (3 seeds per dataset)
+- Configs already exist in `config/experiments/beam_search/*/window_all/mean/` with `llm_critic` suffix
+
+### Submission
+```bash
+./scripts/local/submit.sh --strategy beam_search --dataset math500 --scorers llm_critic --seeds 3
+./scripts/local/submit.sh --strategy beam_search --dataset olympiadbench --scorers llm_critic --seeds 3
+./scripts/local/submit.sh --strategy beam_search --dataset gaokao2023en --scorers llm_critic --seeds 3
+./scripts/local/submit.sh --strategy beam_search --dataset minerva_math --scorers llm_critic --seeds 3
+```
+
+## References
+
+- **Paper**: Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", NeurIPS 2023. https://arxiv.org/abs/2305.10601
+- **Original code**: https://github.com/princeton-nlp/tree-of-thought-llm
+- **Original prompts**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/src/tot/prompts
+- **Original trajectories**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/logs
+- **Our LLM-as-a-critic PR**: https://github.com/IINemo/thinkbooster/pull/161
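Step 3 of Phase 1 in `tot_verification.md` asks for an evaluator that checks whether an expression uses exactly the four given numbers and evaluates to 24. A minimal sketch of that check, assuming the model's final answer is a plain arithmetic expression string (optionally followed by `= 24`). The function name `is_valid_game24` is hypothetical and is not existing code in `llm_tts/evaluation/`.

```python
import ast
import operator
from fractions import Fraction
from typing import List

# Only the four basic arithmetic operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate a restricted arithmetic AST exactly, recording the literals used."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression element")


def is_valid_game24(expression: str, numbers: List[int]) -> bool:
    """Return True if `expression` uses exactly `numbers` and equals 24."""
    # Keep only the left-hand side if the answer is written as "<expr> = 24".
    expr = expression.split("=")[0].strip()
    used: List[int] = []
    try:
        tree = ast.parse(expr, mode="eval")
        value = _eval(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return sorted(used) == sorted(numbers) and value == 24


print(is_valid_game24("(10 - 4) * (13 - 9) = 24", [4, 9, 10, 13]))
print(is_valid_game24("8 / (3 - 8 / 3)", [3, 3, 8, 8]))
print(is_valid_game24("4 * 6", [4, 9, 10, 13]))
```

Using `Fraction` instead of floats avoids rounding problems on puzzles such as `8 / (3 - 8 / 3)`, and walking the AST instead of calling `eval` keeps arbitrary model output from being executed. How the final expression is extracted from a beam's trajectory is left open here.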