diff --git a/docs/tasks/beam_finalization_analysis.md b/docs/tasks/beam_finalization_analysis.md
new file mode 100644
index 00000000..1bca9544
--- /dev/null
+++ b/docs/tasks/beam_finalization_analysis.md
@@ -0,0 +1,136 @@
+# Analysis: Beam Finalization Logic (`completed_beams or active` vs `completed_beams + active`)
+
+> **Key point:** This issue **only triggers when `max_steps` is reached** while some beams are still active. If all beams complete naturally before `max_steps` (the common case for math reasoning), a different code path handles finalization correctly and this logic is never executed.
+
+## The Code in Question
+
+```python
+# Line 703 in strategy_beam_search.py
+# This runs AFTER the main loop ends (max_steps reached, sample still in active_samples)
+
+for sample_id in active_samples:
+    active = sample_beams[sample_id]
+    candidates = completed_beams_by_sample[sample_id] or active  # ← THIS LINE
+    best_beam = self._select_best_beam(candidates)
+```
+
+## How Beam Search Works (Two Exit Paths)
+
+### Exit Path 1: All beams complete inside the loop (line 643) - THE COMMON CASE
+A sample's beams are split every step into `completed` and `active` lists. Completed beams are stored in `completed_beams_by_sample` and removed from the active pool. When `active` becomes empty (all beams finished), `_select_best_beam(completed_beams_by_sample[sample_id])` is called; this only considers completed beams. **No issue here. This is how most math experiments work**: beams generate an answer and emit EOS well before `max_steps`.
+
+### Exit Path 2: `max_steps` reached with beams still active (line 699–704) - THE EDGE CASE
+The main `for step_num in range(max_steps)` loop ends. Some samples may still have active beams. **This only happens when `max_steps` is hit before all beams finish**, e.g., a short `max_steps` setting, very hard problems, or tasks like Game of 24 where some beams find an answer quickly while others don't. This is where line 703 runs. At this point:
+- `completed_beams_by_sample[sample_id]` = beams that finished early (EOS, `is_trajectory_complete`, etc.)
+- `sample_beams[sample_id]` = beams still active when max_steps was reached (never produced EOS)
+
+## What `or` Does
+
+Python `or` on lists: `[B1, B2] or [A1, A2, A3]` → `[B1, B2]` (first non-empty wins).
+
+| `completed_beams` | `active` | `candidates` (with `or`) |
+|---|---|---|
+| `[]` | `[A1, A2, A3]` | `[A1, A2, A3]`: falls through to active ✅ |
+| `[B1, B2]` | `[A1, A2, A3]` | `[B1, B2]`: **active beams ignored entirely** |
+| `[B1]` | `[]` | `[B1]` ✅ |
+
+## What `+` Would Do
+
+`[B1, B2] + [A1, A2, A3]` → `[B1, B2, A1, A2, A3]`: all beams are considered, and `_select_best_beam` picks the highest aggregated score.
+
+## Concrete Examples
+
+### Example 1: `or` causes suboptimal selection
+
+```
+beam_width=4, max_steps=30
+
+Step 12: Beam B completes early (model emits EOS), score=0.65
+         → moved to completed_beams
+Steps 13–30: Beams A, C, D continue expanding
+         A accumulates score=0.89, C=0.72, D=0.61
+
+At max_steps:
+  completed_beams = [B (0.65)]  ← non-empty
+  active = [A (0.89), C (0.72), D (0.61)]
+
+  With `or`: candidates = [B (0.65)] → picks B (0.65) ❌
+  With `+`:  candidates = [B, A, C, D] → picks A (0.89) ✅
+```
+
+Here `or` loses a much better beam.
+
+### Example 2: `or` is actually correct
+
+```
+Game of 24, max_steps=5
+
+Step 3: Beam B finds "= 24", completes, score=0.70
+Steps 4–5: Beams A, C generate more steps but never reach "= 24"
+         A has score=0.85 but trajectory is incomplete (no final answer)
+
+At max_steps:
+  completed_beams = [B (0.70, has answer "= 24")]
+  active = [A (0.85, no answer), C (0.60, no answer)]
+
+  With `or`: candidates = [B (0.70)] → picks B with valid answer ✅
+  With `+`:  candidates = [B, A, C] → picks A (0.85) but A has NO answer ❌
+```
+
+Here `or` correctly prefers the beam that actually solved the problem.
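
Examples 1 and 2 above can be reproduced with a small standalone sketch. The `Beam` tuple and the `select_best` helper are hypothetical stand-ins for the real beam objects and for `_select_best_beam`, which is assumed here to simply pick the highest score; the actual implementation may aggregate scores differently.

```python
from typing import List, NamedTuple


class Beam(NamedTuple):
    """Hypothetical stand-in for a beam: a label and its aggregated score."""
    name: str
    score: float


def select_best(candidates: List[Beam]) -> Beam:
    # Assumption: _select_best_beam returns the highest-scoring candidate.
    return max(candidates, key=lambda beam: beam.score)


def finalize_or(completed: List[Beam], active: List[Beam]) -> Beam:
    # Current logic: active beams are considered only if nothing completed.
    return select_best(completed or active)


def finalize_plus(completed: List[Beam], active: List[Beam]) -> Beam:
    # Proposed logic: completed and active beams compete on score alone.
    return select_best(completed + active)


# Example 1: one early completion, stronger beams still active.
ex1_completed = [Beam("B", 0.65)]
ex1_active = [Beam("A", 0.89), Beam("C", 0.72), Beam("D", 0.61)]

# Example 2: the completed beam is the only one with a final answer.
ex2_completed = [Beam("B", 0.70)]
ex2_active = [Beam("A", 0.85), Beam("C", 0.60)]

for label, completed, active in [
    ("Example 1", ex1_completed, ex1_active),
    ("Example 2", ex2_completed, ex2_active),
]:
    picked_or = finalize_or(completed, active).name
    picked_plus = finalize_plus(completed, active).name
    print(f"{label}: or picks {picked_or}, + picks {picked_plus}")
```

In both examples `or` picks B and `+` picks A. Whether that is right depends on whether A carries a usable answer, which a score-only selection cannot see.
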
+
+### Example 3: Mixed (both have answers)
+
+```
+Step 15: Beam B completes with answer, score=0.65
+Step 30 (max_steps): Beam A (score=0.85) is still "active" but its last step
+         also contains an answer (answer_pattern detected but
+         is_trajectory_complete=False due to how the detector works)
+
+  With `or`: picks B (0.65), misses A which has a better answer
+  With `+`:  picks A (0.85), gets the better answer
+```
+
+## The Core Question
+
+**What does "completed" mean semantically?**
+
+A beam is marked `completed` when `is_trajectory_complete=True` or `is_thinking_complete=True`. This is a **signal from the model** that it considers itself done (emitted EOS or stop token).
+
+An "active" beam at max_steps is one where **the model wanted to keep going but we cut it off**. Its trajectory may or may not contain a usable answer depending on the task.
+
+## My Assessment
+
+**Neither `or` nor `+` is universally correct.** They encode different assumptions:
+
+| Variant | Assumption | Good for | Bad for |
+|---|---|---|---|
+| `or` (current) | Completed beams are inherently better because they contain a full answer | Short tasks (Game of 24), tasks where answer completeness matters | Long math reasoning where max_steps is the normal exit |
+| `+` (proposed) | Score is the best indicator of quality regardless of completion status | Long math reasoning where all beams hit max_steps | Tasks where active beams have no usable answer |
+
+## Recommendation
+
+The safest fix is neither plain `or` nor plain `+`, but **`completed_beams + active` with a completion bonus or penalty**:
+
+```python
+# Option A: simple, just combine and trust the scores
+candidates = completed_beams_by_sample[sample_id] + active
+
+# Option B: prefer completed but don't ignore active entirely
+# (add a small bonus to completed beams' scores in _select_best_beam)
+
+# Option C: task-aware, check if active beams have answers
+candidates = completed_beams_by_sample[sample_id] + [
+    b for b in active if self._has_answer_content(b["steps"][-1])
+]
+if not candidates:
+    candidates = active  # nothing completed and no active beam has an answer
+```
+
+**For now, I recommend Option A (`+`)** because:
+1. In practice, the two variants rarely differ on math benchmarks: when every beam completes before `max_steps` this code path is skipped, and when every beam runs to `max_steps` there are no completed beams, so `or` falls through to `active` anyway
+2. When it does trigger, `+` gives the better result in the common case (Example 1)
+3. For Game of 24 / ToT verification, we should separately validate that completed beams actually have valid answers; that is an evaluator concern, not a beam selection concern
+4. The scorer's job is to assign scores that reflect answer quality, and we should trust it
+
+But this should be tested on Game of 24 specifically to confirm.
diff --git a/docs/tasks/tot_verification.md b/docs/tasks/tot_verification.md
new file mode 100644
index 00000000..c11dc707
--- /dev/null
+++ b/docs/tasks/tot_verification.md
@@ -0,0 +1,87 @@
+# Task: Tree of Thoughts Implementation Verification
+
+## Goal
+
+Verify that our beam search implementation (used as ToT) correctly reproduces results from the original paper, then run experiments with Qwen2.5-Math-7B-Instruct.
+
+## Background
+
+- **Paper**: [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601) (Yao et al., 2023)
+- **Our implementation**: Beam search strategy (`llm_tts/strategies/strategy_beam_search.py`) with LLM-as-a-critic scorer (`llm_tts/scorers/step_scorer_llm_critic.py`), introduced in PR #161
+- **Original code**: https://github.com/princeton-nlp/tree-of-thought-llm
+- **Original prompts**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/src/tot/prompts
+- **Original trajectories (for comparison)**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/logs
+
+## Phase 1: Reproduce original paper results (Game of 24 + GPT-4)
+
+The paper reports **74% success rate** on Game of 24 (indices 900–999, 100 puzzles) using GPT-4 with ToT (b=5).
+
+### Steps
+
+1. **Compare prompts with original**
+   - Our prompts: `config/prompts/tree-of-thought/game24/` (propose_fewshot.txt, value_intermediate.txt, value_final.txt)
+   - Original prompts: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/src/tot/prompts/game24.py
+   - Ensure propose prompt, value prompt, and step format match exactly
+
+2. **Create experiment config**
+   - Dataset: `config/dataset/game24.yaml` (already exists, indices 900–1000)
+   - Model: GPT-4 via OpenRouter (`openai/gpt-4` or `openai/gpt-4-turbo`)
+   - Strategy: beam search with `beam_width=5` (paper uses b=5)
+   - Scorer: LLM-as-a-critic (`config/scorer/llm_critic.yaml`)
+   - Create config at `config/experiments/beam_search/game24/beam_search_openrouter_gpt4_game24_llm_critic.yaml`
+
+3. **Implement Game of 24 evaluator**
+   - The task is NOT exact match: we need to verify that the expression equals 24
+   - Check that the expression uses exactly the 4 given numbers
+   - Parse and evaluate the arithmetic expression
+   - May need a custom evaluator in `llm_tts/evaluation/`
+
+4. **Run experiment**
+   ```bash
+   CUDA_VISIBLE_DEVICES="" python scripts/run_tts_eval.py \
+     --config-path ../config \
+     --config-name experiments/beam_search/game24/beam_search_openrouter_gpt4_game24_llm_critic \
+     dataset.subset=10 # start with 10, then full 100
+   ```
+
+5. **Compare results**
+   - Target: ~74% success rate (paper result with GPT-4)
+   - Compare trajectories with original logs: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/logs
+   - If significantly off, debug: check beam expansion, scoring, pruning behavior
+
+### Key differences to watch for
+- Our beam search does step-level scoring; original ToT does value-based voting
+- Prompt format for "propose" and "value" steps must match paper exactly
+- Temperature and sampling parameters must match (paper uses temperature=0.7 for propose, temperature=1.0 for value)
+
+## Phase 2: Run experiments with Qwen2.5-Math-7B-Instruct (4 math datasets)
+
+After Phase 1 confirms correctness, run beam search with LLM-as-a-critic on:
+
+1. **MATH-500**: `config/experiments/beam_search/math500/`
+2. **OlympiadBench**: `config/experiments/beam_search/olympiadbench/`
+3. **GaoKao 2023 En**: `config/experiments/beam_search/gaokao2023en/`
+4. **Minerva Math**: `config/experiments/beam_search/minerva_math/`
+
+### For each dataset
+- Model: Qwen2.5-Math-7B-Instruct (vLLM backend, 2 GPUs)
+- Scorer: LLM-as-a-critic
+- Beam width: 4 (our standard)
+- Seeds: 42, 43, 44 (3 seeds per dataset)
+- Configs already exist in `config/experiments/beam_search/*/window_all/mean/` with `llm_critic` suffix
+
+### Submission
+```bash
+./scripts/local/submit.sh --strategy beam_search --dataset math500 --scorers llm_critic --seeds 3
+./scripts/local/submit.sh --strategy beam_search --dataset olympiadbench --scorers llm_critic --seeds 3
+./scripts/local/submit.sh --strategy beam_search --dataset gaokao2023en --scorers llm_critic --seeds 3
+./scripts/local/submit.sh --strategy beam_search --dataset minerva_math --scorers llm_critic --seeds 3
+```
+
+## References
+
+- **Paper**: Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", NeurIPS 2023. https://arxiv.org/abs/2305.10601
+- **Original code**: https://github.com/princeton-nlp/tree-of-thought-llm
+- **Original prompts**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/src/tot/prompts
+- **Original trajectories**: https://github.com/princeton-nlp/tree-of-thought-llm/tree/master/logs
+- **Our LLM-as-a-critic PR**: https://github.com/IINemo/thinkbooster/pull/161
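
As a supplement to Phase 1, step 3 (the Game of 24 evaluator), the three checks listed there could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `is_valid_game24`, its signature, and the handling of a trailing `= 24` are hypothetical and do not reflect any existing code in `llm_tts/evaluation/`. The sketch parses the expression with Python's `ast` module instead of calling `eval`, and uses exact rational arithmetic so that divisions such as `8 / (3 - 8 / 3)` are not affected by floating-point error.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction
from typing import List

# Only the four basic arithmetic operators are accepted.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate an arithmetic AST exactly, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression element")


def is_valid_game24(expression: str, numbers: List[int], target: int = 24) -> bool:
    """Hypothetical check: uses exactly the given numbers and equals the target."""
    # Assumption: the answer may end with "= 24"; keep only the left-hand side.
    expression = expression.split("=")[0].strip()
    try:
        tree = ast.parse(expression, mode="eval")
        used: List[int] = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == target


print(is_valid_game24("(10 - 4) * (13 - 9) = 24", [4, 9, 10, 13]))
print(is_valid_game24("4 * 6", [4, 9, 10, 13]))
```

The first call passes all three checks; the second evaluates to 24 but does not use the four given numbers, so it is rejected.
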