[BUG] RFOptuna eval path (PR #234): 4 bugs prevent end-to-end sharded Optuna RAG runs (categorical labels, missing pipeline key, query-actor engine reuse, replacement-pipeline rate limiter)

## Bug Description
Running `RFOptuna` over the RAG/evals path with `num_shards > 1` (sharded, prune-and-relaunch
Optuna search) fails at four distinct points. Each is independently reproducible and has a small,
local fix. With all four applied, sharded Optuna RAG search runs end-to-end: trials complete, the
median pruner prunes after shard boundaries, and replacement pipelines are suggested, launched,
and complete normally (verified at `num_shards` = 1, 4, and 8 on a 4×L4 box).
 
The four issues, in the order they surface during a run:
 
1. `_object_labels` crashes on dict/None categorical knob choices (`vars()` called on a value with no `__dict__`).
2. `_template_to_leaf_evals` omits the `pipeline` key, so the controller `KeyError`s when
   launching a config.
3. `query_actor.py` deletes the inference engine and then trusts a stale hash on reuse, causing
   `AttributeError` under sharding.
4. The controller's replacement-pipeline path doesn't inherit the parent's rate limiter /
   max-completion-tokens, so relaunched trials fail to initialize their inference engine.

## To Reproduce
Steps to reproduce the behavior:
1. Configure an `RFOptuna` search over a RAG eval experiment with a categorical knob whose choices
   are dicts (e.g. a reranker spec) **or** include `None`.
2. Set `num_shards = 4`.
3. Run `experiment.run_evals(...)`.

The run fails at bug 1 during Optuna label construction; fixing each
bug in turn surfaces the next. (Single-shard `num_shards=1` partially masks bugs 3 and 4 because
no relaunch/sharded reuse occurs.)

## Expected Behavior
With all four fixes applied, a sharded `RFOptuna` RAG eval search runs end-to-end on a 4×L4 box at
`num_shards` = 1, 4, and 8:
 
- trials complete and log metrics,
- the median pruner prunes pipelines after shard boundaries,
- replacement pipelines are suggested, launched, and complete (no init failures),
- 24-config searches (12 initial + 12 relaunched) finish per shard setting.

## Environment
- RapidFire AI: `main` @ `aa7f4c736d0de0582dc086ed69c231fed15a6190` (merged PR #234; past v0.15.2)
- Python 3.12, Ubuntu 24.04
- 4× NVIDIA L4 (24 GB), CUDA 12.x
- Path exercised: `run_evals` (RAG/context-eng), `RFOptuna` sampler with median pruner,
  `num_shards ∈ {4, 8}` (i.e. `>1`)
- Generator: a closed-model API (Gemini 2.5 Pro via OpenAI-compatible endpoint); RAG eval, no
  local model weights
- Reproducible regardless of generator; the failures are in orchestration, not the model calls

## Additional Context
## Bug 1 — `_object_labels` crashes on dict/None categorical choices
 
**Symptom**
 
```
TypeError: vars() argument must have __dict__ attribute
```
raised from the Optuna label-construction helper when a categorical knob's choices are dicts or
include `None`.
 
**Root cause**
 
`_object_labels` (in the automl/Optuna integration) calls `vars(choice)` to build a human-readable
label for each categorical option. That assumes every choice is an object with a `__dict__`. RAG
knobs routinely pass plain `dict` choices (e.g. a reranker config) or `None` (knob disabled), and
`vars()` raises on both.
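
The failure is easy to reproduce outside RapidFire AI. The sketch below calls `vars()` on the three kinds of choice described above; `RerankerSpec` and `try_vars` are hypothetical names used only for illustration, not classes or helpers from the codebase.

```python
class RerankerSpec:
    """Hypothetical object-valued knob choice (instances have a __dict__)."""

    def __init__(self, top_n):
        self.top_n = top_n


def try_vars(choice):
    """Return vars(choice), or the TypeError raised for choices without __dict__."""
    try:
        return vars(choice)
    except TypeError as exc:
        return exc


results = {
    "object": try_vars(RerankerSpec(top_n=5)),  # works: returns the attribute dict
    "dict": try_vars({"top_n": 5}),             # dict instances have no __dict__
    "none": try_vars(None),                     # None has no __dict__ either
}
for kind, outcome in results.items():
    print(f"{kind}: {outcome!r}")
```

Only the object-valued choice yields a label mapping; the dict and `None` choices both produce the `TypeError` shown in the symptom.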
 
**Fix**
 
Handle dict and `None` (and any other value without `__dict__`) explicitly instead of assuming
every choice has attributes. We applied this as a wrapper, `_safe_object_labels`, used in place of
the raw `_object_labels`: it returns a dict as-is, returns a sentinel for `None`, and falls back to
`str(choice)` for anything else without `__dict__`. In-tree, the equivalent is to guard the
`vars()` call:
 
```python
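def _object_labels(obj):
    # Disabled knob: no attributes to inspect, so label it by name.
    if obj is None:
        return "None"
    # Plain dict choices (e.g. a reranker config) already are the label mapping.
    if isinstance(obj, dict):
        return obj
    # Regular objects: use their attribute dict, as before.
    if hasattr(obj, "__dict__"):
        return vars(obj)
    # Anything else (str, int, tuple, ...) has no __dict__; fall back to str().
    return str(obj)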
```
 
---
 
## Bug 2 — `_template_to_leaf_evals` omits the `pipeline` key → controller `KeyError`
 
**Symptom**
 
```
KeyError: 'pipeline'
```
in the controller when it tries to launch a leaf eval config produced by the Optuna template
expansion.
 
**Root cause**
 
`_template_to_leaf_evals` builds each leaf config dict with an `api_config` key but no `pipeline`
key. Downstream, the controller indexes the config by `["pipeline"]` when scheduling/launching, so
the missing key is fatal. (The grid/random expansion paths populate `pipeline`; the Optuna
template path does not.)
 
**Fix**
 
Ensure the leaf config carries a `pipeline` entry. The minimal change is to set `pipeline` from the
`api_config` (rename/duplicate) when constructing the leaf dict in `_template_to_leaf_evals`, so the
config shape matches what the controller expects from the other samplers.
 
```python
# in _template_to_leaf_evals, when assembling each leaf config:
leaf["pipeline"] = leaf.pop("api_config")   # or: leaf["pipeline"] = leaf["api_config"]
```
 
(Match whichever key the grid/random paths use so all samplers produce identically-shaped configs.)
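
One way to keep the samplers' outputs aligned is a shape check over the expanded leaf configs. This is only a sketch: `REQUIRED_LEAF_KEYS`, `missing_leaf_keys`, and the sample dicts are hypothetical, and the real required key set should come from whatever the controller actually indexes.

```python
# Assumption: "pipeline" is the only key the controller requires on a leaf config.
REQUIRED_LEAF_KEYS = frozenset({"pipeline"})


def missing_leaf_keys(leaf_configs, required=REQUIRED_LEAF_KEYS):
    """Map each offending leaf index to the sorted list of required keys it lacks."""
    problems = {}
    for index, leaf in enumerate(leaf_configs):
        missing = sorted(required - set(leaf))
        if missing:
            problems[index] = missing
    return problems


# Stand-in leaf configs: the first mimics the Optuna template output before the fix.
before_fix = [{"api_config": {"model": "stub"}}]
after_fix = [{"pipeline": {"model": "stub"}}]

print(missing_leaf_keys(before_fix))
print(missing_leaf_keys(after_fix))
```

Run against the output of each sampler in a unit test, a check like this would flag a missing `pipeline` key at expansion time rather than at launch.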
 
---
 
## Bug 3 — `query_actor.py` deletes the inference engine, then reuses a stale hash → `AttributeError` under sharding
 
**Symptom**
 
```
AttributeError: 'QueryActor' object has no attribute 'inference_engine'
```
on the second and later shards of a config (i.e. only manifests with `num_shards > 1`).
 
**Root cause**
 
Two interacting lines in `rapidfireai/evals/actors/query_actor.py`:
 
- On teardown of a shard, the actor does `del self.inference_engine` (≈ line 149). After the
  attribute is deleted, any later access raises `AttributeError` rather than being recognized as
  "no engine present".
- The reuse branch (≈ line 139) decides whether to rebuild the engine **purely from a config
  hash** — if the new shard's config hash matches the previous one, it assumes the engine is still
  there and skips rebuilding. But the engine was `del`'d, so the hash matches while the attribute
  is gone → it tries to use `self.inference_engine` and `AttributeError`s.

**Fix**
 
Two small changes:
 
1. Set the attribute to `None` instead of deleting it, so its absence is representable:
   ```python
   # query_actor.py ~line 149
   self.inference_engine = None     # was: del self.inference_engine
   ```
2. Make the reuse branch also check the engine actually exists, not just that the hash matches:
   ```python
   # query_actor.py ~line 139
   if config_hash == self._last_config_hash and self.inference_engine is not None:
       ...  # reuse the existing engine
   else:
       ...  # rebuild the engine for this config
   ```
 
With both, a config spanning multiple shards correctly reuses its engine when present and rebuilds
when it was torn down.
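
The interaction can be reproduced with a toy actor. The sketch below is not the RapidFire `QueryActor`: `ToyActor`, `ensure_engine`, `teardown`, and `run_two_shards` are hypothetical names that only model the hash check and the teardown described in this section, with a plain `object()` standing in for the engine.

```python
class ToyActor:
    """Hypothetical stand-in for an actor that caches an engine per config hash."""

    def __init__(self, fixed):
        self.fixed = fixed             # True: apply both changes from this section
        self.inference_engine = None
        self._last_config_hash = None
        self.builds = 0                # how many times an engine was constructed

    def ensure_engine(self, config_hash):
        reuse = config_hash == self._last_config_hash
        if self.fixed:
            # Fixed reuse check: the hash must match AND the engine must exist.
            reuse = reuse and self.inference_engine is not None
        if not reuse:
            self.inference_engine = object()  # placeholder for a real engine
            self._last_config_hash = config_hash
            self.builds += 1
        return self.inference_engine          # raises if the attribute was deleted

    def teardown(self):
        if self.fixed:
            self.inference_engine = None      # absence stays representable
        else:
            del self.inference_engine         # original behavior


def run_two_shards(actor):
    """Process two shards of the same config, tearing the engine down in between."""
    try:
        for _ in range(2):
            actor.ensure_engine("cfg-a")
            actor.teardown()
        return "ok"
    except AttributeError:
        return "AttributeError"


buggy_outcome = run_two_shards(ToyActor(fixed=False))
fixed_actor = ToyActor(fixed=True)
fixed_outcome = run_two_shards(fixed_actor)
print(buggy_outcome, fixed_outcome, fixed_actor.builds)
```

In the unfixed variant the second shard sees a matching hash, skips the rebuild, and touches the deleted attribute. In the fixed variant the `None` check forces a rebuild, so the engine is built once per shard in this toy setup.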
 
---
 
## Bug 4 — replacement (relaunched) pipelines don't inherit the parent's rate limiter / max-completion-tokens → init failure
 
**Symptom**
 
After the Optuna pruner prunes a pipeline and the controller suggests a replacement, the
relaunched pipeline fails to initialize its inference engine:
 
```
APIInferenceEngine missing rate_limiter_actor   (replacement pipelines fail-init under sharding)
```
All replacement trials fail to start; only the initial wave of trials runs.
 
**Root cause**
 
In the controller's replacement-pipeline loop (≈ line 1673, after
`scheduler.add_pipeline(new_pid, ...)`), the new pipeline id is registered with the scheduler but
is **not** added to the maps that hold per-pipeline API resources — specifically the rate limiter
actor and the max-completion-tokens setting. The initial pipelines get these at experiment setup;
replacements created mid-run inherit neither, so when the worker builds the `APIInferenceEngine`
for a replacement it has no `rate_limiter_actor` and init fails.
 
**Fix**
 
When creating a replacement pipeline, copy the parent's API-resource entries to the new pipeline
id. We added, immediately after `scheduler.add_pipeline(new_pid, ...)`:
 
```python
# controller.py ~line 1673, in the replacement-pipeline loop
pipeline_to_rate_limiter[new_pid] = pipeline_to_rate_limiter[pipeline_id]
pipeline_to_max_completion_tokens[new_pid] = pipeline_to_max_completion_tokens[pipeline_id]
```
 
(Use whatever the parent's id variable is named in that scope — `pipeline_id` / `parent_pid`.)
With this, replacement pipelines initialize their inference engine and complete their shards
normally. Verified: with the fix, `num_shards` 4 and 8 report `init failures: 0` and replacement
pipelines (ids 13+) complete all shards.
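
A small helper would keep the per-pipeline maps from drifting if more of them are added later. This is a sketch under assumptions: `inherit_api_resources` is a hypothetical helper, and the dicts below are stand-ins for the controller's maps with made-up values.

```python
def inherit_api_resources(parent_pid, new_pid, *resource_maps):
    """Copy the parent's entry to the replacement id in each per-pipeline map.

    Raises KeyError if the parent is missing from a map, so a gap is caught at
    relaunch time instead of at engine initialization.
    """
    for resource_map in resource_maps:
        resource_map[new_pid] = resource_map[parent_pid]


# Stand-in maps keyed by pipeline id; the real values would be a rate limiter
# actor handle and an integer token limit.
pipeline_to_rate_limiter = {3: "rate-limiter-handle"}
pipeline_to_max_completion_tokens = {3: 1024}

inherit_api_resources(3, 13, pipeline_to_rate_limiter, pipeline_to_max_completion_tokens)
print(pipeline_to_rate_limiter[13], pipeline_to_max_completion_tokens[13])
```

As in the two-line fix, the replacement shares the parent's rate limiter entry rather than getting a new one.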


