Changes from all commits
28 commits
d0553fd
feat: support SD1.5/SDXL/Dreamshaper alongside SD-Turbo
happyFish Apr 25, 2026
633f427
feat: model_id_or_path as enum dropdown of tested models
happyFish Apr 25, 2026
43a1d8f
fix: accept model_id_or_path kwarg so the UI dropdown actually loads
happyFish Apr 25, 2026
788f7a5
feat: hot-swap model when model_id_or_path changes at runtime
happyFish Apr 25, 2026
c4151c8
refactor: read model_id_or_path via get_param like every other runtim…
happyFish Apr 25, 2026
8315c82
fix: sequential denoising in txt2img mode so multi-step doesn't flash…
happyFish Apr 25, 2026
9fdbb7e
Revert "fix: sequential denoising in txt2img mode so multi-step doesn…
happyFish Apr 25, 2026
7872f91
feat: serial denoise path for txt2img and image-loopback modes
happyFish Apr 25, 2026
25478b7
feat(trt): node-id-keyed adapter cache to survive graph edits
happyFish May 3, 2026
7c89f3d
feat(schema): width/height as Resolution IntEnum with field_validator
happyFish May 5, 2026
705da78
Merge feat/node-keyed-trt-cache
happyFish May 5, 2026
d8d2fbd
Merge branch 'main' into sd-multi-model
happyFish May 5, 2026
29f1ea7
feat(schema): restrict to Turbo-only, drop num_inference_steps
happyFish May 5, 2026
1015f4b
refactor(pipeline): drop non-Turbo paths, hardcode 1-step denoise
happyFish May 5, 2026
cc82a98
docs: handoff notes for the Turbo-only simplification
happyFish May 5, 2026
206300f
feat: DMD2-SDXL-1step preset via extensible MODEL_PRESETS
happyFish May 5, 2026
819ec3d
fix: load DMD2 UNet via state_dict override, not from_pretrained
happyFish May 5, 2026
1f676a4
fix(dmd2): pin training timestep [399] via timesteps_override
happyFish May 5, 2026
8763277
feat(trt): SDXL UNet engine support (Turbo + DMD2 verified)
happyFish May 6, 2026
c56a581
feat(trt): SDXL UNet dynamic shape over 512-1024
happyFish May 6, 2026
12609dd
chore: refresh PR docs to reflect expanded scope
happyFish May 6, 2026
54c4ad6
fix(schema): ModelId as StrEnum so HF URL formatting uses values
happyFish May 6, 2026
393fcab
Testing sd multi
happyFish May 8, 2026
51d0083
feat(loopback): per-model implicit_loopback flag for CFG-distilled mo…
happyFish May 9, 2026
2c91427
feat(negative): embedding-space negative subtraction for single-pass …
happyFish May 9, 2026
90d1c1f
fix(trt): defensive re-activate of TAESD engines if context is lost
happyFish May 9, 2026
b1b5478
refactor: extract PromptEncoder into its own module
happyFish May 9, 2026
468e837
docs(plans): hand-off plans for refactor, SDXL ControlNet, and LoRA
happyFish May 9, 2026
1 change: 1 addition & 0 deletions .claude/scheduled_tasks.lock
@@ -0,0 +1 @@
{"sessionId":"d8c05a9f-ecd4-4918-ba19-035cdd531a1a","pid":805816,"acquiredAt":1778002935354}
377 changes: 248 additions & 129 deletions CLAUDE.md

Large diffs are not rendered by default.

155 changes: 155 additions & 0 deletions docs/plans/LORA_PLAN.md
@@ -0,0 +1,155 @@
# Plan: LoRA Support

## Context
`schema.py` has `supports_lora = True` already. `pipeline.py` has stub `load_lora` and `fuse_lora` methods that aren't called. Scope has a `download_lora` endpoint already — verify in the parent repo `daydreamlive-scope`.

## Schema Changes (`src/scope_streamdiffusion/schema.py`)

Add a `LoraSpec` model and a `loras` list field on `StreamDiffusionConfig`:
```python
from typing import Optional

from pydantic import BaseModel, Field


class LoraSpec(BaseModel):
    repo_id: str                       # HF repo or local path
    weight_name: Optional[str] = None  # for repos with multiple files
    adapter_name: str                  # diffusers adapter name; required for stack/swap
    scale: float = 1.0                 # 0..2 typical


class StreamDiffusionConfig(BaseModel):
    ...
    loras: list[LoraSpec] = Field(
        default_factory=list,
        json_schema_extra=ui_field_config(order=..., label="LoRAs"),
    )
```

Order field: place after model selection but before ControlNet config. Reuse Scope's existing LoRA picker UI if one exists in the parent repo's other pipelines.

## Loader Wiring (`ModelLoader` post-refactor, or `pipeline.py` if pre-refactor)

LoRAs attach via `pipe.load_lora_weights(repo_id, weight_name=..., adapter_name=...)`. After loading all requested adapters, call `pipe.set_adapters([names...], adapter_weights=[scales...])`.

**Lifecycle order:**
1. `ModelLoader._load_model` loads the diffusers pipe.
2. SDXL fp16 VAE swap.
3. **LoRA attach.** Iterate `config.loras`, call `pipe.load_lora_weights` per spec (steps 3-4 are sketched after this list).
4. `pipe.set_adapters(...)` with names + scales.
5. **Do NOT call `fuse_lora`.** Keep adapters live so scales/swaps work without reload. Only fuse before TRT compilation (next step).
6. PromptEncoder.attach, ControlNetHandler.attach.
7. TRTLifecycle.attach. **If TRT is enabled, fuse_lora here** before compiling — TRT bakes weights at compile time, so fused-then-compiled is the only correct path (unless using the refit path — see TRT Refit below).

## Change Detection
Track a "LoRA signature" (sorted tuple of `(repo_id, weight_name, adapter_name, scale)`) on the model loader; a sketch follows the list below. On `_swap_model` / `_ensure_pipe_loaded`:
- Same model + same LoRA signature → no-op.
- Same model + different LoRA signature, **eager mode** → call `pipe.unload_lora_weights()`, then re-attach. Cheap, no reload needed.
- Same model + different LoRA signature, **TRT mode without refit** → full reload required. Treat this as a model swap. Surface the cost in the UI — recompiling SDXL UNet is 10+ minutes.
- Same model + different LoRA signature, **TRT mode with refit-capable engine** → refit (see below). 1–10s instead of 10+ min.
- Scale-only change with same adapters loaded, **eager mode** → `pipe.set_adapters(...)` with new weights. No reload.
- Scale-only change, **TRT non-refit** → full reload. **TRT refit** → refit.
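
A sketch of the signature helpers (names are illustrative), assuming the `loras` list from the schema section; comparing the scale-free variant first distinguishes scale-only changes from adapter swaps:

```python
def lora_signature(loras: list[LoraSpec]) -> tuple:
    # Sorted so config reordering doesn't read as a swap; weight_name can be
    # None, so normalize to "" to keep the tuples comparable.
    return tuple(sorted(
        (s.repo_id, s.weight_name or "", s.adapter_name, s.scale)
        for s in loras
    ))

def adapter_signature(loras: list[LoraSpec]) -> tuple:
    # Same tuple minus scale: if this matches but lora_signature differs,
    # the change is scale-only.
    return tuple(sorted(
        (s.repo_id, s.weight_name or "", s.adapter_name)
        for s in loras
    ))
```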

## Cache Coordination with TRT
The TRT cache key (in `_trt_cache.py` / `trt_engines.py`) must include the LoRA signature. Otherwise two different LoRA stacks will collide on the same cache slot and you'll silently load the wrong engine. Hash the sorted signature into the engine filename.
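
One way to fold the signature in (a sketch; the exact path scheme is whatever `_trt_cache.py` already does):

```python
import hashlib

def lora_cache_component(signature: tuple) -> str:
    # Short, stable digest of the sorted LoRA signature for the filename.
    if not signature:
        return "lora-none"
    return "lora-" + hashlib.sha256(repr(signature).encode()).hexdigest()[:12]
```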

When using refit, the cache key for the *engine* uses only the base model + refit-capable flag (LoRA signature does NOT affect the engine identity). The fused weights are applied at refit time. The LoRA signature is tracked separately as the "currently refit-applied state" and used only for change detection.

## Scope Integration
The user mentioned Scope has a `download_lora` endpoint already. Find it in the parent repo (`daydreamlive-scope`) and confirm:
- Whether it returns a local path or a repo_id.
- Whether the UI already has a LoRA picker in other pipelines that we can match.
- Whether LoRA management is per-pipeline or global.

Match the existing pattern. Don't invent a new one.

## Testing
1. Eager SD-Turbo + a single style LoRA from CivitAI (download via Scope, attach via config).
2. Live scale change 0.0 → 1.0 → 1.5. Should update without reload.
3. Live LoRA swap (different adapter). Should be fast (unload + load), no model reload.
4. Toggle TRT on with LoRAs attached. Confirm fuse-then-compile path runs and engine is cached with LoRA-aware key.
5. Live LoRA change with TRT on (non-refit) — confirm full reload + recompile triggers and completes.
6. Stack 2 LoRAs simultaneously. Verify `set_adapters` with multiple names works and scales are independent.
7. SDXL + LoRA (eager and TRT).

## Out of Scope (defer)
- Multi-LoRA blending UI beyond stack-with-scales.
- LoRA training or merging.

---

# Addendum: TRT Refit Path for LoRAs

The base plan above says "LoRA change with TRT → full reload" — correct but expensive (10+ min for SDXL). TensorRT's **refit** feature lets you update weights in a built engine without rebuilding it. This is the right answer for live LoRA swaps on TRT.

## What Refit Buys You
- Engine structure (layers, shapes, fusions) stays compiled.
- Only the weight tensors get re-uploaded.
- Typical refit time: **1–10 seconds** for SDXL UNet vs. 10+ minutes for full rebuild.
- Works for scale changes AND adapter swaps, as long as the LoRA targets the same layers.

## Build-Time Requirements
The engine must be compiled with refit enabled. Two flags in the TRT builder:
- `BuilderFlag.REFIT` — required.
- `BuilderFlag.STRIP_PLAN` (TRT 10+) — optional but recommended; strips weights from the engine file so you ship a smaller cache and refit at load. Trade-off: load is no longer instant — must refit before first inference.

**Decision:** use `REFIT` only (not `STRIP_PLAN`). Cached engines stay self-sufficient; refit only runs when LoRAs change. The size penalty for `REFIT`-only is small (~5%) and inference perf is unchanged.

## Implementation Sketch

### Builder changes (`src/scope_streamdiffusion/_trt/builder.py` or wherever the network config lives)
Add `network_flags` / `builder_config.flags |= 1 << int(trt.BuilderFlag.REFIT)` to all UNet builders (`build_unet_engine`, `build_unet_sdxl_engine`, `build_unet_with_control_engine`, and the new SDXL+control variant). VAE/TAESD/ControlNet engines don't need it — LoRAs target UNet only (cross-attention layers).
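
For reference, a sketch of setting the flag, assuming the existing build path already has a `builder_config` (`IBuilderConfig`) in scope:

```python
import tensorrt as trt

# Idiomatic form:
builder_config.set_flag(trt.BuilderFlag.REFIT)
# Equivalent bit-twiddling form:
# builder_config.flags |= 1 << int(trt.BuilderFlag.REFIT)
```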

### Refit at runtime (new method on `TRTLifecycle`)
```python
def refit_lora(self, lora_signature):
    # 1. Load base UNet weights into a temporary diffusers UNet (CPU OK).
    # 2. Apply LoRA stack to that UNet (load_lora_weights + set_adapters + fuse_lora).
    # 3. Use trt.Refitter to push the fused weights into the live engine.
    # 4. Discard the temp UNet.
    ...
```

The refitter API:
```python
refitter = trt.Refitter(self._trt_unet_engine.engine, TRT_LOGGER)
for name in refitter.get_all_weights():  # or get_missing_weights() for a partial pass
    torch_weights = fused_unet_state_dict[map_trt_name_to_torch(name)]
    # Refitter wants host memory; hand it a contiguous numpy view.
    refitter.set_named_weights(name, torch_weights.cpu().contiguous().numpy())
assert refitter.refit_cuda_engine()
```

### Name mapping (the hard part)
TRT weight names come from the ONNX export and don't match diffusers' `state_dict` keys 1:1. You need a map. Two approaches:
1. **Build the map at compile time.** During ONNX export, record the `(torch_param_name → onnx_initializer_name)` mapping and persist it next to the engine in the cache. At refit time, load the map and translate.
2. **Reconstruct the map at refit time** by re-running ONNX export on a dummy UNet with the same architecture and reading the resulting initializer names. Slower but simpler.

Recommend approach 1. Save the map as `<engine>.refit_map.json` alongside the engine file. The TRT cache key already covers architecture variants, so the map is valid for the engine.
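
A sketch of the persistence step for approach 1, assuming `torch.onnx.export` keeps most parameter names as initializer names (anything renamed by constant folding lands in `unmatched` for manual handling):

```python
import json
import onnx

def save_refit_map(onnx_path: str, map_path: str, param_names: set[str]) -> None:
    # Record which ONNX initializer names line up with torch state_dict keys.
    initializers = {init.name for init in onnx.load(onnx_path).graph.initializer}
    refit_map = {name: name for name in initializers if name in param_names}
    unmatched = sorted(initializers - refit_map.keys())
    with open(map_path, "w") as f:
        json.dump({"map": refit_map, "unmatched": unmatched}, f, indent=2)
```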

### Cache key change
Refit-capable engines and refit-incapable engines are different artifacts. Add `refit=True` to the cache key path component so old (non-refit) cached engines aren't reused. Old engines stay valid for non-LoRA streams; new ones get used when LoRAs are configured.

## Updated LoRA Lifecycle (replaces "full reload" branch in the base plan)

| Change | Eager | TRT (refit-capable engine) | TRT (legacy non-refit engine) |
|---|---|---|---|
| Scale only | `set_adapters` | refit | rebuild |
| Adapter swap, same layers | unload + load + `set_adapters` | refit | rebuild |
| Adapter swap, different layers | same | refit (zero out unused) | rebuild |
| Add ControlNet, etc. | rebuild pipeline state | rebuild engine | rebuild |

"Different layers" case: if a new LoRA targets layers the previous one didn't, those original-weight slots need to be restored to the base model's weights during refit. The fused-state-dict approach handles this naturally since the temp UNet is built from base weights + new LoRA stack.

## When to Skip Refit
- First time TRT is enabled with LoRAs configured → fuse first, then build (current plan). Refit only helps on subsequent changes.
- Engine compiled before this feature lands → fall back to rebuild. Detect via the cache-key version bump.
- Refitter reports missing weights → log and rebuild. Don't run a partially-refit engine.

## Testing (Refit-specific)
1. Cold start with one LoRA + TRT. Confirm engine builds with `REFIT` flag (check `engine.refittable`).
2. Live scale change 0.0 → 1.5. Should complete in <10s, no recompile log.
3. Live adapter swap (different LoRA, same target layers). Same speed.
4. Live adapter swap to a LoRA that targets *additional* layers. Confirm refit covers all weights and output is correct.
5. Stress test: 20 rapid scale/adapter changes. Memory should stay stable (the temp UNet must actually be freed).
6. SDXL refit specifically — name-map size is larger; verify no missing weights.

## Risk
- TRT refit name mapping is fiddly. Budget time for debugging the ONNX-name ↔ torch-name mapping.
- Some TRT optimizations bake constants. If a LoRA's effective rank changes the optimal kernel choice, refit produces correct but suboptimal output. Acceptable trade-off.
- `STRIP_PLAN` is tempting but adds first-inference latency. Skip it.

This makes live LoRA swaps on TRT actually viable instead of "technically supported but never used."
19 changes: 19 additions & 0 deletions docs/plans/README.md
@@ -0,0 +1,19 @@
# Plans

Hand-off plans for the next round of work on `sd-multi-model`. Each plan is self-contained and intended to be executed by another agent without needing the originating conversation.

- [REFACTOR_PLAN.md](REFACTOR_PLAN.md) — decompose `pipeline.py` into helper classes (`TRTLifecycle`, `ModelLoader`, `InferenceCore`) following the `PromptEncoder` / `ControlNetHandler` pattern.
- [SDXL_CONTROLNET_PLAN.md](SDXL_CONTROLNET_PLAN.md) — wire SDXL ControlNet through the eager and TRT paths (currently raises `NotImplementedError` on TRT for SDXL).
- [LORA_PLAN.md](LORA_PLAN.md) — schema, loader wiring, change detection, and the TRT refit path for live LoRA swaps.

## Recommended order
1. Refactor (lands first — the LoRA plan assumes the `ModelLoader` and `TRTLifecycle` helpers exist).
2. SDXL ControlNet (independent of LoRA).
3. LoRA (depends on refactor; benefits from but does not require ControlNet work).

## Architectural pattern (read first)
All three plans assume the helper-class composition pattern. The canonical examples in the repo:
- `src/scope_streamdiffusion/prompt_encoder.py`
- `src/scope_streamdiffusion/controlnet.py`

Helpers take `(device, dtype)` at construction, gain a pipe back-reference via `attach(pipe, sdxl)`, and expose runtime state as instance attributes.
116 changes: 116 additions & 0 deletions docs/plans/REFACTOR_PLAN.md
@@ -0,0 +1,116 @@
# Refactor Plan: pipeline.py Decomposition

## Goal
Reduce `pipeline.py` from ~1900 lines to a thin orchestrator (~400 lines) by extracting cohesive responsibilities into helper classes. Follow the pattern established by `PromptEncoder` (commit `b1b5478`) and the existing `ControlNetHandler`.

## Architectural Pattern (non-negotiable — already established)
- Helper class lives in its own module under `src/scope_streamdiffusion/`.
- Constructor takes `(device, dtype)` and any static config.
- `attach(pipe, sdxl: bool)` lifecycle method called from `_ensure_pipe_loaded` and `_swap_model` after the diffusers pipeline is loaded. Helpers re-bind to the new pipe here.
- Helper owns its caches and exposes runtime state as instance attributes the pipeline reads through (e.g., `self.prompts.prompt_embeds`).
- Helper has explicit `reset_caches()` / `release()` methods called on model swap or teardown.
- **No mixins.** Composition only. The user explicitly rejected mixins.
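
A skeleton of the pattern (illustrative name; a doc sketch, not a shared base class, per the rules at the bottom):

```python
class SomeHelper:
    def __init__(self, device, dtype):
        self.device = device
        self.dtype = dtype
        self.pipe = None   # back-reference, bound in attach()
        self.sdxl = False

    def attach(self, pipe, sdxl: bool) -> None:
        # Re-bind to the (possibly new) diffusers pipeline after load/swap.
        self.pipe = pipe
        self.sdxl = sdxl

    def reset_caches(self) -> None:
        ...  # drop per-model state on swap

    def release(self) -> None:
        ...  # free engines/tensors on teardown
```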

## Reference Files
- `src/scope_streamdiffusion/prompt_encoder.py` — the template. Read this first.
- `src/scope_streamdiffusion/controlnet.py` — second example of the pattern.
- `src/scope_streamdiffusion/pipeline.py` — the source to extract from.

## Extraction Order (do them in this order, commit between each)

### Extraction 1: `TRTLifecycle` → `src/scope_streamdiffusion/trt_lifecycle.py`
**Methods to move:**
- `_ensure_trt_taesd`
- `_ensure_trt_controlnet`
- `_ensure_trt_unet`
- `_setup_trt`
- `_reset_trt_state`
- `_set_acceleration_mode`
- `_deactivate_trt`
- `_trt_setup_args_from_config`

**Compromise to accept:** these methods currently mutate `self.unet`, `self.controlnet`, `self.vae`, `self._taesd_vae` directly. Don't fight it — give the helper a back-reference to the pipeline (`self.pipe = pipe` set in `attach()`) and have it write through. The win is moving 500 lines of TRT-specific lifecycle code out of the orchestrator, not pretending TRT doesn't touch pipeline state.

**Caches the helper owns:** `_trt_taesd_paths`, `_trt_controlnet_paths`, `_trt_unet_paths`, `_trt_unet_engine`, `_trt_controlnet_engine`, the `acceleration_mode` last-applied value, and the `_trt_cache` adapter handles. The module-scope `_trt_cache._CACHE` stays where it is — it must survive plugin reinit.

**Pipeline-side after extraction:**
```python
self.trt = TRTLifecycle(device=self.device, dtype=self.dtype)
# in _ensure_pipe_loaded / _swap_model:
self.trt.attach(self, self.sdxl)
# in __call__'s pre-inference setup:
self.trt.ensure_engines(config, want_control=...)
```

**Testing checkpoint after this extraction:**
1. Cold-load each model with `acceleration_mode="trt"`: SD-Turbo, SDXL-Turbo, DMD2.
2. Live-swap from SD-Turbo → SDXL-Turbo → DMD2 → SD-Turbo. Confirm no `context=None` crashes (the band-aid `_ensure_activated` in `_trt/engine.py` should still cover this; if it triggers, that's a regression in the swap teardown path).
3. Toggle ControlNet on SD1.5 + SD-Turbo while running.
4. Switch `acceleration_mode` between `none` / `xformers` / `trt` mid-stream.

---

### Extraction 2: `ModelLoader` → `src/scope_streamdiffusion/model_loader.py`
**Methods to move:**
- `_load_model`
- `_load_preset`
- `_release_pipe_state`
- `_swap_model`
- `_install_sdxl_fp16_vae`
- `_set_taesd`
- `load_lora` (currently a stub — leave as-is, the LoRA plan wires it up)
- `fuse_lora` (stub — same)

**State the helper owns:** the `MODEL_PRESETS` dict (move it to this module), last-loaded `model_id`, last-loaded preset signature, the SDXL fp16 VAE replacement state, TAESD-installed flag.

**Compromise:** like TRT, this writes through to `self.pipe`, `self.unet`, `self.vae`, `self.text_encoder`, `self.text_encoder_2`, `self.tokenizer`, `self.tokenizer_2`, `self.scheduler`, `self.sdxl`. Use the back-reference; the goal is consolidation, not purity.

**Order matters in `attach`/swap flow:** ModelLoader runs first, then PromptEncoder.attach, then ControlNetHandler.attach, then TRTLifecycle.attach. Document this in a comment at the top of `pipeline._ensure_pipe_loaded`.
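
A hypothetical sketch of that flow (attribute names are assumptions; match whatever the extractions actually use):

```python
def _ensure_pipe_loaded(self, config):
    # Order matters: loader first (creates/swaps the pipe), TRT last
    # (engines must see the final weights, including any fused LoRAs).
    self.loader.attach(self, self.sdxl)
    self.prompts.attach(self, self.sdxl)
    self.controlnet_handler.attach(self, self.sdxl)
    self.trt.attach(self, self.sdxl)
```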

**Testing checkpoint:**
1. Cold load each preset.
2. Swap each direction. Verify no double-loaded models in VRAM (`nvidia-smi` while swapping).
3. Verify SDXL fp16 VAE replacement still happens on SDXL-Turbo and DMD2.
4. Verify TAESD eager and TRT both still work.

---

### Extraction 3: `InferenceCore` → `src/scope_streamdiffusion/inference_core.py`
**Methods to move:**
- `_set_timesteps`
- `_initialize_noise`
- `_setup_seed_transition`
- `_slerp_noise`
- `_advance_seed_transition`
- `_cancel_seed_transition`
- `_encode_image`
- `_decode_image`
- `_add_noise`
- `_scheduler_step_batch`
- `_unet_step`
- `_predict_x0_batch`

**State the helper owns:** `alpha_prod_t_sqrt`, `beta_prod_t_sqrt`, `c_skip`, `c_out`, `sub_timesteps_tensor`, `init_noise`, `x_t_latent_buffer`, the seed-transition fields (`_pending_seed`, `_transition_remaining`, etc.).

**Reads (not writes) from pipeline:** `self.pipe.prompts.prompt_embeds`, `self.pipe.unet`, `self.pipe.controlnet`, `self.pipe.controlnet_input`, `self.pipe.vae`, `self.pipe.scheduler`. Pass these through the back-reference.

**`__call__` after this extraction shrinks to roughly:**
```python
def __call__(self, **kwargs):
    config = self._validate_config(kwargs)
    self._prepare_runtime_state(config)
    video = kwargs.get("video")
    self.prompts.encode_for_frame(...)
    if self.controlnet_handler:
        self.controlnet_handler.update(...)
    latent = self.inference.run_step(video, config)
    return {"video": self.inference.to_scope_format(latent)}
```

**Testing checkpoint:** full smoke test — every model × (txt2img / img2img / loopback) × (eager / xformers / TRT) × (with/without negative prompt) × seed transitions.

## Cross-cutting Rules
- **Don't change behavior.** This is a pure move. If you find a bug, note it in a comment — fix it in a separate commit after the refactor lands.
- **Commit per extraction.** Three commits. Each must pass the testing checkpoint before moving to the next.
- **Don't extract `__init__`, `prepare`, `_prepare_runtime_state`, `__call__`, `get_config_class`, or the schema-driven setters.** These are the orchestrator's job.
- **Don't add abstract base classes or interfaces** for the helpers. Three concrete classes is fine.
- **Don't introduce a `BaseHelper` parent class.** They share a pattern, not behavior.