
Turbo-only simplification + DMD2 preset + SDXL TRT support #2

Open

happyFish wants to merge 21 commits into main from sd-multi-model

Conversation


happyFish (Owner) commented May 5, 2026

Summary

What started as a "trim the multi-model branch back to Turbo only" PR
has grown into a full SDXL/TRT enablement push. End-to-end: the plugin
now ships a tested SDXL + DMD2 path with dynamic-shape TRT acceleration
at 1024×1024.

What's in the dropdown now

  • stabilityai/sd-turbo — SD 1.5 1-step, eager + TRT (existing path).
  • stabilityai/sdxl-turbo — SDXL 1-step, eager + TRT (new in this PR).
  • dmd2-sdxl-1step — SDXL base + DMD2 distilled UNet, eager + TRT
    (new in this PR). The preset entry handles the model assembly plus the
    [399] training-timestep override DMD2 needs.

Verified perf on a 4090 at 1024×1024

Path                             DMD2           SDXL-Turbo
Eager fp16 + xformers            ~91 ms/frame   ~91 ms/frame
TRT static (single resolution)   11 ms/frame    11 ms/frame
TRT dynamic 512-1024 (this PR)   13 ms/frame    14 ms/frame

Dynamic-shape engine handles 1024×1024, 1024×768 (landscape), and
768×1024 (portrait) with no rebuild — confirmed.

What's removed

  • Multi-step LCM-LoRA path (_attach_lcm_lora, _predict_x0_serial,
    denoising_steps_num > 1 branches). All supported models are
    1-step distillations now.
  • num_inference_steps and use_suggested_num_inference_steps schema
    fields (dead at 1-step).
  • The Lykon/dreamshaper-*, stable-diffusion-v1-5/...,
    stabilityai/stable-diffusion-xl-base-1.0 entries from the dropdown.
  • HANDOFF_TURBO_ONLY.md (committed early when scope was just Turbo).

What's added

Schema / pipeline (the easy half)

  • MODEL_PRESETS dict in pipeline.py as the extension point for
    curated multi-piece recipes. DMD2 lives there as
    (base_model, unet_swap, timesteps_override).
  • _load_preset() for unet_swap-shape recipes:
    load SDXL base → download the DMD2 UNet via hf_hub_download →
    override pipe.unet's state_dict in place (sketch after this list).
    DMD2's repo ships weights only (no config.json), so from_pretrained
    doesn't work directly.
  • timesteps_override plumbed through _set_timesteps. DMD2-1step is
    distilled at t=399 specifically; LCMScheduler's default 1-step
    picks ~t=979 and gets garbage out.
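
A minimal sketch of the preset shape and the unet_swap load path described above — the dict key names and the DMD2 checkpoint filename are illustrative, not necessarily the plugin's exact ones:

```python
import torch
from huggingface_hub import hf_hub_download

# Shape of the preset table; the DMD2 checkpoint filename is illustrative.
MODEL_PRESETS = {
    "dmd2-sdxl-1step": {
        "base_model": "stabilityai/stable-diffusion-xl-base-1.0",
        "unet_swap": ("tianweiy/DMD2", "dmd2_sdxl_1step_unet_fp16.bin"),
        "timesteps_override": [399],
    },
}

def _load_preset(pipeline_cls, preset_name, dtype=torch.float16):
    spec = MODEL_PRESETS[preset_name]
    # The base pipeline provides a correctly configured SDXL UNet module...
    pipe = pipeline_cls.from_pretrained(spec["base_model"], torch_dtype=dtype)
    # ...then the distilled checkpoint replaces its weights in place, since
    # the DMD2 repo ships weights only (no config.json for from_pretrained).
    repo_id, filename = spec["unet_swap"]
    ckpt = torch.load(hf_hub_download(repo_id, filename), map_location="cpu")
    pipe.unet.load_state_dict(ckpt)
    return pipe, spec.get("timesteps_override")
```

Because DMD2's UNet is architecturally a stock SDXL UNet, nothing downstream of the load changes.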

TRT (the hard half)

  • UNetSDXL I/O spec + UNetSDXLExportWrapper — adds
    text_embeds (1280) and time_ids (6) as named ONNX inputs so
    SDXL's get_aug_embed doesn't crash on added_cond_kwargs=None
    during export (wrapper sketched after this list).
  • UNet2DConditionModelSDXLEngine runtime adapter feeding all 5
    inputs.
  • compile_unet_sdxl skips polygraphy's ONNX optimizer (passes the
    same path twice for raw + opt) — polygraphy OOMs on the ~5 GB SDXL
    ONNX, and TRT's builder does its own graph optimization anyway.
  • ONNX export uses external_data=True (torch 2.9 name; was
    use_external_data_format pre-2.5) so the >2 GB SDXL UNet ONNX
    serializes correctly. Post-processes the raw export to consolidate
    the ~1500 per-tensor sidecar files into one .weights blob —
    pytorch's location-only entries trip TRT's WeightsContextMemoryMap
    on certain initializers ("Failed to open file" on a file that exists).
  • build_unet_sdxl_engine + TRTUNetSDXLAdapter. Dynamic-shape build
    over [512, 1024] on both axes, static batch=1 (guidance_scale=0
    means inference never uses batch>1, dynamic batch would just double
    workspace cost). Engine cache key encodes the resolution range so
    in-range runs hit the same cached file.
  • _ensure_trt_unet accepts explicit image_height / image_width
    args. _prepare_runtime_state (which sets self.height / self.width)
    hasn't run yet when this method fires, so without explicit dims the
    build sized for __init__ defaults (512×512) and mismatched at
    inference. Pre-emptively setting self.{height,width} would block
    dims_changed in _prepare_runtime_state and leave latent dims
    stale — explicit-args is the cleanest cut.
  • During the build, VAE + text encoders are moved to CPU to free VRAM
    for TRT's TACTIC_DRAM allocation. UNet stays on GPU (the ONNX tracer
    needs it there).
  • _ConfigShim(sdxl=True) returns the SDXL cross_attention_dim=2048
    and addition_time_embed_dim=256. The SDXL adapter also stubs
    add_embedding.linear_1.in_features=2816 because the pipeline
    introspects that surface to size add_time_ids.
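
A sketch of what the export wrapper boils down to, assuming the standard diffusers UNet2DConditionModel call signature; the plugin's class shares the name but may differ in detail:

```python
import torch

class UNetSDXLExportWrapper(torch.nn.Module):
    """Flattens SDXL's added_cond_kwargs into positional tensors for ONNX export."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, text_embeds, time_ids):
        # Rebuild the dict the diffusers UNet expects, so get_aug_embed never
        # sees added_cond_kwargs=None during tracing.
        added_cond_kwargs = {"text_embeds": text_embeds, "time_ids": time_ids}
        return self.unet(
            sample,
            timestep,
            encoder_hidden_states=encoder_hidden_states,
            added_cond_kwargs=added_cond_kwargs,
            return_dict=False,
        )[0]
```

torch.onnx.export then sees five plain tensor inputs, which is what lets text_embeds and time_ids become named ONNX inputs.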

Host-side prerequisites for TRT

These were discovered the hard way and aren't in the wheel install
flow (yet) — I left them as deployment notes:

  1. LD_LIBRARY_PATH must include the venv's tensorrt_libs dir at
    process exec time. TRT lazy-dlopens
    libnvinfer_builder_resource_smXX.so.10.x during engine build.
    That lib has SONAME do_not_link_against_* so the ldconfig cache
    misses on filename lookup. ctypes preload from inside Python is
    too late — the dynamic linker reads LD_LIBRARY_PATH at exec
    time only.
  2. An /etc/ld.so.conf.d entry pointing at venv lib dirs covers cuDNN
    sublibs but isn't sufficient on its own (per #1, "Add Daydream node
    page link to README").
  3. /usr/local/lib symlinks help filename-by-filename dlopen for
    the main libs but don't cover the SM-specific lazy-loaded ones.
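
None of these are wired into the install flow yet. As one hypothetical launcher-side workaround for (1), a pure-Python entrypoint could prepend the venv's tensorrt_libs directory and re-exec itself, since the dynamic linker only reads LD_LIBRARY_PATH at exec time — a sketch, not something this PR ships:

```python
import os
import sys
import sysconfig

def _reexec_with_tensorrt_libs() -> None:
    """Hypothetical workaround: re-exec so the linker sees tensorrt_libs."""
    trt_libs = os.path.join(sysconfig.get_paths()["purelib"], "tensorrt_libs")
    current = os.environ.get("LD_LIBRARY_PATH", "")
    if trt_libs in current.split(":"):
        return  # already visible to the linker; nothing to do
    os.environ["LD_LIBRARY_PATH"] = f"{trt_libs}:{current}" if current else trt_libs
    # Re-exec the same interpreter and argv; the child reads the new env at exec time.
    os.execv(sys.executable, [sys.executable] + sys.argv)
```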

Known limitations (deferred)

  • SDXL + ControlNet + TRT raises NotImplementedError. The
    existing UNetWithControlInputs and ControlNet model specs are
    SD-1.5-shaped (12 down residuals at the (320, 640, 1280, 1280)
    pattern); SDXL has a different shape. Doable but ~500 LoC of new
    TRT plumbing mirroring the existing SD 1.5 ControlNet path.
  • Static batch=1 for SDXL engines. Allowing dynamic batch >1 would
    double TRT workspace cost; with guidance_scale=0 always being the
    Turbo / DMD2 mode, no inference benefit.
  • Build envelope 512-1024 (not 256-1024). Wider envelopes blew the
    build memory budget on a 24 GB card; 512-1024 is the practical
    sweet spot.
  • DMD2 produces visibly weaker output than SDXL-Turbo on short
    prompts ("the sun"). Both produce excellent output on descriptive
    prompts (oil-painting style + lighting cues). Documented in the
    schema description.

Test plan

  • python -m py_compile src/scope_streamdiffusion/*.py src/scope_streamdiffusion/_trt/*.py — clean
  • SDXL-Turbo eager 1024 — Turner-style sun → moon morph renders correctly
  • DMD2 eager 1024 — same morph, comparable output
  • SDXL-Turbo TRT (static + dynamic 512-1024) — 11/14 ms/frame steady, frames visually equivalent to eager
  • DMD2 TRT (static + dynamic 512-1024) — 11/13 ms/frame steady, frames visually equivalent to eager
  • Aspect-ratio test: same dynamic engine renders 1024×1024, 1024×768, 768×1024 cleanly
  • Hot-reload in a running Scope and pick each model from the dropdown
  • Confirm acceleration_mode='trt' on SDXL-Turbo / DMD2 in a real Scope session (not just standalone scripts)

🤖 Generated with Claude Code

happyFish and others added 21 commits April 24, 2026 17:03
Five fixes that together let LCM-LoRA'd SD1.5, SDXL, SDXL-Turbo, and
Dreamshaper variants produce sharp deterministic txt2img output:

1. Auto-fuse the matching LCM LoRA (lcm-lora-sdv1-5 / lcm-lora-sdxl) for
   non-Turbo bases. The schema declared use_lcm_lora but nothing wired it
   up, so non-Turbo models were running LCMScheduler with un-distilled
   UNet weights and outputting yellow/black blobs.

2. Swap SDXL's stock VAE for madebyollin/sdxl-vae-fp16-fix on load. The
   stock VAE decodes NaN in fp16, so every SDXL frame was pure black.

3. SDXL conditioning (add_text_embeds, add_time_ids) now broadcasts to
   the current batch size when t_index_list has multiple entries.

4. Per-family default num_inference_steps: 1 for sd-turbo proper, 4 for
   everything else. Single-step at t=999 only converges for the model
   distilled for that exact regime; SDXL-Turbo / Dreamshaper-XL-Turbo /
   non-Turbo + LCM LoRA are blurry at 1 step and sharp at 4. Exposed as
   the "Inference Steps" UI slider with an "Auto Inference Steps"
   toggle to defer to per-family suggestion.

5. Two text-mode bugs in __call__:
   - Image-loopback was implicit ("video missing AND prev_image_result
     exists"), making each frame feed its previous output back as input
     and drift to over-saturated abstract patterns. Now opt-in only.
   - Input latent used unseeded torch.randn each call, so seed=42 still
     produced a different scene per frame. Now reuses the seeded
     init_noise[0:1] for stable, deterministic output.

Verified across sd-turbo, SD1.5, Dreamshaper-8, SDXL-Turbo, SDXL-Base,
and Dreamshaper-XL-v2-Turbo at 512 / 1024.
Limits the UI to the six model IDs verified to produce sharp output via
this pipeline's auto-LCM-LoRA + fp16-fix-VAE plumbing. Also adds the
field to the UI surface (was previously schema-only).
The schema field is named ``model_id_or_path`` and Scope's pipeline_manager
merges schema defaults into __init__ kwargs by their declared name, but
__init__ only read ``model_id`` — so picking a model in the UI was silently
ignored and the default reloaded every time.
Scope routes model_id_or_path through setNodeParams (the runtime/kwargs
path), not through pipeline/load. Previously __call__ ignored the
incoming value, so picking a different model in the UI updated logs but
left the original weights loaded. Detect a mismatch against self.model_id
and reload the weights in place — re-attaching the LCM LoRA / fp16-fix
VAE per family, freeing the old pipe first to avoid 2x VRAM, and
invalidating prompt / timestep / noise caches so the next frame rebuilds
against the new model.
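
Roughly, the reload path amounts to the following sketch; `_invalidate_runtime_caches` is an illustrative stand-in for clearing the prompt / timestep / noise caches, and error handling is omitted:

```python
import torch

def _maybe_swap_model(self, kwargs: dict) -> None:
    """Reload weights in place when the UI picks a different model (sketch)."""
    incoming = kwargs.get("model_id_or_path", self.model_id)
    if incoming == self.model_id:
        return
    # Drop the old pipe before loading the new one to avoid a 2x VRAM spike.
    old_pipe, self.pipe = self.pipe, None
    del old_pipe
    torch.cuda.empty_cache()
    self._load_model(incoming)         # re-attaches the per-family LCM LoRA / fp16-fix VAE
    self._invalidate_runtime_caches()  # next frame rebuilds against the new model
    self.model_id = incoming
```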
…e field

Use the same config/kwargs lookup path that strength, seed, etc. use,
instead of a hand-rolled kwargs.get() ahead of the rest of __call__.
… channels

StreamDiffusion's batch denoising emits one frame per __call__ but each
frame is at a different t_index in the cycle (frame i -> t_index i mod N).
Across video that smooths out; for a steady text prompt it shows up as
N different denoising stages flashing one after another. Switch to
sequential denoising (all N steps inside one __call__) when there's no
video input and the schedule has >1 step, so each frame is one fully
denoised image.
Adds _predict_x0_serial as a sibling of _predict_x0_batch and routes to
it when num_inference_steps > 1 in steady-prompt modes (no video input,
or explicit image_loopback) with ControlNet off. Walks the full N-step
LCM schedule inside one __call__, so each emitted frame is one fully
denoised image instead of one slot of the rolling N-track buffer cycle
that otherwise flashes N different attractors at the camera.

The batch path still owns:
- num_inference_steps == 1 (degenerates to one UNet call anyway, and
  it's the path SD-Turbo and the depth/scribble ControlNet pre-passes
  expect)
- video input / v2v streams (where the buffer reuse trick actually
  amortises across consecutive related frames — its design point)
- ControlNet streams (same reasoning)

Routing decision is a single boolean (`use_serial`) computed alongside
the other extracted params; the rest of __call__ branches on it
exactly twice — once to skip auto-noising the encoded image (serial
adds its own noise based on `strength`) and once to pick the predict
function. Batch path is untouched.
Scope rebuilds the plugin instance on every graph edit, which clears the
in-memory `_trt_*_built` flags and forces a per-engine deserialize/bind
cycle (visible stalls of hundreds of ms to seconds, plus the rare full
ONNX→TRT compile). Hold the built adapters at module scope keyed by the
graph node id so the new instance can swap them straight back in.

- New `_trt_cache.py`: `CachedTRTState` (cuda_stream, unet_adapter,
  unet_has_controlnet, cn_adapters dict, taesd_adapter) keyed by
  `node:<id>`, with signature `(model_id, height, width)` so a real
  config change still triggers a clean rebuild.
- `pipeline.py`: read `node_id` from kwargs (Scope must pass it through;
  until that lands, falls back to `_anon_<model_id>` — correct for the
  single-SD-node case). At first `_ensure_trt_*` call, look up the
  cache; on hit, swap `self.unet` / `self.controlnet` / `self.vae` to
  the cached adapter and skip the build. On miss, build then write back.
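
A sketch of the shape of that module-scope cache, with field names taken from the commit description and the lookup helpers otherwise assumed:

```python
# _trt_cache.py sketch: module-scope state survives plugin re-instantiation
# because the module itself is only imported once per process.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

@dataclass
class CachedTRTState:
    signature: Tuple[str, int, int]           # (model_id, height, width)
    cuda_stream: Any = None
    unet_adapter: Any = None
    unet_has_controlnet: bool = False
    cn_adapters: Dict[str, Any] = field(default_factory=dict)
    taesd_adapter: Any = None

_CACHE: Dict[str, CachedTRTState] = {}

def get(node_id: str, signature: Tuple[str, int, int]) -> Optional[CachedTRTState]:
    state = _CACHE.get(f"node:{node_id}")
    # A real config change (different model or resolution) invalidates the hit.
    return state if state is not None and state.signature == signature else None

def put(node_id: str, state: CachedTRTState) -> None:
    _CACHE[f"node:{node_id}"] = state
```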
Replaces Literal[256, 320, ...] tuple on width/height with a Resolution
IntEnum and a `mode='before'` field_validator that coerces ints into
enum members and raises a clear error listing all allowed values
otherwise. Pipeline code already wraps width/height in `int()`, so
behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
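
Under pydantic v2 that validator looks roughly like the following; the enum members shown and the surrounding model name are illustrative:

```python
from enum import IntEnum
from pydantic import BaseModel, field_validator

class Resolution(IntEnum):
    # Illustrative subset; the real enum lists every allowed value.
    R256 = 256
    R320 = 320
    R512 = 512
    R768 = 768
    R1024 = 1024

class FrameConfig(BaseModel):
    width: Resolution = Resolution.R512
    height: Resolution = Resolution.R512

    @field_validator("width", "height", mode="before")
    @classmethod
    def _coerce_resolution(cls, v):
        try:
            return Resolution(int(v))
        except ValueError:
            allowed = ", ".join(str(r.value) for r in Resolution)
            raise ValueError(f"resolution must be one of: {allowed}") from None
```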
- Node-id-keyed TRT adapter cache so swapping/replacing graph nodes
  no longer wipes warm engines (25478b7)
- Schema: width/height as Resolution IntEnum + field_validator (7c89f3d)
# Conflicts:
#	src/scope_streamdiffusion/pipeline.py
#	src/scope_streamdiffusion/schema.py
Trim model_id_or_path enum to stabilityai/sd-turbo and stabilityai/sdxl-turbo
— both 1-step distillations. Drops Dreamshaper, SD 1.5 base, SDXL base, and
the Dreamshaper SDXL Turbo variant: keeping the multi-step models meant
carrying LCM LoRA fusion + a serial denoise path that we no longer need.

Removes num_inference_steps and use_suggested_num_inference_steps fields:
both are dead now that step count is fixed at 1 for every supported model.
LoRA-based step distillation (Hyper-SD / Lightning) on arbitrary checkpoints
is the better path forward — tracked separately, not in this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the schema only allows SD-Turbo and SDXL-Turbo, the runtime can
shed everything that existed to make non-Turbo models usable at low step
counts:

- self.sd_turbo flag (everything is Turbo now) and the per-family
  step-count branch in __call__
- _attach_lcm_lora() and its call sites in __init__ / _swap_model
  (LCM LoRA was only fused for non-Turbo SD 1.5 / SDXL bases)
- _predict_x0_serial() and the use_serial branch in __call__ —
  serial denoise was added for steady-prompt txt2img / image-loopback
  on multi-step models; with 1-step Turbo it never fires
- denoising_steps_num > 1 dead branches in _prepare_runtime_state and
  _predict_x0_batch (always 1 now)
- num_inference_steps plumbing — pinned at 1 in __call__

Untouched: TRT engine swap, ControlNet handling, prompt transitions,
RCFG, mask compositing, hot-swap between sd-turbo and sdxl-turbo, and
the SDXL fp16-fix VAE swap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a curated 1-step model option that isn't a direct HuggingFace repo:
SDXL-base with the DMD2-distilled UNet (tianweiy/DMD2) swapped in. DMD2
generally outperforms SDXL-Turbo on FID/CLIP per the paper, while staying
on the same LCMScheduler at 1 step that all our existing TRT/runtime
infra is built around.

Introduces a MODEL_PRESETS dict at module scope as the extension point
for future Turbo-class additions:

- 'unet_swap' shape — base pipeline + distilled UNet checkpoint. Used
  here for DMD2; DMD2 retrained the UNet via distribution matching, so
  it ships as a UNet, not a LoRA.
- Future shapes documented inline: 'lora' (Hyper-SD / SDXL-Lightning
  step-distillation LoRAs), 'scheduler' override, 'timesteps_override'.
  Hyper-SD-1step / Lightning-1step both need TCD / Euler schedulers,
  which require a `_set_timesteps` refactor (the current path calls
  LCM-specific `get_scalings_for_boundary_condition_discrete` and reads
  `scheduler.alphas_cumprod` directly). That refactor is out of scope
  for this PR.

The fp16-fix VAE swap, TRT cache keying, hot-swap, and rolling-buffer
denoise math are all untouched — DMD2's UNet is architecturally an SDXL
UNet, so everything downstream of `_load_model` is identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Distilled-UNet repos like tianweiy/DMD2 ship weights only — no
config.json — because the architecture is identical to the base UNet.
UNet2DConditionModel.from_pretrained needs a config and bails with
'tianweiy/DMD2 does not appear to have a file named config.json'.

Switch to: load the base SDXL pipeline (gets a correctly-configured UNet
module), download the DMD2 checkpoint via hf_hub_download, then override
the UNet's state_dict in place. Verified end-to-end with a 300-frame
sun→moon morph render at fp16, no acceleration: 6 fps eager, output
matches expected DMD2 quality.

Same pattern works for SDXL-Lightning's 1-step UNet variant once the
scheduler refactor lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DMD2-1step is distilled at a single specific timestep. Letting
LCMScheduler pick the default 1-step (~979, near pure-noise endpoint)
feeds the model a timestep it was never trained on and produces
garbage — visually a blurry monochrome blob with no recognizable
features.

Add a `timesteps_override` field to MODEL_PRESETS and have
`_set_timesteps` honor it when present. With the override pinned at
[399] (the DMD2 paper's documented training timestep for SDXL 1-step),
the model produces clean photographic output: a recognizable sun /
moon with proper composition, contrast, and detail.
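
Mechanically, the override is a small branch in `_set_timesteps`, assuming a scheduler whose set_timesteps accepts an explicit timesteps list (recent diffusers LCMScheduler releases do); names here are simplified:

```python
def _set_timesteps(self, scheduler, num_steps, device, timesteps_override=None):
    if timesteps_override is not None:
        # DMD2-1step was distilled at t=399, so pin the schedule there instead
        # of letting LCMScheduler pick its default 1-step timestep (~979).
        scheduler.set_timesteps(timesteps=timesteps_override, device=device)
    else:
        scheduler.set_timesteps(num_steps, device=device)
    return scheduler.timesteps
```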

Same mechanism will land Hyper-SDXL-1step (timesteps=[800]) once the
broader scheduler-class refactor on feat/scheduler-refactor catches
up; this commit just gets DMD2 to a usable state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the missing SDXL-shaped TRT path so acceleration_mode='trt' works
on SDXL-Turbo and DMD2-distilled UNets. Eager-only on SDXL was a
pre-existing limitation: ONNX export crashed in get_aug_embed because
the export wrapper passed added_cond_kwargs=None instead of the SDXL
{text_embeds, time_ids} dict. End-to-end this commit:

  * UNetSDXL I/O spec — 5 inputs (sample, timestep, encoder_hidden_states,
    text_embeds dim=1280, time_ids dim=6) instead of SD 1.5's 3.
  * UNetSDXLExportWrapper — wraps the diffusers UNet so text_embeds/time_ids
    are positional args for ONNX trace, reconstructed into added_cond_kwargs
    at the inner forward.
  * UNet2DConditionModelSDXLEngine — runtime engine wrapper feeding all 5
    named inputs to the TRT context.
  * compile_unet_sdxl — same shape as compile_unet but routes through the
    SDXL wrapper. Skips the polygraphy ONNX optimizer (passes the same path
    twice for raw + "opt") because polygraphy's optimizer OOMs on the ~5 GB
    SDXL ONNX; TRT's builder does its own graph optimization.
  * export_onnx — adds use_external_data flag (torch 2.9 param `external_data`)
    so SDXL UNet's >2 GB ONNX serializes correctly. Post-processes the
    raw export to consolidate ~1500 per-tensor sidecar files into one
    weights.bin (sketch after this list): pytorch's per-tensor location-only
    entries trip TRT's WeightsContextMemoryMap on certain initializers
    ("Failed to open file").
  * build_unet_sdxl_engine + TRTUNetSDXLAdapter — build/load. Engine is
    static-shape (build_dynamic_shape=False) and static-batch (max=1).
    SDXL's tactic exploration over a dynamic shape envelope OOMs even on
    24 GB VRAM; static-shape collapses the search space enough to fit.
    Engine is only valid at the (h,w) it was built for — resolution
    changes will rebuild.
  * _ConfigShim — gains an `sdxl=True` mode returning the SDXL
    cross_attention_dim=2048 and addition_time_embed_dim=256 the
    pipeline reads to size add_time_ids. TRTUNetSDXLAdapter also
    fakes an `add_embedding.linear_1.in_features=2816` shim because
    the SDXL pipeline introspects that attribute on UNet.
  * pipeline._ensure_trt_unet — accepts explicit image_height/width
    args. Static engines need the *real* runtime dims at build time;
    self.height/self.width are still init defaults (512x512) when this
    method runs because _prepare_runtime_state hasn't executed yet.
    Pre-emptively setting self.{height,width} would block dims_changed
    in _prepare_runtime_state and leave self.latent_{height,width} at
    init defaults — engine and inference would mismatch in the other
    direction.
  * SDXL build flow moves VAE + text encoders to CPU during the TRT
    build to free VRAM for the builder's TACTIC_DRAM allocation, then
    moves them back. UNet stays on GPU (the ONNX tracer needs it there).
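
The sidecar consolidation for the export_onnx bullet above can be done with onnx's own external-data support, roughly as follows (paths are placeholders):

```python
import onnx

raw_path = "unet_sdxl_raw.onnx"   # raw export with ~1500 per-tensor sidecar files
out_path = "unet_sdxl.onnx"       # re-saved alongside a single weights blob

model = onnx.load(raw_path, load_external_data=True)
onnx.save_model(
    model,
    out_path,
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="weights.bin",       # one blob for TRT's WeightsContextMemoryMap to map
    size_threshold=1024,
)
```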

Verified end-to-end on a 4090:
  - SDXL-Turbo @ 1024x1024: 91 ms/frame eager → 11 ms/frame TRT (8.3x)
  - DMD2-SDXL-1step @ 1024x1024: 91 ms/frame eager → 11 ms/frame TRT (8.3x)
  - Output is byte-different but visually equivalent to eager, confirming
    correct numerical behavior.

Build-time prerequisites the wheel install model alone doesn't satisfy
(documented in trt_engines.py header):
  - LD_LIBRARY_PATH must include the venv's tensorrt_libs at process
    exec time. Loader's lazy dlopen of the per-SM kernel library
    (libnvinfer_builder_resource_smXX.so.10.x) bypasses ldconfig because
    those libs have a do_not_link_against_* SONAME, so cache lookup by
    filename fails. ctypes preload from inside Python is too late —
    the dynamic linker reads LD_LIBRARY_PATH at exec time only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Static-shape engines locked the build to a single (h, w) — any
resolution or aspect-ratio change required a 5–10 min rebuild. Replaced
with a dynamic-shape build over the [512, 1024] envelope on both axes.
Same cached engine now serves any in-range resolution.

Verified end-to-end: a single cached engine handles 1024x1024,
1024x768 (landscape), and 768x1024 (portrait) without a rebuild.
Composition adapts to the aspect (wide horizon vs. tall cloud column).

Trade-offs vs. static-shape:
- Steady-state at the opt point (1024x1024): 11 ms/frame → 14 ms/frame.
  ~27% slowdown for the flexibility, expected.
- Build memory: 512-1024 envelope on a 24 GB card with VAE+text-encoders
  on CPU during build → fits cleanly. Wider envelopes (256-1024) blew
  past the budget; 512-1024 is the practical sweet spot.
- Engine size: ~5.2 GB on disk (similar to static).

Cache key now encodes the resolution range
(`unet_sdxl_b1-1_h512-1024_w512-1024`) instead of the opt point, so
engines don't collide across resolution choices and any in-range run
hits the same cached file.

Static batch (max=1) is kept — guidance_scale=0 is the only mode for
Turbo / DMD2, so dynamic batch would just double workspace cost for
no inference benefit.
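
For reference, the dynamic-shape envelope reduces to a single TensorRT optimization profile. This sketch covers only the latent sample input and assumes the other four SDXL inputs were exported with static shapes; the rest of the builder setup is omitted:

```python
import tensorrt as trt

def add_sdxl_profile(builder: trt.Builder, config: trt.IBuilderConfig) -> None:
    # SDXL latent dims are image dims / 8; batch stays pinned at 1.
    lo, hi = 512 // 8, 1024 // 8        # 64 .. 128 latent pixels per axis
    profile = builder.create_optimization_profile()
    profile.set_shape("sample", (1, 4, lo, lo), (1, 4, hi, hi), (1, 4, hi, hi))
    config.add_optimization_profile(profile)
```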

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Update acceleration_mode description: actual measured speedup is 2-8x
  (was "~2-3x"), and SDXL engines have a different envelope (512-1024,
  batch=1) than SD 1.5 (256-1024, batch 1-4) due to the 24 GB build
  budget. Also call out the SDXL + ControlNet + TRT NotImplementedError
  so users hit it via doc rather than runtime surprise.
- Remove HANDOFF_TURBO_ONLY.md. The PR scope expanded well past
  "Turbo-only simplification": now covers DMD2 preset, scheduler
  timestep override, full SDXL TRT path with dynamic shape. Earlier
  handoff text is misleading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
happyFish changed the title from "Multi-model dropdown + Turbo-only simplification" to "Turbo-only simplification + DMD2 preset + SDXL TRT support" on May 6, 2026