Turbo-only simplification + DMD2 preset + SDXL TRT support#2
Five fixes that together let LCM-LoRA'd SD1.5, SDXL, SDXL-Turbo, and
Dreamshaper variants produce sharp deterministic txt2img output:
1. Auto-fuse the matching LCM LoRA (lcm-lora-sdv1-5 / lcm-lora-sdxl) for
non-Turbo bases. The schema declared use_lcm_lora but nothing wired it
up, so non-Turbo models were running LCMScheduler with un-distilled
UNet weights and outputting yellow/black blobs.
2. Swap SDXL's stock VAE for madebyollin/sdxl-vae-fp16-fix on load. The
stock VAE decodes NaN in fp16, so every SDXL frame was pure black.
3. SDXL conditioning (add_text_embeds, add_time_ids) now broadcasts to
the current batch size when t_index_list has multiple entries.
4. Per-family default num_inference_steps: 1 for sd-turbo proper, 4 for
everything else. Single-step at t=999 only converges for the model
distilled for that exact regime; SDXL-Turbo / Dreamshaper-XL-Turbo /
non-Turbo + LCM LoRA are blurry at 1 step and sharp at 4. Exposed as
the "Inference Steps" UI slider with an "Auto Inference Steps"
toggle to defer to per-family suggestion.
5. Two text-mode bugs in __call__:
- Image-loopback was implicit ("video missing AND prev_image_result
exists"), making each frame feed its previous output back as input
and drift to over-saturated abstract patterns. Now opt-in only.
- Input latent used unseeded torch.randn each call, so seed=42 still
produced a different scene per frame. Now reuses the seeded
init_noise[0:1] for stable, deterministic output.
Verified across sd-turbo, SD1.5, Dreamshaper-8, SDXL-Turbo, SDXL-Base,
and Dreamshaper-XL-v2-Turbo at 512 / 1024.
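For reference, a minimal sketch of what items 1 and 2 above amount to, assuming a diffusers pipeline. The helper name `_attach_lcm_lora` matches the one referenced later in this PR, but the bodies here are illustrative rather than the exact plugin code.

```python
import torch
from diffusers import AutoencoderKL

LCM_LORA_BY_FAMILY = {
    "sd15": "latent-consistency/lcm-lora-sdv1-5",
    "sdxl": "latent-consistency/lcm-lora-sdxl",
}

def _attach_lcm_lora(pipe, family: str) -> None:
    # Non-Turbo bases need the distilled LoRA fused before LCMScheduler
    # can converge at low step counts (item 1).
    pipe.load_lora_weights(LCM_LORA_BY_FAMILY[family])
    pipe.fuse_lora()

def _fix_sdxl_vae(pipe) -> None:
    # The stock SDXL VAE decodes NaN in fp16 -> black frames (item 2).
    pipe.vae = AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
    ).to(pipe.device)
```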
Limits the UI to the six model IDs verified to produce sharp output via this pipeline's auto-LCM-LoRA + fp16-fix-VAE plumbing. Also adds the field to the UI surface (was previously schema-only).
The schema field is named `model_id_or_path` and Scope's pipeline_manager merges schema defaults into __init__ kwargs by their declared name, but __init__ only read `model_id` — so picking a model in the UI was silently ignored and the default reloaded every time.
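A minimal sketch of the fix, assuming kwargs arrive keyed by the schema-declared name; `DEFAULT_MODEL_ID` is a placeholder, not the plugin's actual constant.

```python
DEFAULT_MODEL_ID = "stabilityai/sd-turbo"  # placeholder default

def __init__(self, **kwargs):
    # Read the field under its declared schema name so the merged default
    # (and any UI selection) actually reaches the loader.
    self.model_id = kwargs.get("model_id_or_path", DEFAULT_MODEL_ID)
```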
Scope routes model_id_or_path through setNodeParams (the runtime/kwargs path), not through pipeline/load. Previously __call__ ignored the incoming value, so picking a different model in the UI updated logs but left the original weights loaded. Detect a mismatch against self.model_id and reload the weights in place — re-attaching the LCM LoRA / fp16-fix VAE per family, freeing the old pipe first to avoid 2x VRAM, and invalidating prompt / timestep / noise caches so the next frame rebuilds against the new model.
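Roughly what the in-place reload looks like; the cache attribute names here are assumptions standing in for the real ones.

```python
import gc
import torch

def maybe_swap_model(plugin, incoming_model_id: str) -> None:
    """Reload weights in place when the UI-selected model differs from the loaded one."""
    if incoming_model_id == plugin.model_id:
        return
    del plugin.pipe                       # free the old weights first: never hold 2x VRAM
    gc.collect()
    torch.cuda.empty_cache()
    plugin.pipe = plugin._load_model(incoming_model_id)  # re-attaches LCM LoRA / fp16-fix VAE per family
    plugin.model_id = incoming_model_id
    # Stale caches were built against the old model; drop them so the next
    # frame rebuilds prompt embeds, timesteps, and init noise.
    plugin.prompt_cache = None
    plugin.timestep_cache = None
    plugin.init_noise = None
```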
…e field Use the same config/kwargs lookup path that strength, seed, etc. use, instead of a hand-rolled kwargs.get() ahead of the rest of __call__.
… channels StreamDiffusion's batch denoising emits one frame per __call__ but each frame is at a different t_index in the cycle (frame i -> t_index i mod N). Across video that smooths out; for a steady text prompt it shows up as N different denoising stages flashing one after another. Switch to sequential denoising (all N steps inside one __call__) when there's no video input and the schedule has >1 step, so each frame is one fully denoised image.
…'t flash channels" This reverts commit 8315c82.
Adds _predict_x0_serial as a sibling of _predict_x0_batch and routes to it when num_inference_steps > 1 in steady-prompt modes (no video input, or explicit image_loopback) with ControlNet off. Walks the full N-step LCM schedule inside one __call__, so each emitted frame is one fully denoised image instead of one slot of the rolling N-track buffer cycle that otherwise flashes N different attractors at the camera.
The batch path still owns:
- num_inference_steps == 1 (degenerates to one UNet call anyway, and it's the path SD-Turbo and the depth/scribble ControlNet pre-passes expect)
- video input / v2v streams (where the buffer reuse trick actually amortises across consecutive related frames — its design point)
- ControlNet streams (same reasoning)
Routing decision is a single boolean (`use_serial`) computed alongside the other extracted params; the rest of __call__ branches on it exactly twice — once to skip auto-noising the encoded image (serial adds its own noise based on `strength`) and once to pick the predict function. Batch path is untouched.
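The routing boolean, extracted as a standalone sketch (parameter names are descriptive stand-ins for the values __call__ actually pulls from kwargs):

```python
def choose_serial_path(num_inference_steps: int,
                       has_video_input: bool,
                       controlnet_active: bool) -> bool:
    """True -> walk the full N-step schedule in one __call__ (_predict_x0_serial);
    False -> keep the rolling-buffer batch path (_predict_x0_batch)."""
    return (
        num_inference_steps > 1      # 1-step degenerates to one UNet call either way
        and not has_video_input      # v2v streams are the buffer trick's design point
        and not controlnet_active    # ControlNet streams stay on the batch path
    )
```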
Scope rebuilds the plugin instance on every graph edit, which clears the in-memory `_trt_*_built` flags and forces a per-engine deserialize/bind cycle (visible stalls of hundreds of ms to seconds, plus the rare full ONNX→TRT compile). Hold the built adapters at module scope keyed by the graph node id so the new instance can swap them straight back in.
- New `_trt_cache.py`: `CachedTRTState` (cuda_stream, unet_adapter, unet_has_controlnet, cn_adapters dict, taesd_adapter) keyed by `node:<id>`, with signature `(model_id, height, width)` so a real config change still triggers a clean rebuild.
- `pipeline.py`: read `node_id` from kwargs (Scope must pass it through; until that lands, falls back to `_anon_<model_id>` — correct for the single-SD-node case). At first `_ensure_trt_*` call, look up the cache; on hit, swap `self.unet` / `self.controlnet` / `self.vae` to the cached adapter and skip the build. On miss, build then write back.
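The cache module is roughly this shape (field names follow the description above; treat the exact layout as a sketch):

```python
from dataclasses import dataclass, field

@dataclass
class CachedTRTState:
    signature: tuple            # (model_id, height, width) — mismatch forces a clean rebuild
    cuda_stream: object
    unet_adapter: object
    unet_has_controlnet: bool
    cn_adapters: dict = field(default_factory=dict)
    taesd_adapter: object | None = None

_CACHE: dict[str, CachedTRTState] = {}  # keyed by "node:<id>" (or "_anon_<model_id>")

def get(key: str, signature: tuple) -> CachedTRTState | None:
    state = _CACHE.get(key)
    return state if state is not None and state.signature == signature else None

def put(key: str, state: CachedTRTState) -> None:
    _CACHE[key] = state
```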
Replaces the Literal[256, 320, ...] annotation on width/height with a Resolution IntEnum and a `mode='before'` field_validator that coerces ints into enum members and raises a clear error listing all allowed values otherwise. Pipeline code already wraps width/height in `int()`, so behavior is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
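A minimal pydantic-v2 sketch of the enum plus coercing validator; the member list here is illustrative rather than the full set the schema allows.

```python
from enum import IntEnum
from pydantic import BaseModel, field_validator

class Resolution(IntEnum):
    R256 = 256
    R320 = 320
    R512 = 512
    R1024 = 1024

class SDParams(BaseModel):
    width: Resolution = Resolution.R512
    height: Resolution = Resolution.R512

    @field_validator("width", "height", mode="before")
    @classmethod
    def _coerce_resolution(cls, v):
        try:
            return Resolution(int(v))
        except (TypeError, ValueError):
            allowed = ", ".join(str(m.value) for m in Resolution)
            raise ValueError(f"resolution must be one of: {allowed}") from None
```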
# Conflicts:
#   src/scope_streamdiffusion/pipeline.py
#   src/scope_streamdiffusion/schema.py
Trim model_id_or_path enum to stabilityai/sd-turbo and stabilityai/sdxl-turbo — both 1-step distillations. Drops Dreamshaper, SD 1.5 base, SDXL base, and the Dreamshaper SDXL Turbo variant: keeping the multi-step models meant carrying LCM LoRA fusion + a serial denoise path that we no longer need. Removes num_inference_steps and use_suggested_num_inference_steps fields: both are dead now that step count is fixed at 1 for every supported model. LoRA-based step distillation (Hyper-SD / Lightning) on arbitrary checkpoints is the better path forward — tracked separately, not in this change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the schema only allows SD-Turbo and SDXL-Turbo, the runtime can shed everything that existed to make non-Turbo models usable at low step counts:
- self.sd_turbo flag (everything is Turbo now) and the per-family step-count branch in __call__
- _attach_lcm_lora() and its call sites in __init__ / _swap_model (LCM LoRA was only fused for non-Turbo SD 1.5 / SDXL bases)
- _predict_x0_serial() and the use_serial branch in __call__ — serial denoise was added for steady-prompt txt2img / image-loopback on multi-step models; with 1-step Turbo it never fires
- denoising_steps_num > 1 dead branches in _prepare_runtime_state and _predict_x0_batch (always 1 now)
- num_inference_steps plumbing — pinned at 1 in __call__
Untouched: TRT engine swap, ControlNet handling, prompt transitions, RCFG, mask compositing, hot-swap between sd-turbo and sdxl-turbo, and the SDXL fp16-fix VAE swap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a curated 1-step model option that isn't a direct HuggingFace repo: SDXL-base with the DMD2-distilled UNet (tianweiy/DMD2) swapped in. DMD2 generally outperforms SDXL-Turbo on FID/CLIP per the paper, while staying on the same LCMScheduler at 1 step that all our existing TRT/runtime infra is built around.
Introduces a MODEL_PRESETS dict at module scope as the extension point for future Turbo-class additions:
- 'unet_swap' shape — base pipeline + distilled UNet checkpoint. Used here for DMD2; DMD2 retrained the UNet via distribution matching, so it ships as a UNet, not a LoRA.
- Future shapes documented inline: 'lora' (Hyper-SD / SDXL-Lightning step-distillation LoRAs), 'scheduler' override, 'timesteps_override'. Hyper-SD-1step / Lightning-1step both need TCD / Euler schedulers, which require a `_set_timesteps` refactor (the current path calls LCM-specific `get_scalings_for_boundary_condition_discrete` and reads `scheduler.alphas_cumprod` directly). That refactor is out of scope for this PR.
The fp16-fix VAE swap, TRT cache keying, hot-swap, and rolling-buffer denoise math are all untouched — DMD2's UNet is architecturally an SDXL UNet, so everything downstream of `_load_model` is identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
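The preset entry for DMD2 looks roughly like this (the checkpoint filename is an assumption about the tianweiy/DMD2 repo layout, not something taken from this PR):

```python
MODEL_PRESETS = {
    "dmd2-sdxl-1step": {
        "base_model": "stabilityai/stable-diffusion-xl-base-1.0",
        "unet_swap": {                       # 'unet_swap' shape: base pipeline + distilled UNet
            "repo": "tianweiy/DMD2",
            "filename": "dmd2_sdxl_1step_unet_fp16.bin",  # assumed filename
        },
        "timesteps_override": [399],         # DMD2's documented SDXL 1-step training timestep
    },
    # Future shapes (documented inline in pipeline.py): 'lora' for Hyper-SD /
    # SDXL-Lightning step-distillation LoRAs, 'scheduler' override.
}
```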
Distilled-UNet repos like tianweiy/DMD2 ship weights only — no config.json — because the architecture is identical to the base UNet. UNet2DConditionModel.from_pretrained needs a config and bails with 'tianweiy/DMD2 does not appear to have a file named config.json'. Switch to: load the base SDXL pipeline (gets a correctly-configured UNet module), download the DMD2 checkpoint via hf_hub_download, then override the UNet's state_dict in place. Verified end-to-end with a 300-frame sun→moon morph render at fp16, no acceleration: 6 fps eager, output matches expected DMD2 quality. Same pattern works for SDXL-Lightning's 1-step UNet variant once the scheduler refactor lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
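The load pattern described above, sketched with diffusers + huggingface_hub (again, the DMD2 filename is an assumption):

```python
import torch
from diffusers import StableDiffusionXLPipeline
from huggingface_hub import hf_hub_download

# The base pipeline supplies a correctly-configured SDXL UNet module.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
# DMD2 ships weights only (no config.json); the architecture matches the base
# UNet, so load the checkpoint straight into the existing module.
ckpt_path = hf_hub_download("tianweiy/DMD2", "dmd2_sdxl_1step_unet_fp16.bin")
pipe.unet.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
```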
DMD2-1step is distilled at a single specific timestep. Letting LCMScheduler pick the default 1-step (~979, near pure-noise endpoint) feeds the model a timestep it was never trained on and produces garbage — visually a blurry monochrome blob with no recognizable features. Add a `timesteps_override` field to MODEL_PRESETS and have `_set_timesteps` honor it when present. With the override pinned at [399] (the DMD2 paper's documented training timestep for SDXL 1-step), the model produces clean photographic output: a recognizable sun / moon with proper composition, contrast, and detail. Same mechanism will land Hyper-SDXL-1step (timesteps=[800]) once the broader scheduler-class refactor on feat/scheduler-refactor catches up; this commit just gets DMD2 to a usable state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
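How the override plugs into timestep setup, as a sketch; the real _set_timesteps carries more machinery than this.

```python
import torch

def resolve_timesteps(scheduler, num_inference_steps: int, device,
                      timesteps_override=None) -> torch.Tensor:
    if timesteps_override is not None:
        # DMD2-1step was distilled at t=399; the scheduler's default 1-step
        # pick (~979, near the pure-noise endpoint) is out of distribution.
        return torch.tensor(timesteps_override, dtype=torch.long, device=device)
    scheduler.set_timesteps(num_inference_steps, device=device)
    return scheduler.timesteps
```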
Adds the missing SDXL-shaped TRT path so acceleration_mode='trt' works
on SDXL-Turbo and DMD2-distilled UNets. Eager-only on SDXL was a
pre-existing limitation: ONNX export crashed in get_aug_embed because
the export wrapper passed added_cond_kwargs=None instead of the SDXL
{text_embeds, time_ids} dict. End-to-end this commit:
* UNetSDXL I/O spec — 5 inputs (sample, timestep, encoder_hidden_states,
text_embeds dim=1280, time_ids dim=6) instead of SD 1.5's 3.
* UNetSDXLExportWrapper — wraps the diffusers UNet so text_embeds/time_ids
are positional args for ONNX trace, reconstructed into added_cond_kwargs
at the inner forward.
* UNet2DConditionModelSDXLEngine — runtime engine wrapper feeding all 5
named inputs to the TRT context.
* compile_unet_sdxl — same shape as compile_unet but routes through the
SDXL wrapper. Skips the polygraphy ONNX optimizer (passes the same path
twice for raw + "opt") because polygraphy's optimizer OOMs on the ~5 GB
SDXL ONNX; TRT's builder does its own graph optimization.
* export_onnx — adds use_external_data flag (torch 2.9 param `external_data`)
so SDXL UNet's >2 GB ONNX serializes correctly. Post-processes the
raw export to consolidate ~1500 per-tensor sidecar files into one
weights.bin: pytorch's per-tensor location-only entries trip TRT's
WeightsContextMemoryMap on certain initializers ("Failed to open file").
* build_unet_sdxl_engine + TRTUNetSDXLAdapter — build/load. Engine is
static-shape (build_dynamic_shape=False) and static-batch (max=1).
SDXL's tactic exploration over a dynamic shape envelope OOMs even on
24 GB VRAM; static-shape collapses the search space enough to fit.
Engine is only valid at the (h,w) it was built for — resolution
changes will rebuild.
* _ConfigShim — gains an `sdxl=True` mode returning the SDXL
cross_attention_dim=2048 and addition_time_embed_dim=256 the
pipeline reads to size add_time_ids. TRTUNetSDXLAdapter also
fakes an `add_embedding.linear_1.in_features=2816` shim because
the SDXL pipeline introspects that attribute on UNet.
* pipeline._ensure_trt_unet — accepts explicit image_height/width
args. Static engines need the *real* runtime dims at build time;
self.height/self.width are still init defaults (512x512) when this
method runs because _prepare_runtime_state hasn't executed yet.
Pre-emptively setting self.{height,width} would block dims_changed
in _prepare_runtime_state and leave self.latent_{height,width} at
init defaults — engine and inference would mismatch in the other
direction.
* SDXL build flow moves VAE + text encoders to CPU during the TRT
build to free VRAM for the builder's TACTIC_DRAM allocation, then
moves them back. UNet stays on GPU (the ONNX tracer needs it there).
Verified end-to-end on a 4090:
- SDXL-Turbo @ 1024x1024: 91 ms/frame eager → 11 ms/frame TRT (8.3x)
- DMD2-SDXL-1step @ 1024x1024: 91 ms/frame eager → 11 ms/frame TRT (8.3x)
- Output is byte-different but visually equivalent to eager, confirming
correct numerical behavior.
Build-time prerequisites the wheel install model alone doesn't satisfy
(documented in trt_engines.py header):
- LD_LIBRARY_PATH must include the venv's tensorrt_libs at process
exec time. Loader's lazy dlopen of the per-SM kernel library
(libnvinfer_builder_resource_smXX.so.10.x) bypasses ldconfig because
those libs have a do_not_link_against_* SONAME, so cache lookup by
filename fails. ctypes preload from inside Python is too late —
the dynamic linker reads LD_LIBRARY_PATH at exec time only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
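The export wrapper is essentially this (a sketch; the real UNetSDXLExportWrapper lives in the TRT module and may differ in detail):

```python
import torch

class UNetSDXLExportWrapper(torch.nn.Module):
    """Expose SDXL's added-conditioning tensors as positional inputs so the
    ONNX trace sees them, then rebuild added_cond_kwargs for the real forward."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, text_embeds, time_ids):
        return self.unet(
            sample,
            timestep,
            encoder_hidden_states=encoder_hidden_states,
            added_cond_kwargs={"text_embeds": text_embeds, "time_ids": time_ids},
            return_dict=False,
        )[0]
```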
Static-shape engines locked the build to a single (h, w) — any resolution or aspect-ratio change required a 5–10 min rebuild. Replaced with a dynamic-shape build over the [512, 1024] envelope on both axes. Same cached engine now serves any in-range resolution.
Verified end-to-end: a single cached engine handles 1024x1024, 1024x768 (landscape), and 768x1024 (portrait) without a rebuild. Composition adapts to the aspect (wide horizon vs. tall cloud column).
Trade-offs vs. static-shape:
- Steady-state at the opt point (1024x1024): 11 ms/frame → 14 ms/frame. ~27% slowdown for the flexibility, expected.
- Build memory: 512-1024 envelope on a 24 GB card with VAE+text-encoders on CPU during build → fits cleanly. Wider envelopes (256-1024) blew past the budget; 512-1024 is the practical sweet spot.
- Engine size: ~5.2 GB on disk (similar to static).
Cache key now encodes the resolution range (`unet_sdxl_b1-1_h512-1024_w512-1024`) instead of the opt point, so engines don't collide across resolution choices and any in-range run hits the same cached file. Static batch (max=1) is kept — guidance_scale=0 is the only mode for Turbo / DMD2, so dynamic batch would just double workspace cost for no inference benefit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
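The dynamic-shape envelope translates into a single TRT optimization profile over the latent dims (pixels ÷ 8); a sketch assuming the input names from the UNetSDXL I/O spec above:

```python
import tensorrt as trt

def add_sdxl_unet_profile(builder: trt.Builder, config: trt.IBuilderConfig) -> None:
    # Only `sample` carries dynamic dims; batch stays pinned at 1 and the
    # conditioning inputs keep their static shapes.
    profile = builder.create_optimization_profile()
    profile.set_shape(
        "sample",
        (1, 4, 64, 64),     # min: 512x512
        (1, 4, 128, 128),   # opt: 1024x1024 (the steady-state point quoted above)
        (1, 4, 128, 128),   # max: 1024x1024
    )
    config.add_optimization_profile(profile)
```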
- Update acceleration_mode description: actual measured speedup is 2-8x (was "~2-3x"), and SDXL engines have a different envelope (512-1024, batch=1) than SD 1.5 (256-1024, batch 1-4) due to the 24 GB build budget. Also call out the SDXL + ControlNet + TRT NotImplementedError so users hit it in the docs rather than as a runtime surprise.
- Remove HANDOFF_TURBO_ONLY.md. The PR scope expanded well past "Turbo-only simplification": it now covers the DMD2 preset, scheduler timestep override, and full SDXL TRT path with dynamic shape. The earlier handoff text is misleading.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
What started as a "trim the multi-model branch back to Turbo only" PR
has grown into a full SDXL/TRT enablement push. End-to-end: the plugin
now ships a tested SDXL + DMD2 path with dynamic-shape TRT acceleration
at 1024×1024.
What's in the dropdown now
- `stabilityai/sd-turbo` — SD 1.5 1-step, eager + TRT (existing path).
- `stabilityai/sdxl-turbo` — SDXL 1-step, eager + TRT (new in this PR).
- `dmd2-sdxl-1step` — SDXL base + DMD2 distilled UNet, eager + TRT (new in this PR). Preset entry handles the model assembly + the `[399]` training timestep override DMD2 needs.
Verified perf on a 4090 at 1024×1024
Dynamic-shape engine handles 1024×1024, 1024×768 (landscape), and
768×1024 (portrait) with no rebuild — confirmed.
What's removed
- Non-Turbo / multi-step runtime paths (`_attach_lcm_lora`, `_predict_x0_serial`, `denoising_steps_num > 1` branches). All supported models are 1-step distillations now.
- `num_inference_steps` and `use_suggested_num_inference_steps` schema fields (dead at 1-step).
- `Lykon/dreamshaper-*`, `stable-diffusion-v1-5/...`, `stabilityai/stable-diffusion-xl-base-1.0` entries from the dropdown.
- `HANDOFF_TURBO_ONLY.md` (committed early when scope was just Turbo).
What's added
Schema / pipeline (the easy half)
- `MODEL_PRESETS` dict in `pipeline.py` as the extension point for curated multi-piece recipes. DMD2 lives there as `(base_model, unet_swap, timesteps_override)`.
- `_load_preset()` for `unet_swap`-shape recipes: load SDXL base → download DMD2 UNet via `hf_hub_download` → override `pipe.unet.state_dict`. DMD2's repo ships weights only (no `config.json`), so `from_pretrained` doesn't work directly.
- `timesteps_override` plumbed through `_set_timesteps`. DMD2-1step is distilled at t=399 specifically; LCMScheduler's default 1-step picks ~t=979 and gets garbage out.
TRT (the hard half)
- `UNetSDXL` I/O spec + `UNetSDXLExportWrapper` — adds `text_embeds` (1280) and `time_ids` (6) as named ONNX inputs so SDXL's `get_aug_embed` doesn't crash on `added_cond_kwargs=None` during export.
- `UNet2DConditionModelSDXLEngine` runtime adapter feeding all 5 inputs.
- `compile_unet_sdxl` skips polygraphy's ONNX optimizer (passes the same path twice for raw + opt) — polygraphy OOMs on the ~5 GB SDXL ONNX, and TRT's builder does its own graph optimization anyway.
- `export_onnx` gains `external_data=True` (torch 2.9 name; was `use_external_data_format` pre-2.5) so the >2 GB SDXL UNet ONNX serializes correctly. Post-processes the raw export to consolidate the ~1500 per-tensor sidecar files into one `.weights` blob — pytorch's location-only entries trip TRT's `WeightsContextMemoryMap` on certain initializers ("Failed to open file" on a file that exists).
- `build_unet_sdxl_engine` + `TRTUNetSDXLAdapter`. Dynamic-shape build over [512, 1024] on both axes, static batch=1 (guidance_scale=0 means inference never uses batch>1; dynamic batch would just double workspace cost). Engine cache key encodes the resolution range so in-range runs hit the same cached file.
- `_ensure_trt_unet` accepts explicit `image_height`/`image_width` args. `_prepare_runtime_state` (which sets `self.height`/`self.width`) hasn't run yet when this method fires, so without explicit dims the build sized for `__init__` defaults (512×512) and mismatched at inference. Pre-emptively setting `self.{height,width}` would block `dims_changed` in `_prepare_runtime_state` and leave latent dims stale — explicit args are the cleanest cut.
- Build flow moves the VAE + text encoders to CPU during the TRT build to free VRAM for TRT's TACTIC_DRAM allocation, then back. UNet stays on GPU (the ONNX tracer needs it there).
- `_ConfigShim(sdxl=True)` returns the SDXL `cross_attention_dim=2048` and `addition_time_embed_dim=256`. The SDXL adapter also stubs `add_embedding.linear_1.in_features=2816` because the pipeline introspects that surface to size `add_time_ids`.
Host-side prerequisites for TRT
These were discovered the hard way and aren't in the wheel install flow (yet) — I left them as deployment notes:
- `LD_LIBRARY_PATH` must include the venv's `tensorrt_libs` dir at process exec time. TRT lazy-dlopens `libnvinfer_builder_resource_smXX.so.10.x` during engine build. That lib has SONAME `do_not_link_against_*` so the ldconfig cache misses on filename lookup. ctypes preload from inside Python is too late — the dynamic linker reads `LD_LIBRARY_PATH` at exec time only.
- sublibs but isn't sufficient on its own (per Add Daydream node page link to README #1).
- the main libs but don't cover the SM-specific lazy-loaded ones.
Known limitations (deferred)
- SDXL + ControlNet + TRT raises `NotImplementedError`. The existing `UNetWithControlInputs` and `ControlNet` model specs are SD-1.5-shaped (12 down residuals at the (320, 640, 1280, 1280) pattern); SDXL has a different shape. Doable but ~500 LoC of new TRT plumbing mirroring the existing SD 1.5 ControlNet path.
- Dynamic batch for SDXL engines: would double TRT workspace cost; with guidance_scale=0 always being the Turbo / DMD2 mode, no inference benefit.
- Resolution envelope stays at 512–1024: wider ranges blew the build memory budget on a 24 GB card. Practical sweet spot.
- Terse prompts ("the sun"): weaker output. Both produce excellent output on descriptive prompts (oil-painting style + lighting cues). Documented in the schema description.
Test plan
- `python -m py_compile src/scope_streamdiffusion/*.py src/scope_streamdiffusion/_trt/*.py` — clean
- `acceleration_mode='trt'` on SDXL-Turbo / DMD2 in a real Scope session (not just standalone scripts)
🤖 Generated with Claude Code