backport: feature-detect DSA experimental-attention spec for GLM-5/5.1/5.2 #13

Open
yushengsu-thu wants to merge 3 commits into bridge from bridge-dev-glm
@yushengsu-thu yushengsu-thu commented Jun 19, 2026

Collaborator

What

Make the radixark/Megatron-Bridge GLM-5 / 5.1 / 5.2 (glm_moe_dsa) bridge fully build on the radixark/miles image's megatron-core (0.16.0rc0), for both LoRA and full-FT paths, with no megatron-core source change. Covers GLM-5.1 (DSA) and GLM-5.2 (DSA + cross-layer index sharing + rope 8e6).

Files: src/megatron/bridge/models/glm_moe_dsa/cross_layer_dsa.py (new), src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py.

Commits

e4cd5f37 — GLM-5.1: feature-detect the DSA experimental-attention spec

glm5_bridge.py sets experimental_attention_variant="dsa". On older megatron-core, the dispatcher (get_experimental_attention_variant_module_spec) only wires "gated_delta_net" and raises ValueError for "dsa", and get_dsa_module_spec_for_backend omits the metainfo that the variant layer-builder reads, so the model cannot build. transformer_layer_spec now points at _build_glm5_dsa_block_spec, whose wrapped dispatcher prefers megatron-core's native handling. Only when the native dispatcher raises ValueError for "dsa" (old core) does it back-fill via the shipped builder and set metainfo["fuse_input_layernorm"]=False (MLA-based DSA keeps a separate, non-fused input layernorm, like the deepseek_v4 dsv4 spec). The wrapper is therefore a transparent no-op on newer megatron-core and can be deleted after a core bump. (This was previously a caller-side monkey-patch in miles bridge_lora_helpers.py; it is now consolidated here.)
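The "prefer native, back-fill only on the "dsa" ValueError" pattern can be sketched roughly as follows. This is a minimal illustration, not the code in glm5_bridge.py: the dispatchers and the builder are hypothetical stand-ins that return plain dicts, and only the variant names and the metainfo key come from the description above.

```python
def make_dsa_tolerant_dispatcher(native_dispatch, fallback_builder):
    """Wrap a spec dispatcher: use the native result when it works,
    and back-fill only when the native dispatcher rejects "dsa"."""
    def dispatch(variant):
        try:
            return native_dispatch(variant)
        except ValueError:
            if variant != "dsa":
                raise  # unrelated failure: do not mask it
            spec = fallback_builder()
            # MLA-based DSA keeps a separate, non-fused input layernorm.
            spec.setdefault("metainfo", {})["fuse_input_layernorm"] = False
            return spec
    return dispatch


# Hypothetical stand-ins for an old and a new core (not megatron-core APIs).
def old_core_dispatch(variant):
    if variant == "gated_delta_net":
        return {"name": "gdn", "metainfo": {}}
    raise ValueError(f"unsupported variant: {variant}")


def new_core_dispatch(variant):
    if variant == "dsa":
        return {"name": "dsa-native", "metainfo": {"fuse_input_layernorm": False}}
    return old_core_dispatch(variant)


def shipped_dsa_builder():
    # Stand-in for a builder whose result lacks metainfo on an old core.
    return {"name": "dsa-backfilled"}


on_old_core = make_dsa_tolerant_dispatcher(old_core_dispatch, shipped_dsa_builder)
on_new_core = make_dsa_tolerant_dispatcher(new_core_dispatch, shipped_dsa_builder)
```

Because the fallback runs only inside the except branch, the wrapper never touches the result on a core that already handles "dsa", which is what makes it safe to leave in place until the core is bumped.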

2bdfa05d — GLM-5.2: parse the MLA rope dims (rotary_base + qk_pos_emb_head_dim)

GLM-5.2 sets rope_theta=8e6, and under transformers>=5.12 the parsed GlmMoeDsaConfig reports qk_rope_head_dim as head_dim (192) instead of the config.json value (64). The base config-mapping then sizes MLA linear_kv_down_proj as kv_lora_rank + 192 = 704, contradicting the checkpoint (kv_a_proj_with_mqa = kv_lora_rank + qk_rope_head_dim = 576 = 512 + 64).

  • rotary_base: read rope_theta whether nested in rope_parameters or flat.
  • qk_pos_emb_head_dim: re-read qk_rope_head_dim straight from config.json so the MLA rope/kv dims match the weights. No-op when the parse is already correct; GLM-5.1 unaffected (its head_dim already equals qk_rope_head_dim = 64).
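A small sketch of the two reads and the width arithmetic described above, assuming the HF config is available as a plain dict and the raw config.json is on disk. The helper names are hypothetical; the keys and the numbers (512, 64, 192, 8e6) are the ones quoted in this description.

```python
import json
import os
import tempfile


def read_rope_theta(hf_cfg, default=None):
    """Return rope_theta whether it is nested in rope_parameters or flat."""
    nested = hf_cfg.get("rope_parameters") or {}
    if "rope_theta" in nested:
        return nested["rope_theta"]
    return hf_cfg.get("rope_theta", default)


def read_qk_rope_head_dim(config_json_path, parsed_value):
    """Prefer the raw config.json value; fall back to the parsed one."""
    with open(config_json_path) as f:
        raw = json.load(f)
    return raw.get("qk_rope_head_dim", parsed_value)


def kv_down_proj_width(kv_lora_rank, qk_rope_head_dim):
    """Output width of the MLA kv down-projection."""
    return kv_lora_rank + qk_rope_head_dim


# Demo: the parsed config reports 192, the file on disk says 64.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump({"qk_rope_head_dim": 64, "kv_lora_rank": 512}, f)
    fixed_dim = read_qk_rope_head_dim(path, parsed_value=192)

wrong_width = kv_down_proj_width(512, 192)       # 704, mismatches the checkpoint
right_width = kv_down_proj_width(512, fixed_dim)  # 576, matches kv_a_proj_with_mqa
```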

74dade06 — GLM-5.2: DSA cross-layer index sharing (CrossLayerDSAttention)

GLM-5.2 keeps GLM-5.1's glm_moe_dsa arch, but only "computing" layers carry the lightning indexer and compute the sparse top-k; "skip" layers reuse the most recent computing layer's top-k (HF config index_topk_freq=4, index_skip_topk_offset=3 → computing Megatron-layers 1,2,3,7,11,…). megatron-core's DSA is per-layer only, so this adds a Bridge-owned CrossLayerDSAttention(DSAttention):

  • anchor layers compute + publish topk_indices to a per-microbatch holder (packed_seq_params for thd, thread-local for bshd);
  • skip layers del self.indexer (so the param set matches the subset checkpoint, which only stores indexer weights on computing layers) and reuse the source anchor's top-k via unfused_dsa_fn.
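The anchor/skip split can be illustrated with the sketch below. The layer rule is an assumption: it is one formula that reproduces the computing layers listed above (1, 2, 3, 7, 11 for index_topk_freq=4, index_skip_topk_offset=3) and may not be the exact rule used by the HF model or the bridge. The holder is a bare thread-local stand-in for the bshd case, and all function names are hypothetical.

```python
import threading


def is_computing_layer(layer_number, freq, offset):
    """1-based layer number. Assumed rule, fitted to the listed layers."""
    if freq <= 1:
        return True  # no sharing: every layer computes its own top-k
    return layer_number <= offset or (layer_number - offset) % freq == 0


def source_anchor(layer_number, freq, offset):
    """Most recent computing layer at or before layer_number."""
    n = layer_number
    while n > 1 and not is_computing_layer(n, freq, offset):
        n -= 1
    return n


_holder = threading.local()  # stand-in for the per-microbatch holder


def publish_topk(layer_number, topk_indices):
    """Called by an anchor layer after it computes its sparse top-k."""
    _holder.topk = (layer_number, topk_indices)


def reuse_topk(expected_source):
    """Called by a skip layer; checks it reads the anchor it expects."""
    source, topk = getattr(_holder, "topk", (None, None))
    assert source == expected_source, "holder does not carry the expected anchor"
    return topk


computing = [n for n in range(1, 12) if is_computing_layer(n, freq=4, offset=3)]
```

For example, layer 9 is a skip layer under these settings and `source_anchor(9, 4, 3)` resolves to layer 7, so it would reuse whatever layer 7 published.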

get_glm5_crosslayer_dsa_spec calls megatron-core's exact get_dsa_module_spec_for_backend and swaps only submodules.core_attention.module to the subclass (so the MLA structure — fused qk-layernorm, indexer submodules — is inherited verbatim). Feature-gated in _build_glm5_dsa_block_spec on dsa_index_topk_freq > 1, so GLM-5.1 (no freq → 1) keeps the existing per-layer path byte-for-byte.
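The "swap only core_attention.module" idea and the frequency gate might look like this in outline. The dataclasses below are simplified stand-ins and not megatron-core's spec types; base_dsa_spec, crosslayer_spec and pick_spec are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass
class Spec:  # stand-in for a module spec
    module: type
    submodules: Any = None


@dataclass
class AttnSubmodules:  # stand-in for the attention submodule bundle
    core_attention: Spec
    q_layernorm: str = "fused"


class DSAttention:  # stand-in for the per-layer DSA class
    pass


class CrossLayerDSAttention(DSAttention):  # stand-in for the Bridge subclass
    pass


def base_dsa_spec():
    """Stand-in for the spec returned by the core's own DSA builder."""
    return Spec(module=object,
                submodules=AttnSubmodules(core_attention=Spec(module=DSAttention)))


def crosslayer_spec(base):
    """Copy the base spec and replace only core_attention.module."""
    core = replace(base.submodules.core_attention, module=CrossLayerDSAttention)
    return replace(base, submodules=replace(base.submodules, core_attention=core))


def pick_spec(dsa_index_topk_freq: Optional[int] = None):
    """Feature gate: a missing frequency is treated as 1 (per-layer path)."""
    freq = dsa_index_topk_freq or 1
    base = base_dsa_spec()
    return crosslayer_spec(base) if freq > 1 else base
```

Using `dataclasses.replace` leaves every other field of the base spec untouched, which mirrors the intent of inheriting the rest of the MLA structure unchanged.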

Verified (via radixark/miles, 4×H200, bridge mode, TP4/EP4, bshd, mbs1)

  • GLM-5.2 7-layer train-only (jybsuper/GLM-5.2-7layer): build CrossLayerDSAttention → load the 98 GB subset checkpoint cleanly (proves the indexer is built only on computing layers 1,2,3,7) → cross-layer fwd/bwd (skip layers reuse anchor top-k, no holder assert) → save LoRA adapter → TRAIN EXIT 0.
  • GLM-5.1 6-layer full e2e (rollout → train → save): TRAIN EXIT 0 — the feature-gate leaves the GLM-5.1 path unchanged (regression check).

Notes

  • No megatron-core source change; everything is on the Bridge side and self-disables on a newer megatron-core.
  • sglang does not yet serve the GLM-5.2 cross-layer rollout, so the GLM-5.2 validation is training-side (train-only); GLM-5.1 runs the full rollout→train loop.

🤖 Generated with Claude Code

Older megatron-core (e.g. the radixark/miles image's 0.16.0rc0) only wires
"gated_delta_net" in get_experimental_attention_variant_module_spec and raises
ValueError for "dsa", and its get_dsa_module_spec_for_backend omits the metainfo
the variant layer-builder reads. The GLM-5/5.1 bridge sets
experimental_attention_variant="dsa" + transformer_layer_spec to that builder, so
the model fails to build on such a core (LoRA and full-FT bridge paths alike).

Wrap transformer_layer_spec in _build_glm5_dsa_block_spec: PREFER megatron-core's
native handling, and only when it raises for "dsa" back-fill via the shipped DSA
builder + set metainfo["fuse_input_layernorm"]=False (MLA-based DSA keeps a
separate, non-fused input layernorm, like the deepseek_v4 dsv4 spec). On newer
megatron-core (which handles "dsa" natively + sets metainfo) this is a transparent
no-op, so the helper self-disables and can be deleted once the runtime core is bumped.

Same spirit as the other miles-compat backports (mimo.config.role, training.config,
parse_hybrid_pattern). Verified e2e: GLM-5.1 6-layer GRPO LoRA via bridge with no
caller-side patch -> Job succeeded + PEFT adapter saved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yusheng Su <yushengsu.thu@gmail.com>
yushengsu-thu and others added 2 commits June 19, 2026 15:58
…dim)

GLM-5.2 sets rope_theta=8e6, and under transformers>=5.12 the parsed
GlmMoeDsaConfig reports qk_rope_head_dim as head_dim (192) instead of the
config.json value (64). The base config-mapping then sized MLA
linear_kv_down_proj as kv_lora_rank + 192 = 704, contradicting the checkpoint
(kv_a_proj_with_mqa = kv_lora_rank + qk_rope_head_dim = 576 = 512 + 64).

- rotary_base: read rope_theta whether nested in rope_parameters or flat.
- qk_pos_emb_head_dim: re-read qk_rope_head_dim straight from config.json so
  MLA rope/kv dims match the weights. No-op when the parse is already correct;
  GLM-5.1 is unaffected (its head_dim already equals qk_rope_head_dim = 64).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yusheng Su <yushengsu.thu@gmail.com>
GLM-5.2 keeps GLM-5.1's glm_moe_dsa arch but only "computing"/anchor layers
carry the lightning indexer and compute the sparse top-k; "skip" layers reuse
the most recent computing layer's top-k (HF config index_topk_freq>1 +
index_skip_topk_offset). megatron-core's DSA is per-layer only, so this adds a
Bridge-owned CrossLayerDSAttention(DSAttention): anchors publish topk_indices to
a per-microbatch holder (packed_seq_params for thd, thread-local for bshd), skip
layers drop their indexer (matching the subset checkpoint) and reuse the source
anchor's top-k. get_glm5_crosslayer_dsa_spec calls megatron-core's exact
get_dsa_module_spec_for_backend and only swaps core_attention.module.

Feature-gated in _build_glm5_dsa_block_spec on dsa_index_topk_freq>1, so GLM-5.1
(no freq -> 1) keeps the existing per-layer path unchanged. No megatron-core
edits. Validated: GLM-5.2 7-layer train-only e2e (build + 98G subset-ckpt load +
cross-layer fwd/bwd + LoRA adapter) and GLM-5.1 6-layer full e2e regression both
reach TRAIN EXIT 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yusheng Su <yushengsu.thu@gmail.com>
@yushengsu-thu yushengsu-thu changed the title backport: feature-detect DSA experimental-attention spec for GLM-5/5.1 backport: GLM-5.1/5.2 glm_moe_dsa bridge — DSA spec feature-detect + GLM-5.2 cross-layer index sharing Jun 20, 2026
@yushengsu-thu yushengsu-thu changed the title backport: GLM-5.1/5.2 glm_moe_dsa bridge — DSA spec feature-detect + GLM-5.2 cross-layer index sharing backport: feature-detect DSA experimental-attention spec for GLM-5/5.1/5.2 Jun 20, 2026