backport: feature-detect DSA experimental-attention spec for GLM-5/5.1/5.2 #13
Open
yushengsu-thu wants to merge 3 commits into
Older megatron-core (e.g. the radixark/miles image's 0.16.0rc0) only wires "gated_delta_net" in get_experimental_attention_variant_module_spec and raises ValueError for "dsa", and its get_dsa_module_spec_for_backend omits the metainfo the variant layer-builder reads. The GLM-5/5.1 bridge sets experimental_attention_variant="dsa" and points transformer_layer_spec at that builder, so the model fails to build on such a core (LoRA and full-FT bridge paths alike).

Wrap transformer_layer_spec in _build_glm5_dsa_block_spec: PREFER megatron-core's native handling, and only when it raises for "dsa" back-fill via the shipped DSA builder and set metainfo["fuse_input_layernorm"]=False (MLA-based DSA keeps a separate, non-fused input layernorm, like the deepseek_v4 dsv4 spec). On newer megatron-core (which handles "dsa" natively and sets metainfo) this is a transparent no-op, so the helper self-disables and can be deleted once the runtime core is bumped. Same spirit as the other miles-compat backports (mimo.config.role, training.config, parse_hybrid_pattern).

Verified e2e: GLM-5.1 6-layer GRPO LoRA via bridge with no caller-side patch -> Job succeeded + PEFT adapter saved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yusheng Su <yushengsu.thu@gmail.com>
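The prefer-native, back-fill-on-ValueError pattern described in this commit can be sketched as follows. This is a minimal illustration, not the code in `glm5_bridge.py`: the dict-based specs and the names `old_core_dispatch`, `new_core_dispatch`, `shipped_dsa_builder`, and `prefer_native` are hypothetical stand-ins for the megatron-core dispatcher and DSA builder.

```python
from typing import Any, Callable, Dict

Spec = Dict[str, Any]  # hypothetical stand-in for a megatron-core module spec


def old_core_dispatch(variant: str) -> Spec:
    """Stand-in for an older dispatcher that only wires "gated_delta_net"."""
    if variant == "gated_delta_net":
        return {"variant": variant, "source": "native", "metainfo": {}}
    raise ValueError(f"unsupported experimental attention variant: {variant}")


def new_core_dispatch(variant: str) -> Spec:
    """Stand-in for a newer dispatcher that handles "dsa" and sets metainfo."""
    if variant == "dsa":
        return {
            "variant": "dsa",
            "source": "native",
            "metainfo": {"fuse_input_layernorm": False},
        }
    return old_core_dispatch(variant)


def shipped_dsa_builder() -> Spec:
    """Stand-in for the shipped DSA builder on an old core: no metainfo."""
    return {"variant": "dsa", "source": "backfill"}


def prefer_native(native: Callable[[str], Spec]) -> Callable[[str], Spec]:
    """Wrap a dispatcher: try it first, back-fill only for the "dsa" ValueError."""

    def dispatch(variant: str) -> Spec:
        try:
            return native(variant)
        except ValueError:
            if variant != "dsa":
                raise  # unrelated failures are not swallowed
            spec = shipped_dsa_builder()
            # MLA-based DSA keeps a separate, non-fused input layernorm.
            spec.setdefault("metainfo", {})["fuse_input_layernorm"] = False
            return spec

    return dispatch


on_old = prefer_native(old_core_dispatch)("dsa")
on_new = prefer_native(new_core_dispatch)("dsa")
```

With the old stand-in the wrapper returns the back-filled spec; with the new stand-in the native spec passes through untouched, which is why the helper can be removed once the runtime core is bumped.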
…dim)

GLM-5.2 sets rope_theta=8e6, and under transformers>=5.12 the parsed GlmMoeDsaConfig reports qk_rope_head_dim as head_dim (192) instead of the config.json value (64). The base config-mapping then sized MLA linear_kv_down_proj as kv_lora_rank + 192 = 704, contradicting the checkpoint (kv_a_proj_with_mqa = kv_lora_rank + qk_rope_head_dim = 576 = 512 + 64).

- rotary_base: read rope_theta whether nested in rope_parameters or flat.
- qk_pos_emb_head_dim: re-read qk_rope_head_dim straight from config.json so MLA rope/kv dims match the weights.

No-op when the parse is already correct; GLM-5.1 is unaffected (its head_dim already equals qk_rope_head_dim = 64).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yusheng Su <yushengsu.thu@gmail.com>
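The two fixes and the width arithmetic in this commit can be illustrated with a small sketch. It assumes a `config.json` that carries only the keys named above; the helper names `read_rope_theta` and `kv_down_proj_width` are hypothetical and are not the bridge's actual functions.

```python
import json
from typing import Any, Dict, Optional


def read_rope_theta(cfg: Dict[str, Any]) -> Optional[float]:
    """Return rope_theta whether it is nested in rope_parameters or flat."""
    nested = cfg.get("rope_parameters")
    if isinstance(nested, dict) and "rope_theta" in nested:
        return nested["rope_theta"]
    return cfg.get("rope_theta")


def kv_down_proj_width(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """Output width of the MLA kv down-projection: latent rank plus rope dims."""
    return kv_lora_rank + qk_rope_head_dim


# Assumed minimal config.json contents (illustrative only).
raw_cfg = json.loads(
    '{"kv_lora_rank": 512, "qk_rope_head_dim": 64,'
    ' "rope_parameters": {"rope_theta": 8e6}}'
)
flat_cfg = {"rope_theta": 8e6}

# Value the parsed config is described as reporting (head_dim instead of 64).
parsed_qk_rope_head_dim = 192

wrong_width = kv_down_proj_width(raw_cfg["kv_lora_rank"], parsed_qk_rope_head_dim)
right_width = kv_down_proj_width(raw_cfg["kv_lora_rank"], raw_cfg["qk_rope_head_dim"])
rotary_base = read_rope_theta(raw_cfg)
```

Here `wrong_width` reproduces the mismatched 704 and `right_width` the checkpoint's 576, while `read_rope_theta` returns the same value for the nested and the flat layout.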
GLM-5.2 keeps GLM-5.1's glm_moe_dsa arch but only "computing"/anchor layers carry the lightning indexer and compute the sparse top-k; "skip" layers reuse the most recent computing layer's top-k (HF config index_topk_freq>1 + index_skip_topk_offset).

megatron-core's DSA is per-layer only, so this adds a Bridge-owned CrossLayerDSAttention(DSAttention): anchors publish topk_indices to a per-microbatch holder (packed_seq_params for thd, thread-local for bshd), skip layers drop their indexer (matching the subset checkpoint) and reuse the source anchor's top-k. get_glm5_crosslayer_dsa_spec calls megatron-core's exact get_dsa_module_spec_for_backend and only swaps core_attention.module.

Feature-gated in _build_glm5_dsa_block_spec on dsa_index_topk_freq>1, so GLM-5.1 (no freq -> 1) keeps the existing per-layer path unchanged. No megatron-core edits.

Validated: GLM-5.2 7-layer train-only e2e (build + 98G subset-ckpt load + cross-layer fwd/bwd + LoRA adapter) and GLM-5.1 6-layer full e2e regression both reach TRAIN EXIT 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yusheng Su <yushengsu.thu@gmail.com>
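The publish-and-reuse flow between anchor and skip layers can be sketched with plain Python objects. This is a toy model under stated assumptions, not `CrossLayerDSAttention` itself: `TopKHolder`, `source_anchor_per_layer`, and `toy_topk` are hypothetical names, the top-k values are fake integers rather than tensors, and the computing layers (1, 2, 3, 7 of 7) are taken as given rather than derived from the HF config fields.

```python
from typing import Dict, List, Optional, Sequence


def source_anchor_per_layer(
    num_layers: int, computing_layers: Sequence[int]
) -> Dict[int, int]:
    """Map each 1-indexed layer to the most recent computing (anchor) layer."""
    computing = set(computing_layers)
    mapping: Dict[int, int] = {}
    last_anchor: Optional[int] = None
    for layer in range(1, num_layers + 1):
        if layer in computing:
            last_anchor = layer
        if last_anchor is None:
            raise ValueError(f"layer {layer} has no earlier computing layer")
        mapping[layer] = last_anchor
    return mapping


class TopKHolder:
    """Per-microbatch holder: anchors publish indices, skip layers fetch them."""

    def __init__(self) -> None:
        self._by_anchor: Dict[int, List[int]] = {}

    def publish(self, anchor: int, topk_indices: List[int]) -> None:
        self._by_anchor[anchor] = topk_indices

    def fetch(self, anchor: int) -> List[int]:
        if anchor not in self._by_anchor:
            raise KeyError(f"anchor layer {anchor} has not published top-k yet")
        return self._by_anchor[anchor]


def toy_topk(layer: int) -> List[int]:
    """Fake indexer output so the example is traceable; not a real top-k."""
    return [layer * 10, layer * 10 + 1]


computing_layers = [1, 2, 3, 7]
anchors = source_anchor_per_layer(7, computing_layers)

holder = TopKHolder()  # one holder per microbatch
used: Dict[int, List[int]] = {}
for layer in range(1, 8):
    if anchors[layer] == layer:
        holder.publish(layer, toy_topk(layer))  # anchor: compute and publish
    used[layer] = holder.fetch(anchors[layer])  # skip: reuse the anchor's top-k
```

In this toy run layers 4 to 6 read the indices published by layer 3, and layer 7 publishes its own. Creating a fresh holder per microbatch keeps indices from one microbatch from leaking into the next.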
What

Make the radixark/Megatron-Bridge GLM-5 / 5.1 / 5.2 (`glm_moe_dsa`) bridge fully build on the radixark/miles image's megatron-core (0.16.0rc0), for both LoRA and full-FT paths, with no megatron-core source change. Covers GLM-5.1 (DSA) and GLM-5.2 (DSA + cross-layer index sharing + rope 8e6).

Files: `src/megatron/bridge/models/glm_moe_dsa/cross_layer_dsa.py` (new), `src/megatron/bridge/models/glm_moe_dsa/glm5_bridge.py`.

Commits

e4cd5f37 - GLM-5.1: feature-detect the DSA experimental-attention spec

`glm5_bridge.py` sets `experimental_attention_variant="dsa"`. On older megatron-core the dispatcher (`get_experimental_attention_variant_module_spec`) only wires `"gated_delta_net"` and raises `ValueError` for `"dsa"`, and `get_dsa_module_spec_for_backend` omits the `metainfo` the variant layer-builder reads → the model can't build.

`transformer_layer_spec` now points at `_build_glm5_dsa_block_spec`, whose wrapped dispatcher prefers megatron-core's native handling and only on the `"dsa"` `ValueError` (old core) back-fills via the shipped builder and sets `metainfo["fuse_input_layernorm"]=False` (MLA-based DSA keeps a separate, non-fused input layernorm, like the `deepseek_v4` `dsv4` spec). ⇒ transparent no-op on newer megatron-core; deletable after a core bump. (Previously a caller-side monkey-patch in miles `bridge_lora_helpers.py`, now consolidated here.)

2bdfa05d - GLM-5.2: parse the MLA rope dims (`rotary_base` + `qk_pos_emb_head_dim`)

GLM-5.2 sets `rope_theta=8e6`, and under `transformers>=5.12` the parsed `GlmMoeDsaConfig` reports `qk_rope_head_dim` as `head_dim` (192) instead of the `config.json` value (64). The base config-mapping then sizes MLA `linear_kv_down_proj` as `kv_lora_rank + 192 = 704`, contradicting the checkpoint (`kv_a_proj_with_mqa = kv_lora_rank + qk_rope_head_dim = 576 = 512 + 64`).

- `rotary_base`: read `rope_theta` whether nested in `rope_parameters` or flat.
- `qk_pos_emb_head_dim`: re-read `qk_rope_head_dim` straight from `config.json` so the MLA rope/kv dims match the weights.

No-op when the parse is already correct; GLM-5.1 unaffected (its `head_dim` already equals `qk_rope_head_dim` = 64).

74dade06 - GLM-5.2: DSA cross-layer index sharing (`CrossLayerDSAttention`)

GLM-5.2 keeps GLM-5.1's `glm_moe_dsa` arch, but only "computing" layers carry the lightning indexer and compute the sparse top-k; "skip" layers reuse the most recent computing layer's top-k (HF config `index_topk_freq=4`, `index_skip_topk_offset=3` → computing Megatron-layers 1,2,3,7,11,…). megatron-core's DSA is per-layer only, so this adds a Bridge-owned `CrossLayerDSAttention(DSAttention)`:

- computing (anchor) layers publish `topk_indices` to a per-microbatch holder (`packed_seq_params` for `thd`, thread-local for `bshd`);
- skip layers `del self.indexer` (so the param set matches the subset checkpoint, which only stores indexer weights on computing layers) and reuse the source anchor's top-k via `unfused_dsa_fn`.

`get_glm5_crosslayer_dsa_spec` calls megatron-core's exact `get_dsa_module_spec_for_backend` and swaps only `submodules.core_attention.module` to the subclass, so the MLA structure (fused qk-layernorm, indexer submodules) is inherited verbatim. Feature-gated in `_build_glm5_dsa_block_spec` on `dsa_index_topk_freq > 1`, so GLM-5.1 (no freq → 1) keeps the existing per-layer path byte-for-byte unchanged.

Verified (via radixark/miles, 4×H200, bridge mode, TP4/EP4, bshd, mbs1)

- GLM-5.2 7-layer (`jybsuper/GLM-5.2-7layer`): build `CrossLayerDSAttention` → load the 98 GB subset checkpoint cleanly (proves the indexer is built only on computing layers 1,2,3,7) → cross-layer fwd/bwd (skip layers reuse anchor top-k, no holder assert) → save LoRA adapter → `TRAIN EXIT 0`.
- GLM-5.1 6-layer full e2e: `TRAIN EXIT 0`; the feature gate leaves the GLM-5.1 path unchanged (regression check).

Notes
🤖 Generated with Claude Code