Skip to content

Zty dev qwen35#3

Open
benenzhu wants to merge 44 commits into
mainfrom
zty_dev_qwen35
Open

Zty dev qwen35#3
benenzhu wants to merge 44 commits into
mainfrom
zty_dev_qwen35

Conversation

@benenzhu
Copy link
Copy Markdown

What

Why

How

Testing

  • pre-commit run --all-files passes
  • Tests pass (pytest tests/)
  • New tests added (if applicable)
  • Documentation updated (if applicable)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • CI/CD or build changes

Screenshots / Logs

Yangruipis and others added 30 commits April 26, 2026 13:58
# 📝 Documentation

## Drop references to non-existent 4B launch script

- Remove "4B Model (8 GPUs)" quick-start section from en/zh docs
- Swap model download snippet to `Qwen3-VL-30B-A3B-Thinking`
- Update Ray Job example to invoke `run_deepeyes.sh`
- Drop `run_deepeyes_4b.sh` entry from file-structure listing
# ⭐ Feature

## Add Qwen3.5-9B ROCm smoke path

- Add one-command Qwen3.5-9B DAPO-Math smoke runner that restarts Ray from a clean state.
- Align Qwen3.5 runtime flags with the validated Qwen3-4B ROCm path: TE auto attention, BSHD, and disabled Dynamo/JIT fuser.
- Reduce the default Qwen3.5 run to a smaller smoke profile to avoid filling SGLang KV/logits memory during initial validation.

Made-with: Cursor

---

# 🐛 Bug Fix

## Install ROCm training dependencies in Dockerfile

- Install ROCm TransformerEngine 2.4.0 wheels for Megatron TE specs.
- Install flash-linear-attention for Qwen3.5 GatedDeltaNet initialization.

---

# 📝 Documentation

## Record Qwen3.5 ROCm bring-up results

- Document FLA import resolution, SGLang rollout device ordinal fix, and the full-profile rollout OOM finding.
- Record that the smoke reached all service registration and step-0 training/logprob flow.

---

# 🎨 Style

## Apply pre-commit formatting

- Apply import ordering, doc formatting, script copyright, and formatting fixes from pre-commit.
Resolve AMD runner conflicts after the Qwen3-4B ROCm PR landed on main.

Keep the Qwen3.5 smoke runner changes while preserving the main branch's Qwen3-4B baseline, and avoid hardcoded private IP defaults in local smoke wrappers.

Made-with: Cursor
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
# 🐛 Bug Fix

## Avoid debug logging failures during tiny smoke runs

- Skip rollout metric logging when the transient training debug buffer does not include response length metadata yet.
- Keep training from failing before the rollout/logprob path has completed.

Made-with: Cursor

---

# ⭐ Feature

## Tighten Qwen3.5-9B smoke defaults

- Use a smaller global batch for the reduced smoke profile while preserving GRPO grouping constraints.
- Record the latest validation findings in the AMD runbook.
# 🐛 Bug Fix

## Correct judge prompt spelling in DeepEyes reward

- replace `Judgement` and `Judement` with `Judgment` in few-shot examples and prompt text
- update the prompt suffix to request `Judgment:` consistently
- keep response parsing backward compatible with both `Judgment:` and `Judgement:` labels
    # ⭐ Feature

    ## Add unified device abstraction layer (`relax/utils/device.py`)

    - Introduce `AcceleratorType` enum: CUDA, NPU, XPU, PPU, ROCM, CPU
    - Auto-detect hardware via `_detect_accelerator()` with priority-based probing
    - Support `RELAX_DEVICE_TYPE` env var override for debugging
    - Provide 25+ thin-wrapper APIs: `current_device()`, `set_device()`, `synchronize()`,
      `empty_cache()`, `Stream()`, `Event()`, `stream_context()`, `is_initialized()`, etc.
    - Map distributed backends: CUDA→nccl, NPU→hccl, XPU→xccl, PPU→eccl
    - Map Ray resource names: CUDA/ROCm→GPU, NPU→NPU, XPU→XPU
    - Map visible-devices env vars per accelerator type
    - Abstract NUMA affinity with graceful degradation for non-CUDA backends

    ---

    # ♻️ Refactor

    ## Replace hardcoded `torch.cuda.*` calls across 20+ files

    - Replace `torch.cuda.current_device()` → `device_utils.current_device()`
    - Replace `torch.cuda.set_device()` → `device_utils.set_device()`
    - Replace `torch.cuda.synchronize()` → `device_utils.synchronize()`
    - Replace `torch.cuda.empty_cache()` → `device_utils.empty_cache()`
    - Replace `torch.cuda.Stream/Event` → `device_utils.Stream()/Event()`
    - Replace `torch.cuda.mem_get_info()` → `device_utils.mem_get_info()`
    - Replace `torch.cuda.device_count()` → `device_utils.device_count()`
    - Replace `torch.device("cuda:...")` → `device_utils.make_current_torch_device()`
    - Replace `device="cuda"` → `device=device_utils.get_device_name()`
    - Replace `"nccl"` backend → `device_utils.get_dist_backend()`
    - Replace `"GPU"` Ray resource → `device_utils.get_ray_accelerator_name()`
    - Replace `CUDA_VISIBLE_DEVICES` → `device_utils.get_visible_devices_env_var()`
    - Wrap CUDA-specific memory profiling APIs with `hasattr` guards
    - Add CUDA-only annotation to `int4_qat/setup.py` kernel build script
# ⭐ Feature

## Add Qwen3.5-9B single-node fully-async training script

- Add `run_deepeyes_qwen35_9B_async.sh` for 8xGPU fully-async DeepEyes training
- Resource layout: actor(4) + rollout(2) + reference(1) + actor_fwd(1)
- Use `--use-dynamic-batch-size` and `--no-rope-fusion` per latest Qwen3.5 conventions

---

# 🐛 Bug Fix

## Add 0-1000 normalized bbox coordinate conversion

- Qwen-VL/Qwen2-VL/Qwen3-VL output 0-1000 normalized coords but `_maybe_resize_bbox` treated them as absolute pixels
- Add coordinate conversion step before clamping in `_maybe_resize_bbox`
- Add `normalize_bbox` parameter to `DeepeyesEnv` (default True) for model-specific control
- Qwen2.5-VL users can set `normalize_bbox: false` since it outputs absolute pixel coords
- Wire `normalize_bbox` through `build_env` from custom config
…sion

# 🐛 Bug Fix

## Fix IndexError on 1D tensor transpose in fully-async weight sync

- Qwen3.5 Bridge outputs expert gate_up_proj as 2D [2*H, D] (cat, no transpose),
  unlike Qwen3-VL which outputs 3D [2, D_out, D_in] (stack + transpose)
- Qwen3.5 ExpertMLPDownProjMapping inherits AutoMapping (no transpose),
  unlike Qwen3-VL which overrides megatron_to_hf with transpose
- Add ndim-based branching in _convert_to_hf_bridge post-processing:
  3D → Qwen3-VL path (undo transpose + index), 2D → Qwen3.5 path (chunk)
- Detect bridge_expert_transposes_down at init time via __dict__ introspection
  to decide whether down_proj needs un-transpose

---

# ✅ Tests

## Add Qwen3.5 Bridge expert weight conversion tests

- Add TestQwen35BridgeMappingOutput: verify 2D cat output, no-transpose
  down_proj, and megatron_to_hf override detection
- Add TestQwen35PostProcessingCorrectness: end-to-end gate_up split and
  down_proj passthrough correctness
- Update _apply_expert_postprocessing helper to accept bridge_expert_transposes_down param
@benenzhu
Copy link
Copy Markdown
Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly improves AMD/ROCm support for Qwen3.5-9B, introducing dynamic GPU selection, TensorBoard integration, and a comprehensive validation plan. It adds a suite of diagnostic tools for performance analysis and network monitoring, alongside architectural refinements like actor warmup and optimized process group reuse for weight synchronization. Additionally, it includes detailed reading notes and text extractions for the Relax paper. Review feedback identifies a hardcoded path in the runner script, fragile log filtering logic in a.sh, commented-out installation steps in the Dockerfile, and a regression in port probing accuracy due to the removal of the SO_REUSEADDR socket option.

Comment thread amd/run_qwen35-9b.sh

echo "=== Building bindlog LD_PRELOAD hook inside zty_relax ==="
bash -lc \
"cd /B/Relax && gcc -shared -fPIC -O2 -Wall -Wextra -o tools/libbindlog.so tools/bindlog.c -ldl -pthread"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The path /B/Relax is hardcoded and specific to a local environment. This will cause the script to fail on other systems. Use the ${REPO_ROOT} variable instead.

Suggested change
"cd /B/Relax && gcc -shared -fPIC -O2 -Wall -Wextra -o tools/libbindlog.so tools/bindlog.c -ldl -pthread"
"cd ${REPO_ROOT} && gcc -shared -fPIC -O2 -Wall -Wextra -o tools/libbindlog.so tools/bindlog.c -ldl -pthread"

Comment thread a.sh Outdated
@@ -0,0 +1 @@
grep -En "Finish rollout [0-9]+/[0-9]+|Start rollout [0-9]+/[0-9]+|rollout_id: [0-9]+|loss|grad_norm|Traceback|RayTaskError|DistBackendError|Actor training failed|detected as unhealthy|GLOBAL restart|triggering restart|The actor died|Call to bind failed|Address already in use|Exception|ERROR|perf|Starting rollout step" $1 |grep -v "relax.backends.megatron.actor:629" |grep -v "NestedTensor" |grep -v "relax.backends.megatron.actor:666" |grep -v "relax.components.advantages:83" |grep -v "relax.backends.megatron.actor:688" |grep -v "relax.utils.utils:449" |grep -vi "middleware.rs:349" |grep -vi "Successfully got rollout_id" |grep -vi "start to get rollout_id" |grep -vi "for t in rollout_data" |grep -vi "rollout sample"|grep -vi "Override" |grep -vi "transfer_queue.utils.perf_utils"|grep -vi "Finish rollout:"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Filtering logs based on hardcoded line numbers (e.g., relax.backends.megatron.actor:629) is extremely fragile. If the source file is modified, these line numbers will change, causing the filter to fail or suppress the wrong logs. It is recommended to filter based on unique log message patterns instead.

Comment thread docker/Dockerfile.rocm
Comment on lines +155 to +157
# WORKDIR /root/
# COPY . .
# RUN pip install -e . --no-deps
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These lines are commented out, which prevents the relax package from being installed in the Docker image. As noted in the TODO on line 154, these should be uncommented once testing is complete to ensure the image is functional.

Comment on lines 577 to 578
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", port))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The removal of SO_REUSEADDR during port probing can lead to false negatives if a port is in the TIME_WAIT state. It is generally better to keep this option enabled to ensure the probe accurately reflects port availability for new connections.

                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                    sock.bind(("", port))

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for AMD ROCm and Qwen3.5-9B, featuring dynamic GPU selection, process group reuse for weight synchronization, and a suite of performance analysis and debugging tools. It also includes detailed paper reading notes and validation plans. Feedback identifies a hardcoded path in the runner script, the accidental commenting out of the package installation in the Dockerfile, fragile sed commands in the build process, and the removal of SO_REUSEADDR which may impact port binding reliability.

Comment thread amd/run_qwen35-9b.sh

echo "=== Building bindlog LD_PRELOAD hook inside zty_relax ==="
bash -lc \
"cd /B/Relax && gcc -shared -fPIC -O2 -Wall -Wextra -o tools/libbindlog.so tools/bindlog.c -ldl -pthread"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The path /B/Relax is hardcoded, which will cause the script to fail in any environment where the repository is cloned to a different location. Use the ${REPO_ROOT} variable instead to ensure portability.

Suggested change
"cd /B/Relax && gcc -shared -fPIC -O2 -Wall -Wextra -o tools/libbindlog.so tools/bindlog.c -ldl -pthread"
"cd ${REPO_ROOT} && gcc -shared -fPIC -O2 -Wall -Wextra -o tools/libbindlog.so tools/bindlog.c -ldl -pthread"

Comment thread docker/Dockerfile.rocm
Comment on lines +154 to +157
# TODO(zty): remove this comment after testing
# WORKDIR /root/
# COPY . .
# RUN pip install -e . --no-deps
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The installation of the relax package is currently commented out. This will result in a Docker image that lacks the application code, causing the training entrypoint to fail at runtime. These lines should be uncommented and the working directory set correctly.

WORKDIR /root/Relax
COPY . .
RUN pip install -e . --no-deps

Comment thread docker/Dockerfile.rocm
cd aiter && \
git checkout "${AITER_COMMIT}" && \
git submodule update --init --recursive && \
sed -i '459 s/if.*:/if False:/' aiter/ops/triton/attention/pa_mqa_logits.py && \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a hardcoded line number (459) in sed is extremely fragile. If the upstream aiter repository is updated and the file content shifts, this command will either fail or modify the wrong line. It is recommended to use a pattern-based match to identify the target line.

Comment on lines 577 to 578
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", port))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The SO_REUSEADDR socket option was removed in this version. This can lead to "Address already in use" errors if the port is in a TIME_WAIT state from a previous execution, making the free port discovery process less reliable. It should be restored to match the previous behavior.

                with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                    sock.bind(("", port))

benenzhu and others added 14 commits April 30, 2026 05:12
# ⭐ Feature

## Add PrefetchBuffer for streaming multimodal datasets

- Implement `PrefetchBuffer` with `set_index_order` pattern (inspired by AReaL)
- Background thread uses `ThreadPoolExecutor` for parallel video decoding (PyAV releases GIL)
- Flow-controlled cache with `max_cached` bound and `_space_available` Event
- Add `--prefetch-chunk-size` and `--prefetch-max-cached` CLI arguments
- Wire arguments through `data_source.py` into `StreamingDataset`

## Add SGLang engine profiling support

- Add 7 profiling CLI arguments (`--sglang-profile`, `--sglang-profile-output-dir`, etc.)
- Implement `_start_sglang_profile` / `_stop_sglang_profile` in sglang_rollout
- Support `num_steps` auto-stop, per-stage profiling, stack and shape recording

## Add pipeline batch encoding for multimodal generate

- Add `--mm-encode-batch-groups-size` argument for encode-generate overlap
- Split samples into batches: encode batch N, send to SGLang, encode batch N+1 concurrently
- Implement `_encode_multimodal_inputs` for base64 encoding of images/videos/audio

---

# 📝 Documentation

## Sync new parameters to configuration docs

- Add prefetch parameters to Dataset section (en/zh)
- Add `--mm-encode-batch-groups-size` to Multimodal Data section (en/zh)
- Add 7 SGLang profiling parameters to SGLang Engine Parameters section (en/zh)
# 🐛 Bug Fix

## Fix torch.cuda patch broken by device abstraction refactor

- The device abstraction commit (632b29c5) replaced hardcoded
  `torch.cuda.get_device_properties` / `torch.cuda.get_device_capability`
  patch targets with `torch.{device_utils.get_device_name()}.*`
- When no accelerator is available, `get_device_name()` returns `"cpu"`,
  so patches targeted `torch.cpu.*` instead of `torch.cuda.*`
- Megatron validate_args internally always calls `torch.cuda.*`,
  so the patches must target `torch.cuda` regardless of device abstraction
- Restore hardcoded `torch.cuda.*` patch targets with explanatory comment
# 🔩 Chore

## Remove dead helpers from data module

- Delete the unreferenced `filter_long_prompt` helper from `relax/utils/data/data.py`
- Delete the dead `_build_messages` helper that was shadowed by `relax/utils/data/data_utils.py`
- Delete the unused `process_rollout_data` helper and the imports it required
# 🐛 Bug Fix

## Keep multimodal prompt building non-destructive

- Build multimodal message content without mutating cached prompt rows
- Prevent reused raw samples from carrying expanded message content into later reads

## Support sliced eager dataset paths

- Parse per-file generalized slice syntax in eager file readers
- Keep multi-file eager path behavior aligned with streaming path semantics
The rollout component exits its main loop on the final training step, leaving the eval handler un-awaited. This caused a race condition where the controller's atexit shutdown tore down SGLang engines mid-flight. This fix blocks until the evaluation finishes at the end of training.
# ⭐ Feature

## Migrate from Megatron-LM to Megatron-Bridge

- Replace direct Megatron-LM checkout with Megatron-Bridge (commit 2faedbf6) in Dockerfile
- Upgrade transformer_engine from 2.10.0 to 2.14.1
- Archive old megatron patch (3714d81d) and add new patch for 20260506-85bced0ae

## Adapt Relax backend to Megatron-Bridge API changes

- Update vocab_size_with_padding import with fallback for new module path
- Rename enable_gloo_process_groups to use_gloo_process_groups
- Rename norm_epsilon to layernorm_epsilon in HF config validation
- Accept **kwargs in wrapped_provider for new model_provider signature
- Relax partition_stride assertion for GLU/SwiGLU linear_fc1 layers (stride=2)
- Guard checkpoint_write_patch against removed write_preloaded_data_multiproc
# ⭐ Feature

## Add Qwen3.6 model support with automatic expert format detection

- Add Qwen3.6-35B-A3B model configuration script with MoE parameters (256 experts, 8-way routing)
- Implement MTP MoE expert weight format detection in Qwen35VL bridge
  - Qwen3.5: per-expert storage (gate_proj/up_proj/down_proj per expert)
  - Qwen3.6: packed format (gate_up_proj/down_proj shared tensor)
- Add training script for Qwen3.6-35B-A3B 8xGPU colocate mode with multimodal support
- Extend Megatron bridge patch with format-aware weight mappings

---

# 🐛 Bug Fix

## Fix multimodal data counting and training script paths

- Fix remain_data counter for pre-structured multimodal content (was skipping already-processed items)
- Remove invalid dataset slice notation (@[0:1000]) from PROMPT_SET path in training script
# ⭐ Feature

## Add rollout reward field metrics

- Aggregate numeric fields from reward dictionaries during rollout logging
- Skip the primary reward key and raw_reward to preserve existing reward metrics
- Reuse the shared helper from the SGLang rollout metrics path
# 🐛 Bug Fix

## Strip consecutive `<|image_pad|>` tokens in pre-tokenized prompts

- Add `QwenVLImageProcessor._strip_image_token` static helper that collapses
  runs of `<|image_pad|>` (token id 151655) into a single placeholder while
  leaving the surrounding `<|vision_start|>`/`<|vision_end|>` markers intact.
- Apply the helper to `prompt` before calling `load_mm_data` in
  `process_mm_data_async`, so pre-tokenized `input_ids` (where each image is
  already expanded to N image-pad tokens, one per visual patch) no longer
  collide with `load_mm_data` re-expanding the placeholder itself. Without
  the collapse, the pipeline saw `N x M` image-pad tokens and miscounted
  positions, breaking mrope bookkeeping.
- Raw text (`str`) prompts are passed through unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants