Skip to content

Conversation

@YanhuiDua
Copy link
Collaborator

@YanhuiDua YanhuiDua commented Nov 13, 2025

增加http server异常处理逻辑:

  1. timeout error / request error: retry
  2. client error: 跳过这条数据
  3. server error: 由controller控制计数每个worker出错的次数,超过global_batch_size * 0.1, 将该worker标记为deactivate
  4. unknown error: raise RuntimeError

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors HTTP request error handling in the rollout system to provide more granular error management and improve system reliability. The refactoring introduces a new error classification system that categorizes HTTP errors and handles them according to different strategies:

  • Timeout/Request errors: Retry the request
  • Client errors (4xx): Skip the data item
  • Server errors (5xx): Track failures per worker and deactivate workers exceeding threshold
  • Unknown errors: Raise RuntimeError for immediate attention

Key changes include:

  • New httpx_utils.py module with error type enum and result dataclass
  • Refactored error handling in worker.py to use structured error results
  • Worker deactivation logic in controller based on repeated server errors
  • Mock test utilities and comprehensive test suite for error scenarios

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
xtuner/v1/utils/httpx_utils.py New utility module defining HTTP error types, result dataclass, and status setting logic
xtuner/v1/utils/rl_test_utils.py Added mock worker classes for simulating different HTTP error scenarios in tests
xtuner/v1/ray/rollout/worker.py Refactored _safe_post_request and rollout_task methods to use structured error handling
xtuner/v1/ray/rollout/controller.py Added worker deactivation logic based on server error counts; moved @ray.remote decorator to call sites
xtuner/v1/ray/rollout/lmdeploy.py Removed @ray.remote decorator to allow dynamic application
xtuner/v1/ray/environment/base_env.py Added support for injecting pre-initialized controllers for testing
xtuner/v1/ray/environment/single_turn_env.py Enhanced environment initialization to support controller injection and handle skipped data
xtuner/v1/ray/dataflow/flow.py Added skipped sample tracking and updated termination conditions
xtuner/v1/ray/config/worker.py Added max_retry_per_worker configuration with automatic default calculation
xtuner/v1/data_proto/rl_data.py Updated check_dataflow_item to handle skipped status; added comment documenting finish_reason values
xtuner/v1/ray/evaluator.py Improved logging to use action_id instead of data object
tests/ray/test_rollout.py Updated tests to use dynamic ray.remote decorator; removed outdated error test
tests/ray/test_mock_rollout.py New comprehensive test suite for error handling scenarios
tests/ray/test_evaluator.py Removed outdated error handling test
Comments suppressed due to low confidence (1)

xtuner/v1/ray/environment/base_env.py:140

  • This import of module ray is redundant, as it was previously imported on line 4.
        import ray

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

xtuner/v1/ray/environment/base_env.py:140

  • This import of module ray is redundant, as it was previously imported on line 4.
        import ray

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

xtuner/v1/ray/environment/base_env.py:140

  • This import of module ray is redundant, as it was previously imported on line 4.
        import ray

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@YanhuiDua YanhuiDua requested a review from Copilot November 17, 2025 13:08
Copilot finished reviewing on behalf of YanhuiDua November 17, 2025 13:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

xtuner/v1/ray/environment/base_env.py:140

  • This import of module ray is redundant, as it was previously imported on line 4.
        import ray

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

xtuner/v1/ray/environment/base_env.py:145

  • This import of module ray is redundant, as it was previously imported on line 5.
        import ray

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@YanhuiDua YanhuiDua merged commit ecb3d66 into InternLM:main Nov 19, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants