Skip to content

Conversation

@liqiangxl
Copy link
Collaborator

To avoid error on hardware allows a small number of threads per sm

@liqiangxl liqiangxl added the enable-auto-merge Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures label Jan 13, 2026
@liqiangxl
Copy link
Collaborator Author

!test

@liqiangxl liqiangxl requested a review from rdspring1 January 13, 2026 17:29
@github-actions
Copy link

github-actions bot commented Jan 13, 2026

Auto-merge Status

✅ Internal CI is finished
❌ No failed checks (nvfuser-ci/smoke_test_wheel_nightly_20)
✅ PR is mergeable
ℹ️ PR mergeable_state: unstable

Description

  • Add hardware constraint validation for threads per SM calculation

  • Use std::min to limit threads by register usage and maxThreadsPerMultiProcessor

  • Prevent potential errors on hardware with limited threads per SM

  • Add explanatory comment for the constraint check

Changes walkthrough

Relevant files
Bug fix
normalization_inner_outer_multi_wave.cpp
Add hardware constraint validation for threads per SM       

csrc/scheduler/normalization_inner_outer_multi_wave.cpp

  • Modified threads_per_sm calculation to include hardware constraint
    check
  • Added std::min comparison between register-based calculation and
    maxThreadsPerMultiProcessor
  • Added explanatory comment documenting the hardware constraint
    limitation
  • Prevents errors on hardware with limited threads per SM
  • +4/-1     

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review
    Missing Tests

    The PR adds important hardware constraint checking but lacks corresponding tests to validate the behavior. Tests should verify that the threads per SM calculation correctly respects both register-based constraints and hardware limits, especially on hardware with low thread capacity.

    int64_t threads_per_sm = std::min(
        getThreadsPerSMGivenRegPerThread(reg_per_thread),
        static_cast<int64_t>(dev_prop->maxThreadsPerMultiProcessor));
    Performance Impact

    While the hardware constraint check is necessary for correctness, it may impact performance on hardware with conservative thread limits. The PR should include performance data to quantify any potential regressions and ensure the change doesn't significantly degrade performance on supported hardware.

    int64_t threads_per_sm = std::min(
        getThreadsPerSMGivenRegPerThread(reg_per_thread),
        static_cast<int64_t>(dev_prop->maxThreadsPerMultiProcessor));

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Jan 13, 2026

    Greptile Overview

    Greptile Summary

    Added std::min() constraint to limit threads_per_sm by maxThreadsPerMultiProcessor hardware property when calculating grid dimensions.

    • Previously, getThreadsPerSMGivenRegPerThread() could return a value exceeding hardware limits on devices with constrained thread capacity
    • The fix prevents runtime errors by capping threads per SM to the hardware maximum
    • This is a defensive fix that protects against edge cases on hardware with limited threading capabilities

    Confidence Score: 5/5

    • This PR is safe to merge - it adds a defensive constraint that prevents hardware limit violations
    • The change is minimal, well-targeted, and adds a safety check without altering core logic. The std::min() constraint ensures calculated thread counts never exceed hardware capabilities, preventing potential runtime errors on constrained devices
    • No files require special attention

    Important Files Changed

    File Analysis

    Filename Score Overview
    csrc/scheduler/normalization_inner_outer_multi_wave.cpp 5/5 Added hardware constraint check to prevent exceeding maxThreadsPerMultiProcessor when calculating threads per SM

    Sequence Diagram

    sequenceDiagram
        participant Caller as getHeuristics()
        participant Lambda as getGdimy()
        participant Utils as getThreadsPerSMGivenRegPerThread()
        participant DevProp as CUDA Device Properties
        participant BlockCalc as getBlocksPerSM()
        
        Caller->>Lambda: Call with inner_vect, threads_per_block, inner_batch
        Lambda->>Lambda: Calculate reg_per_thread from register usage
        Lambda->>Utils: getThreadsPerSMGivenRegPerThread(reg_per_thread)
        Utils->>DevProp: Query register allocation properties
        Utils-->>Lambda: Return threads_per_sm (based on registers)
        Lambda->>DevProp: Query maxThreadsPerMultiProcessor
        DevProp-->>Lambda: Return hardware limit
        Lambda->>Lambda: threads_per_sm = min(register_based, hardware_limit)
        Note over Lambda: NEW: Prevents exceeding hardware constraints
        Lambda->>BlockCalc: getBlocksPerSM(threads_per_sm, threads_per_block, warpSize)
        BlockCalc-->>Lambda: Return blocks_per_sm
        Lambda->>Lambda: Calculate gdimy = blocks_per_sm * multiprocessor_count
        Lambda->>Lambda: Apply outer_iter_min constraint
        Lambda-->>Caller: Return final gdimy value
    
    Loading

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    No files reviewed, no comments

    Edit Code Review Agent Settings | Greptile

    @liqiangxl liqiangxl merged commit 4b31086 into main Jan 14, 2026
    63 checks passed
    @liqiangxl liqiangxl deleted the llu/threads_per_sm branch January 14, 2026 01:46
    @github-actions github-actions bot removed the enable-auto-merge Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures label Jan 14, 2026
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants