Fix nvptx Kernel return type lowering #771

Open
BI71317 wants to merge 3 commits into exaloop:develop from BI71317:develop

Conversation

BI71317 commented Mar 9, 2026

Fixes #770

Changes

  • detect NVPTX kernel entry functions during lowering
  • emit an explicit void return type for final kernel entry functions
  • preserve existing higher-level semantics before final kernel lowering

Validation

  • verified that NVPTX kernel entry functions are now emitted with explicit void return types
  • verified that the resulting kernels still compile and execute successfully

Observed IR

define dso_local void @hello_0_0_std_internal_types_array_List_0_int__std_internal_types_array_List_0_int__std_internal_types_array_List_0_int__(ptr nocapture readonly %0, ptr nocapture readonly %1, ptr nocapture readonly %2) local_unnamed_addr #1 {
...
  ret void
}
...

BI71317 requested a review from inumanag as a code owner March 9, 2026 01:01

cla-bot bot commented Mar 9, 2026

Thank you for your pull request. We require contributors to agree to our Contributor License Agreement (https://exaloop.io/legal/cla), and we don't have @BI71317 on file. In order for us to review and merge your code, please email info@exaloop.io to get yourself added.

BI71317 (Author) commented Mar 9, 2026

I’ve already signed the CLA and just sent a quick note to info@exaloop.io with my GitHub username and PR link 🙂

arshajii (Contributor) commented Mar 9, 2026

@cla-bot recheck

arshajii (Contributor) commented Mar 9, 2026

Thanks for the PR! I think the better way to do this would be at the LLVM level during NVPTX codegen (see gpu.cpp). Specifically, we can have an LLVM transformation that converts kernel return types to void and updates ret instructions appropriately.

Currently the Codon IR void type is basically unused and might be phased out in the near future, so best to avoid it if possible. Happy to suggest specific changes / help out with this change as needed.
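
The suggested transformation can be illustrated on textual IR. This is only a toy sketch: it rewrites IR text line by line, whereas a real pass in gpu.cpp would operate on LLVM function objects. It assumes the kernel originally returns the empty struct type `{}`; the function name `normalize_kernel_return` and the sample IR in the usage note are hypothetical.

```python
def normalize_kernel_return(ir, kernel):
    """Toy rewrite: give `kernel` a void return type if it returns `{}`.

    Assumption: the kernel returns the empty struct `{}`, so the returned
    value carries no data and can simply be dropped.
    """
    old_sig = f" {{}} @{kernel}("   # e.g. " {} @k("
    new_sig = f" void @{kernel}("
    out = []
    inside = False  # True while scanning the body of the rewritten kernel
    for line in ir.splitlines():
        if line.startswith("define ") and old_sig in line:
            # Step 1: change the declared return type to void.
            line = line.replace(old_sig, new_sig, 1)
            inside = True
        elif inside and line.strip() == "}":
            inside = False
        elif inside and line.strip().startswith("ret {} "):
            # Step 2: update each ret instruction to match the new type.
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + "ret void"
        out.append(line)
    return "\n".join(out)
```

For example, applied to `define {} @k(i64 %0) {` followed by `ret {} zeroinitializer`, the sketch yields `define void @k(i64 %0) {` and `ret void`, and leaves other functions in the module untouched.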

BI71317 requested a review from arshajii as a code owner March 10, 2026 02:46
cla-bot bot added the cla-signed label Mar 10, 2026

BI71317 (Author) commented Mar 10, 2026

Got it, thanks for the guidance. I’ve reworked the change: instead of converting kernel return types to void at the high-level Codon IR stage, the LLVM IR is now rewritten in gpu.cpp during applyGPUTransformations, before it is passed to NVPTX codegen.

Minimal Reproducer

import numpy as np
import gpu
a = np.arange(16)
b = np.arange(16) * 2
c = np.empty(16, dtype=int)

@gpu.kernel
def vadd(a, b, c, n):
    i = gpu.thread.x
    # i = ocl.thread.x
    if i < n:
        c[i] = a[i] + b[i]

vadd(a, b, c, 16, grid=1, block=16)
print(a)
print(b)
print(c)
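
For comparison, the kernel's per-thread logic can be emulated on the host in plain Python. This is a rough sketch and not Codon's gpu API: `vadd_thread` and `launch` are hypothetical stand-ins, and the loop runs the "threads" sequentially rather than in parallel.

```python
def vadd_thread(i, a, b, c, n):
    # Same body as the kernel, with the thread index passed in explicitly.
    if i < n:
        c[i] = a[i] + b[i]


def launch(block, a, b, c, n):
    # Emulate a single block by running one "thread" per index in order.
    for i in range(block):
        vadd_thread(i, a, b, c, n)


a = list(range(16))
b = [2 * x for x in range(16)]
c = [0] * 16
launch(16, a, b, c, 16)
print(c)  # each element equals 3 * i
```

A correct GPU run of the reproducer should print the same values for `c`.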

Observed IR

; ModuleID = 'codon'
source_filename = "/home/swchoi/src/test_code/codon_gpu_programming/vadd_np.codon"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare noundef i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(readwrite, inaccessiblemem: write)
define dso_local void @"vadd.0:0[std.numpy.ndarray.ndarray.0[int,1],std.numpy.ndarray.ndarray.0[int,1],std.numpy.ndarray.ndarray.0[int,1],int]"({ { i64 }, { i64 }, ptr } %0, { { i64 }, { i64 }, ptr } %1, { { i64 }, { i64 }, ptr } %2, i64 %3) local_unnamed_addr #1 {
entry:
  %res.i.i = tail call range(i32 0, 1024) i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %4 = zext nneg i32 %res.i.i to i64
  %tmp.i32 = icmp sgt i64 %3, %4
  br i1 %tmp.i32, label %if.true, label %if.exit

if.true:                                          ; preds = %entry
  %.fca.2.extract.i.i.i = extractvalue { { i64 }, { i64 }, ptr } %0, 2
  %5 = extractvalue { { i64 }, { i64 }, ptr } %0, 0
  %6 = extractvalue { { i64 }, { i64 }, ptr } %0, 1
  %.fca.0.extract.i.i.i.i = extractvalue { i64 } %5, 0
  %tmp.i18.not.i = icmp sgt i64 %.fca.0.extract.i.i.i.i, %4
  tail call void @llvm.assume(i1 %tmp.i18.not.i)
  %.fca.0.extract.i99.i.i.i = extractvalue { i64 } %6, 0
  %tmp.i14.i.i.i.i = mul i64 %.fca.0.extract.i99.i.i.i, %4
  %7 = getelementptr i8, ptr %.fca.2.extract.i.i.i, i64 %tmp.i14.i.i.i.i
  %8 = load i64, ptr %7, align 4
  %.fca.2.extract.i.i.i33 = extractvalue { { i64 }, { i64 }, ptr } %1, 2
  %9 = extractvalue { { i64 }, { i64 }, ptr } %1, 0
  %10 = extractvalue { { i64 }, { i64 }, ptr } %1, 1
  %.fca.0.extract.i.i.i.i34 = extractvalue { i64 } %9, 0
  %tmp.i18.not.i1 = icmp sgt i64 %.fca.0.extract.i.i.i.i34, %4
  tail call void @llvm.assume(i1 %tmp.i18.not.i1)
  %.fca.0.extract.i99.i.i.i35 = extractvalue { i64 } %10, 0
  %tmp.i14.i.i.i.i36 = mul i64 %.fca.0.extract.i99.i.i.i35, %4
  %11 = getelementptr i8, ptr %.fca.2.extract.i.i.i33, i64 %tmp.i14.i.i.i.i36
  %12 = load i64, ptr %11, align 4
  %tmp.i = add i64 %12, %8
  %.fca.2.extract109.i.i.i = extractvalue { { i64 }, { i64 }, ptr } %2, 2
  %13 = extractvalue { { i64 }, { i64 }, ptr } %2, 0
  %14 = extractvalue { { i64 }, { i64 }, ptr } %2, 1
  %.fca.0.extract.i.i.i.i37 = extractvalue { i64 } %13, 0
  %tmp.i18.not.i2 = icmp sgt i64 %.fca.0.extract.i.i.i.i37, %4
  tail call void @llvm.assume(i1 %tmp.i18.not.i2)
  %.fca.0.extract.i141.i.i.i = extractvalue { i64 } %14, 0
  %tmp.i14.i.i.i.i38 = mul i64 %.fca.0.extract.i141.i.i.i, %4
  %15 = getelementptr i8, ptr %.fca.2.extract109.i.i.i, i64 %tmp.i14.i.i.i.i38
  store i64 %tmp.i, ptr %15, align 4
  br label %if.exit

if.exit:                                          ; preds = %if.true, %entry
  ret void
}

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: write)
declare void @llvm.assume(i1 noundef) #2

attributes #0 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #1 = { mustprogress nofree norecurse nosync nounwind willreturn memory(readwrite, inaccessiblemem: write) "frame-pointer"="none" "kernel" "target-cpu"="meteorlake" "target-features"="+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,-amx-fp8,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,-avx512ifma,+xsave,+sse4.2,-tsxldtrk,-sm3,-ptwrite,-widekl,-movrs,+invpcid,+64bit,+xsavec,-avx10.1-512,-avx512vpopcntdq,+cmov,-avx512vp2intersect,-avx512cd,+movbe,-avxvnniint8,-ccmp,-amx-int8,-kl,-avx10.1-256,-sha512,+avxvnni,-rtm,+adx,+avx2,-hreset,+movdiri,+serialize,+vpclmulqdq,-avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,-avx10.2-256,+gfni,-avxvnniint16,-amx-fp16,-zu,-ndd,+xsaveopt,+rdrnd,-avx512f,-amx-bf16,-avx512bf16,-avx512vnni,-push2pop2,+cx8,-avx512bw,+sse3,-pku,-nf,-amx-tf32,-amx-avx512,+fsgsbase,-clzero,-mwaitx,-lwp,+lzcnt,+sha,+movdir64b,-ppx,-wbnoinvd,-enqcmd,-amx-transpose,-avx10.2-512,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,-avx512bitalg,-rdpru,+clwb,+mmx,+sse2,+rdseed,-avx512vbmi2,-prefetchi,-amx-movrs,+rdpid,-fma4,-avx512vbmi,+shstk,+vaes,+waitpkg,-sgx,+fxsr,-avx512dq,-sse4a" }
attributes #2 = { nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: write) }

!llvm.module.flags = !{!0}
!nvvm.annotations = !{!1}
!llvm.ident = !{!2}
!nvvmir.version = !{!3}

!0 = !{i32 2, !"Debug Info Version", i32 3}
!1 = !{ptr @"vadd.0:0[std.numpy.ndarray.ndarray.0[int,1],std.numpy.ndarray.ndarray.0[int,1],std.numpy.ndarray.ndarray.0[int,1],int]", !"kernel", i32 1}
!2 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
!3 = !{i32 2, i32 0}
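
A dump like the one above can also be checked mechanically. The following is a minimal sketch, assuming kernels are tagged with a `"kernel"` entry in `!nvvm.annotations` as in this module; `kernel_names` and `non_void_kernels` are hypothetical helper names, not part of Codon.

```python
import re

# Matches entries such as: !{ptr @name, !"kernel", i32 1}
# The name may be a quoted string or a plain LLVM identifier.
KERNEL_ANNOTATION = re.compile(r'!\{ptr @("[^"]+"|[\w.$-]+), !"kernel", i32 1\}')


def kernel_names(ir):
    """Return the function names tagged as kernels in the IR text."""
    return [m.group(1) for m in KERNEL_ANNOTATION.finditer(ir)]


def non_void_kernels(ir):
    """Return kernels whose `define` line is missing or not void-returning."""
    bad = []
    lines = ir.splitlines()
    for name in kernel_names(ir):
        marker = f"@{name}("
        define = next(
            (l for l in lines if l.startswith("define ") and marker in l), None
        )
        if define is None:
            bad.append(name)
            continue
        # Everything before "@name(" ends with the return type.
        prefix = define.split(marker)[0].rstrip()
        if not prefix.endswith(" void"):
            bad.append(name)
    return bad
```

An empty list from `non_void_kernels` means every annotated kernel in the dump is defined with a `void` return type.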

BI71317 (Author) commented Mar 17, 2026

Hi! I left an update on this PR a little while ago, but it may have slipped through the cracks.

I reworked the implementation based on the earlier feedback. Would appreciate another look whenever someone has time. Thanks!

arshajii (Contributor) left a comment

Just left a few minor review comments. LGTM overall!

  return std::vector<llvm::GlobalValue *>(keep.begin(), keep.end());
}

static bool isEmptyStructType(llvm::Type *ty) {

Don't need static if these are in namespace {}.


static bool isEmptyStructType(llvm::Type *ty) {
  auto *st = llvm::dyn_cast<llvm::StructType>(ty);
  return st && st->getNumElements() == 0;

I believe we should also check !st->hasName().

}

static llvm::Function *normalizeKernelReturnToVoid(llvm::Function *F) {
  if (!F || F->isDeclaration())

Can we merge these conditions via || into a single if statement?

  std::vector<llvm::Function *> kernelCandidates;
  std::vector<llvm::GlobalValue *> kernels;

Please format with clang-format --style=file -i codon/cir/llvm/gpu.cpp, which should remove trailing whitespace.

  return st && st->getNumElements() == 0;
}

static llvm::Function *normalizeKernelReturnToVoid(llvm::Function *F) {

Similarly, don't need static here.

BI71317 (Author) commented Mar 24, 2026

Thanks for the review and suggestions. I’ve addressed the requested changes and pushed a new commit.

Development

Successfully merging this pull request may close these issues.

Fix NVPTX kernel entry return type lowering

2 participants