Skip to content

Conversation

@davebayer
Copy link
Contributor

@davebayer davebayer commented Nov 14, 2025

Currently, we have 2 types of CUDA Driver calls:

  1. __meow(...) and
  2. __meowNoThrow(...)

I don't like the design, it forces us to repeat the implementations and forces us to have in/out parameters.

This PR refactors the driver APIs and keeps only 1 definition for each API. Each function is noexcept and returns std::expected-like type that returns the error + return values.

In case that throwing behaviour is demanded, _CCCL_TRY_DRIVER_API macro is provided, which checks the return value of the CUDA call and returns the output parameters. Alternatively, _CCCL_ASSERT_DRIVER_API can be used to assert that a call succeeded.

@davebayer davebayer requested review from a team as code owners November 14, 2025 16:04
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 14, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 14, 2025
@davebayer davebayer force-pushed the refactor_driver_api branch 2 times, most recently from 5285c51 to f19d5e6 Compare November 14, 2025 16:18
{
static auto __driver_fn = _CCCLRT_GET_DRIVER_FUNCTION(cuDeviceGetCount);
int __result;
::cuda::__driver::__call_driver_fn(__driver_fn, "Failed to get device count", &__result);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this implementation, we are loosing the API name and error messages. I am wondering whether we could just add a __api_ and __error_msg_ members to the __call_result structure and use it in __check_and_extract_result function, it should be really easy for the compiler to optimize it out

@fbusato
Copy link
Contributor

fbusato commented Nov 14, 2025

ok, now I'm confused. In another PR you suggested me to avoid std::span and use raw arrays to keep the implementation minimal. In this PR, we are introducing std::expected which is way more complex.

@leofang
Copy link
Member

leofang commented Nov 14, 2025

Non-blocking comments:

  • I discussed with @pciolkosz a while back about this, my opinion is that the entire driver_api.h should just be auto-generated. We can already do this same thing for cuda-bindings (here), so this is trivial, and we have a plan to expand it further
  • it has occurred to me that _CCCLRT_GET_DRIVER_FUNCTION does not cache the retrieved function pointers. There will be a small performance penalty. (It's the reason that we cached them in cuda-bindings.)

@davebayer
Copy link
Contributor Author

ok, now I'm confused. In another PR you suggested me to avoid std::span and use raw arrays to keep the implementation minimal. In this PR, we are introducing std::expected which is way more complex.

... returns std::expected-like type ...

I discussed with @pciolkosz a while back about this, my opinion is that the entire driver_api.h should just be auto-generated.

I don't think the file to auto generate the header would contain less code, because we still want to adapt the API to our needs.

it has occurred to me that _CCCLRT_GET_DRIVER_FUNCTION does not cache the retrieved function pointers

We store the function addresses in static members inside the functions, _CCCLRT_GET_DRIVER_FUNCTION is called only once per program launch.

@leofang
Copy link
Member

leofang commented Nov 14, 2025

I don't think the file to auto generate the header would contain less code, because we still want to adapt the API to our needs.

It is about avoiding hand-writing driver_api.h in a future-proof, safer fashion. The call sites still needs to be hand-written.

Hand-writing bindings are error prone and not scalable as we expand the project scope and coverage.

We store the function addresses in static members inside the functions, _CCCLRT_GET_DRIVER_FUNCTION is called only once per program launch.

Function-local static variables are what I missed, thanks!

@github-actions

This comment has been minimized.

Comment on lines +323 to +333
const auto __needs_load = ::cuda::__driver::__version_at_least(12, 4);
if (__needs_load.__error_ != ::CUDA_SUCCESS)
{
return static_cast<::cudaError_t>(__needs_load.__error_);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these not _CCCL_TRY_DRIVER_API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a question for @pciolkosz, all of the functions applying the launch options to a given kernel launch are marked as noexcept

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a leftover consequence of an early design, where I had only the top level interface throwing and internal function always noexcept in case we wanted to implement a exception-free alternative top level interface. But we decided its not a goal anymore, so we could just throw here


template <class _Tp>
[[nodiscard]] _CCCL_HOST_API constexpr _Tp __check_and_extract_result(
__call_result<_Tp> __result, ::cuda::std::source_location __loc = ::cuda::std::source_location::current())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question should this take by

Suggested change
__call_result<_Tp> __result, ::cuda::std::source_location __loc = ::cuda::std::source_location::current())
__call_result<_Tp>&& __result, ::cuda::std::source_location __loc = ::cuda::std::source_location::current())

and forward_like below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the types returned by the driver APIs are trivial, so I don't think it makes any difference

{
::CUresult __error_;
_Tp __value_;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is missing at least one feature

Suggested change
};
_CCCL_API constexpr operator bool() const noexcept {
return __error_ == ::CUDA_SUCCESS;
}
};

We might also want to think about at least one monadic interface:

Suggested change
};
_CCCL_API constexpr _Tp or_else(_Tp __alternative) const noexcept {
if (__error_ == ::CUDA_SUCCESS) {
return __value_;
}
else
{
return __alternative;
}
}
};

This would be handy if there is a fallible API with a sensible default value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree. I don't think the operator bool does make things more clean. Examples from the code:

  1. Not ideally named variables
const auto __needs_load = _CCCL_TRY_DRIVER_API(__version_at_least(12, 4));
if (!__needs_load)
{
  return __needs_load.__error_;
}
  1. What's actually being checked here?
const auto __mem_type = ::cuda::__driver::__pointerGetAttribute<::CU_POINTER_ATTRIBUTE_MEMORY_TYPE>(__p1);
return __mem_type || __mem_type.__value_ == ::CU_MEMORYTYPE_HOST;

I expect us to use the macros in 99% of the time. I would suggest just using the result.__error_ and result.__value_ members directly for other cases.

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 17, 2025
: stream_ref(::cuda::__invalid_stream())
{
[[maybe_unused]] __ensure_current_context __ctx_setter(__dev);
__stream = ::cuda::__driver::__streamCreateWithPriority(cudaStreamNonBlocking, __priority);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor concern. This changes the original behavior of the driver call wrt. exceptions. A developer could still think that direct driver calls throw

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. On the other a developer using cuda::__driver::__meow(...) call in a noexcept function thinking it just does what cuMeow(...) would do is even worse in my opinion

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Contributor

🥳 CI Workflow Results

🟩 Finished in 2h 10m: Pass: 100%/120 | Total: 1d 15h | Max: 1h 25m | Hits: 91%/231199

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

5 participants