Conversation
Now, `ceilfracf` does what its name suggests. Incidentally, it is no longer needed. `cuda_kernel_softMaskBackgroundValue`, `cuda_kernel_cosineFilter`, and `cuda_kernel_softMaskOutsideMap` now do proper grid-stride iteration.
Hi James, The `ceilfracf` function is only used in index management. The cases where there is no remainder in the division are unlikely to occur, and even if they do, they only add a tiny performance penalty; the numerical results are unaffected. However, making the suggested change would add a performance penalty that would always occur, due to the added floating-point division, in addition to reduced numerical accuracy for large numbers. That function is a common way of doing index management in kernels for this reason.
Thanks Dari, On review, I must concede that …
Hello Sjors, Dari, & co.!
I have come across what I believe to be a bug that affects pixel sampling in `cuda_kernel_softMaskBackgroundValue` and `cuda_kernel_cosineFilter`. These functions make use of `ceilfracf`:

relion/src/acc/cuda/cuda_kernels/cuda_device_utils.cuh
Line 42 in dcab793
So far as I can tell from its name and the places it is called from, `ceilfracf(a, b)` should return the least integer `n` such that `n * b >= a`. But, as is clear from its definition, it does not do that. In cases where `b` divides `a` with no remainder, the present implementation will return the intended result + 1.
We want to return not `a / b + 1` but `ceilf(float(a) / float(b))`. Or, without casting to `float`: `a / b + bool(a % b)` (assuming `T1` and `T2` are integral types, which is true at all the present call sites).

Now, why does this matter? As I said, `ceilfracf` is called from a handful of functions, including `cuda_kernel_softMaskBackgroundValue` and `cuda_kernel_cosineFilter`. In these places, it is being used to calculate the number of strides needed to iterate over some image data, given some number of parallel CUDA threads. In the pathological case, when the image size is divisible by the number of threads, there will be too great a gap between where the threads start, and pixels will be missed. For instance, given a 32 × 32 × 32 image, calling `cuda_kernel_softMaskBackgroundValue` with 128 threads per block and 128 blocks per grid (as is currently done), ~10k out of ~30k pixels will be ignored. The situation is less disastrous for larger images, since as `a` (the image size) increases, the relative error in `ceilfracf(a, b)` decreases. Given a 64 × 64 × 64 image, "only" ~15k out of ~260k (6%) get missed.

So, the most obvious fix is to change `ceilfracf` in the manner described above.

This brings me to my next point. Do we even need `ceilfracf`? It took me some time to convince myself that (when `ceilfracf` does what it should) `cuda_kernel_softMaskBackgroundValue` samples each pixel in the image exactly once. As it stands now, threads iterate over `vol` in block-sized strides. Each thread block passes over a different subspan of `vol`, and each thread within a block samples that subspan once every `SOFTMASK_BLOCK_SIZE` pixels. The loop that controls this iteration increments two things: `texel`, and a separate counter `pass`, which is unused in the body of the loop. The loop checks two conditions on every iteration: whether `texel` has gone past the end of `vol`, and whether `pass` has gone past `texel_pass_num` (which, as I have explained, will sometimes be off by 1). There need only be one iterator to increment.

`cuda_kernel_softMaskBackgroundValue` has been like this since its inception in 2016. It looks to me like the intention is to do a grid-stride loop. So, that is what I have tried to implement. I have taken the opportunity to get rid of the shared-memory array `img_pixels`, for which I saw no use, and to introduce a closure `weight`, defined outside the loop body and invoked within it. Now, there is not even any need for `ceilfracf`. We can dispense with it. I have similarly rewritten `cuda_kernel_softMaskOutsideMap` and `cuda_kernel_cosineFilter`. There are other functions that make use of `ceilfracf`, but they only do block-stride iteration. I think they are safe.

Best,
James