WIP. Proof of concept for GPU accelerated genArea#44
hukumka wants to merge 2 commits into Cubitect:master
This implementation is a proof of concept and is missing:
+ Layers past L_SHORE_16
+ Support for different Minecraft versions
Thanks for the interest. I was always a little sceptical about performance with a GPU. Generating giant areas in one go might work reasonably well on a GPU, but the code is highly reliant on branching, which is like poison to a GPU and to SSE instructions. Also, I find myself needing small areas much more often than large ones, which makes this problem much worse. So I always leaned towards distributing the workload across CPU cores instead. That said, I'm quite interested to see what the performance would actually be using a GPU in different scenarios. While checking out your branch I found a bug in the cubiomes library that caused I found a couple of issues with the draft. I think at
    out[xx + 1 + zz * w] = (cs >> 24) & 1 ? v10 : v00;
}
int v;
if (v10 == v01 && v01 == v11) v = v10;
I did a few experiments one day trying to remove these branches, which come from select_mode_or_random. This is the alternative that worked best for me; it is about 25% faster than the "if cascade" on my CPU. I hope the difference is bigger on a GPU, but I can't try that myself:
if (v10 == v01 && v01 == v11) v = v10;
int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
int cv10 = (v10 == v01) + (v10 == v11);
int cv01 = v01 == v11;
if (cv00 > cv10 && cv00 > cv01) {
    v = v00;
} else if (cv10 > cv00) {
    v = v10;
} else if (cv01 > cv00) {
    v = v01;
} else {
    // v = random
}
This looks great! I see you did a lot of testing and the assembly does look significantly better, if only for the CPU. I did some rudimentary testing with CUDA C, and I was surprised that the improvement was only minor for a GPU. After some digging I found that the nvcc compiler manages to reduce the branching for this part of the device code quite well on its own (at least better than gcc).
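For anyone who wants to check the counting idea from this thread in isolation, here is a minimal sketch in plain C. It is not the library's code: `NO_MODE`, `mode_reference`, `mode_counting` and `check_all_inputs` are hypothetical names, and the "v = random" fallback is replaced by a sentinel so the result is deterministic. The reference function assumes the intended behaviour is "pick the value that occurs strictly most often among the four corners, otherwise fall back to random".

```c
#include <assert.h>

/* Hypothetical sentinel standing in for the "v = random" fallback. */
#define NO_MODE (-1)

/* Brute-force reference: returns the value that occurs strictly more
 * often than any other among the four inputs, or NO_MODE if there is
 * no unique most frequent value. */
static int mode_reference(int v00, int v10, int v01, int v11)
{
    int v[4] = { v00, v10, v01, v11 };
    int best = NO_MODE, best_count = 0, tie = 0;
    for (int i = 0; i < 4; i++) {
        int count = 0;
        for (int j = 0; j < 4; j++)
            count += (v[j] == v[i]);
        if (count > best_count) {
            best = v[i];
            best_count = count;
            tie = 0;
        } else if (count == best_count && v[i] != best) {
            tie = 1;
        }
    }
    return (best_count >= 2 && !tie) ? best : NO_MODE;
}

/* Counting variant from the review comment: compare pairwise-equality
 * counts instead of walking a long if/else cascade. */
static int mode_counting(int v00, int v10, int v01, int v11)
{
    int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
    int cv10 = (v10 == v01) + (v10 == v11);
    int cv01 = (v01 == v11);
    if (cv00 > cv10 && cv00 > cv01) return v00;
    if (cv10 > cv00) return v10;
    if (cv01 > cv00) return v01;
    return NO_MODE;
}

/* Compares both functions on every combination of four values drawn
 * from 0..3, which covers every possible equality pattern.
 * Returns the number of combinations checked. */
static int check_all_inputs(void)
{
    int checked = 0;
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            for (int c = 0; c < 4; c++)
                for (int d = 0; d < 4; d++) {
                    assert(mode_counting(a, b, c, d) ==
                           mode_reference(a, b, c, d));
                    checked++;
                }
    return checked;
}
```

Calling `check_all_inputs()` from a test or from `main` exercises all 256 combinations. The sketch only checks that the two functions agree; it says nothing about speed on either a CPU or a GPU.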
Hello, and thanks for this awesome library.
This PR is a step toward #18 and implements generation of areas using OpenCL.
Lacking features
Performance
When generating 64 seeds per routine, I observed a 30x speedup.
When generating 1 seed per routine, the speedup is only 5x.
Terribly sorry for dumping such a large chunk of code in a single PR, but I needed to see whether
my approach for avoiding recomputing the same layer multiple times works before submitting this.