So, my Photon Mapping experiments led me from CUDA to OpenCL (for better compatibility) and now to Vulkan for even better compatiblity. OpenCL is just not cutting it anymore: abandoned by Apple, ignored by AMD. The latter I found out when I tried to use OpenCL on a mobile Ryzen iGPU under Linux: no go!
So Vulkan it is. I have always found Vulkan to be intimidating. It is just too much, before you will ever see a single triangle on your screen. I am hopeful that Vulkan Compute is less cumbersome, and it seems to be that way.
You still needs heaps and heaps of configuration and setup code, though. And I am slowly making my way through that. You can follow my progress at MinimalVulkanCompute github.
Some other sources for inspiration are Sascha Willems' example and Neil Henning's example
Vulkan uses SPIR-V modules to represent the compute kernels. So I need to port my OpenCL or CUDA code to this. I think I can just transpile my OpenCL code into SPIR-V using Google's clspv project.
After building clspv it appears that it ingests my OpenCL code pretty well, as it manages to create SPIR-V output. So far so good. I still need to execute it in my Vk code. I wonder how much performance is lost by the intermediate step, compared to a native OpenCL driver?
Next order of business is to figure out if I should create staging buffers. For an iGPU, all memory is the same. But for a dGPU, I have a lot of options for which type of memory to use. For instance, using my RX580 Radeon dGPU, I see:
$ ./minimal_vulkan_compute Found 3 physical devices. 8086:4680 iGPU Intel(R) UHD Graphics 770 (ADL-S GT1) 1002:67df dGPU AMD Radeon RX 580 Series (RADV POLARIS10) 10005:0000 CPU llvmpipe (LLVM 15.0.7, 256 bits) 7 mem types. 2 mem heaps. 4096 MiB of local memory [ device-local ] 4096 MiB of local memory [ device-local ] 15908 MiB of non-local memory [ host-visible host-coherent ] 4096 MiB of local memory [ device-local host-visible host-coherent ] 4096 MiB of local memory [ device-local host-visible host-coherent ] 15908 MiB of non-local memory [ host-visible host-coherent host-cached ] 15908 MiB of non-local memory [ host-visible host-coherent host-cached ]
UPDATE
I have the vulkan port finally working as it should. Some things that tripped me up: A Vulkan kernel gets the work group size from the SPIRV. Whereas an OpenCL kernel can just set it in the clEnqueueNDRangeKernel() call, at client side. This makes the clspv route a little tricky: we need to spec the workgroup size in the CL kernel source, using __attribute__((reqd_work_group_size(x, y, z)))
The end result is that the CL -> clspv -> SPIRV -> Vulkan overhead takes quite a bit of performance away. The overhead is tolerable on nvidia RTX, but it is too much on Intel:
NVIDIA GeForce RTX 3060 Laptop GPU NVIDIA Corporation with [30 units] OPENCL: rayt: 2953 µs boun: 2989 µs binp: 1458 µs CLSPV+VULKAN: rayt: 3543 μs boun: 3586 μs binp: 1224 μs
Intel(R) UHD Graphics 770 (ADL-S GT1) OPENCL: rayt: 22017 µs boun: 20927 µs binp: 7310 µs CLSPV+VULKAN: rayt: 44635 μs boun: 40133 μs binp: 8490 μs
CONCLUSION
Depending on the platform, my compute time goes up between +20% and +100% when using transpiled OpenCL kernels via clspv. I should also mention that I have found the performance difference between CUDA and OpenCL insignificant.