Saturday, September 9, 2023

Vulkan Compute

So, my Photon Mapping experiments led me from CUDA to OpenCL (for better compatibility) and now to Vulkan, for even better compatibility. OpenCL is just not cutting it anymore: abandoned by Apple, ignored by AMD. The latter I found out when I tried to use OpenCL on a mobile Ryzen iGPU under Linux: no go!

So Vulkan it is. I have always found Vulkan intimidating: there is just so much to do before you ever see a single triangle on your screen. I am hopeful that Vulkan Compute is less cumbersome, and so far it seems to be.

You still need heaps and heaps of configuration and setup code, though. I am slowly making my way through that. You can follow my progress at MinimalVulkanCompute on GitHub.
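
To give a flavour of that setup code, here is a minimal sketch (in C, with error handling mostly stripped) of the very first step: creating an instance and enumerating the physical devices. It is roughly the kind of code that produces the device listing shown further down.

#include <stdio.h>
#include <vulkan/vulkan.h>

int main(void)
{
    VkApplicationInfo app = { VK_STRUCTURE_TYPE_APPLICATION_INFO };
    app.pApplicationName = "minimal_vulkan_compute";
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ici = { VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    ici.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ici, NULL, &instance) != VK_SUCCESS)
        return 1;

    /* Two-call idiom: first get the count, then the handles. */
    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);
    VkPhysicalDevice devs[8];
    if (count > 8) count = 8;
    vkEnumeratePhysicalDevices(instance, &count, devs);
    printf("Found %u physical devices.\n", count);

    for (uint32_t i = 0; i < count; ++i)
    {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(devs[i], &props);
        printf("%04x:%04x %s\n", props.vendorID, props.deviceID, props.deviceName);
    }

    vkDestroyInstance(instance, NULL);
    return 0;
}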

Some other sources of inspiration are Sascha Willems' example and Neil Henning's example.

Vulkan uses SPIR-V modules to represent compute kernels, so I need to port my OpenCL or CUDA code to that format. I think I can just transpile my OpenCL code into SPIR-V using Google's clspv project.
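
In its simplest form the invocation looks like this (the file name is just an example; flags vary with what your kernel needs):

$ clspv kernel.cl -o kernel.spv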

After building clspv, it appears to ingest my OpenCL code pretty well: it manages to create SPIR-V output. So far so good. I still need to execute it from my Vulkan code. I wonder how much performance is lost to the intermediate step, compared to a native OpenCL driver?
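
Getting the generated SPIR-V into Vulkan is the easy part. A sketch, assuming the .spv bytes have already been read from disk:

#include <stddef.h>
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Wrap raw SPIR-V (e.g. clspv output) in a shader module.
   size is in bytes and must be a multiple of 4. */
static VkShaderModule make_module(VkDevice device, const void *spirv, size_t size)
{
    VkShaderModuleCreateInfo smci = { VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO };
    smci.codeSize = size;
    smci.pCode = (const uint32_t *)spirv;
    VkShaderModule module = VK_NULL_HANDLE;
    vkCreateShaderModule(device, &smci, NULL, &module);
    return module;
}

The module then goes into a VkPipelineShaderStageCreateInfo with stage VK_SHADER_STAGE_COMPUTE_BIT, where pName is, as far as I can tell, simply the name of the OpenCL kernel function.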

The next order of business is to figure out whether I should create staging buffers. For an iGPU, all memory is the same, but for a dGPU I have a lot of options for which type of memory to use. For instance, with my Radeon RX 580 dGPU in the system, I see:

$ ./minimal_vulkan_compute 
Found 3 physical devices.
8086:4680 iGPU Intel(R) UHD Graphics 770 (ADL-S GT1)
1002:67df dGPU AMD Radeon RX 580 Series (RADV POLARIS10)
10005:0000 CPU llvmpipe (LLVM 15.0.7, 256 bits)
7 mem types. 2 mem heaps.
4096 MiB of local memory [ device-local ]
4096 MiB of local memory [ device-local ]
15908 MiB of non-local memory [ host-visible host-coherent ]
4096 MiB of local memory [ device-local host-visible host-coherent ]
4096 MiB of local memory [ device-local host-visible host-coherent ]
15908 MiB of non-local memory [ host-visible host-coherent host-cached ]
15908 MiB of non-local memory [ host-visible host-coherent host-cached ]
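
Whatever I end up doing, the standard idiom for picking one of those memory types is short enough. A sketch, where type_bits comes from the memoryTypeBits field filled in by vkGetBufferMemoryRequirements():

#include <stdint.h>
#include <vulkan/vulkan.h>

/* Find a memory type allowed by type_bits that has all 'wanted'
   property flags, e.g. DEVICE_LOCAL for the working buffers, or
   HOST_VISIBLE | HOST_COHERENT for a staging buffer. */
static uint32_t find_mem_type(VkPhysicalDevice phys, uint32_t type_bits,
                              VkMemoryPropertyFlags wanted)
{
    VkPhysicalDeviceMemoryProperties mp;
    vkGetPhysicalDeviceMemoryProperties(phys, &mp);
    for (uint32_t i = 0; i < mp.memoryTypeCount; ++i)
        if ((type_bits & (1u << i)) &&
            (mp.memoryTypes[i].propertyFlags & wanted) == wanted)
            return i;
    return UINT32_MAX; /* no suitable type */
}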

UPDATE

I finally have the Vulkan port working as it should. One thing that tripped me up: a Vulkan kernel gets its work group size from the SPIR-V module, whereas an OpenCL kernel can have it set client-side, in the clEnqueueNDRangeKernel() call. This makes the clspv route a little tricky: we need to specify the work group size in the CL kernel source, using __attribute__((reqd_work_group_size(x, y, z))), as in the snippet below.
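
For illustration, a hypothetical kernel with the attribute baked in:

/* The work group size is now fixed in the source, and clspv bakes it
   into the SPIR-V module; it can no longer be chosen at enqueue time. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void scale(__global const float *in, __global float *out)
{
    const size_t i = get_global_id(0);
    out[i] = in[i] * 2.0f;
}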

The end result is that the CL -> clspv -> SPIR-V -> Vulkan route takes quite a bit of performance away. The overhead is tolerable on an NVIDIA RTX, but it is too much on Intel:

NVIDIA GeForce RTX 3060 Laptop GPU NVIDIA Corporation with [30 units]
  OPENCL:
    rayt:  2953 µs
    boun:  2989 µs
    binp:  1458 µs
  CLSPV+VULKAN:
    rayt:  3543 µs
    boun:  3586 µs
    binp:  1224 µs
Intel(R) UHD Graphics 770 (ADL-S GT1)
  OPENCL:
    rayt: 22017 µs
    boun: 20927 µs
    binp:  7310 µs
  CLSPV+VULKAN:
    rayt: 44635 µs
    boun: 40133 µs
    binp:  8490 µs

CONCLUSION

Depending on the platform, my compute time goes up by between +20% and +100% when using OpenCL kernels transpiled via clspv. I should also mention that I have found the performance difference between CUDA and OpenCL to be insignificant.