Once again, I find myself porting CUDA code to OpenCL. And I had to remind myself of the process, and the differences between the two. Let's document it, so future Bram can catch up quickly again.
The major difference is of course compatibility. OpenCL is supported by Nvidia, AMD, Intel, and others; CUDA is supported only by Nvidia. It's the whole reason for this porting exercise.
The CUDA kernel can be pre-compiled by the developer using nvcc, and then be shipped as a PTX file. The OpenCL kernel typically ships as C-like source, which the driver compiles at run time.
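The pre-compilation step looks something like this (kernel.cu is a hypothetical file name):

```shell
# Compile CUDA source to PTX, which the driver API can load with cuModuleLoad().
nvcc -ptx kernel.cu -o kernel.ptx
```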
The function definition of a CUDA kernel gets prefixed with __global__ whereas an OpenCL kernel uses a __kernel prefix.
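Side by side, with a made-up vector-scale kernel as the example:

```c
// CUDA kernel definition: note the __global__ prefix.
__global__ void scale(float* data, float factor) { /* ... */ }

// OpenCL kernel definition: note the __kernel prefix, and that pointer
// arguments also need an address-space qualifier such as __global.
__kernel void scale(__global float* data, float factor) { /* ... */ }
```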
A CUDA kernel uses blockIdx, blockDim and threadIdx to determine which sub-part of the input data it is supposed to process. An OpenCL kernel uses get_global_id() instead.
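The typical index computation in each dialect:

```c
// CUDA: compute this thread's global index from block and thread indices.
int i = blockIdx.x * blockDim.x + threadIdx.x;

// OpenCL: the runtime hands you the global index directly.
int i = get_global_id(0);
```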
Using 16-bit floating point arithmetic is as easy as #include <cuda_fp16.h> and using __half. In OpenCL you need to check that the device reports the cl_khr_fp16 extension and add #pragma OPENCL EXTENSION cl_khr_fp16 : enable
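A minimal sketch of both (the halve kernel is a made-up example):

```c
// CUDA: include the header and use __half with the half intrinsics.
#include <cuda_fp16.h>
__global__ void halve(__half* v) { v[0] = __hmul(v[0], __float2half(0.5f)); }

// OpenCL: enable the extension first, then half behaves like a normal type.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void halve(__global half* v) { v[0] = v[0] * 0.5h; }
```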
In CUDA, constant input data is prefixed with __constant__, which the host then sets with the cuMemcpyHtoDAsync() function after getting the symbol using the cuModuleGetGlobal() function. In OpenCL, if the constant input data is large, you prefix it with __global const and the host sets it with the clCreateBuffer() and clEnqueueWriteBuffer() functions. (OpenCL also has a __constant qualifier, but the guaranteed size for it is small.)
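A sketch of both paths; names like coeffs, module, stream, ctx and queue are assumptions, and error checking is omitted:

```c
/* CUDA device code: */
__constant__ float coeffs[256];

/* CUDA host code (driver API): look up the symbol, then copy to it. */
CUdeviceptr dptr;
size_t bytes;
cuModuleGetGlobal(&dptr, &bytes, module, "coeffs");
cuMemcpyHtoDAsync(dptr, host_coeffs, bytes, stream);

/* OpenCL kernel code: large constant data is passed as a __global const buffer. */
__kernel void apply(__global const float* coeffs, __global float* out) { /* ... */ }

/* OpenCL host code: create the buffer, fill it, bind it as an argument. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_coeffs, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
```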
Half-float literals need a constructor in CUDA: __half(1.0). OpenCL can use a suffix instead: 1.0h.
Many operators need a function call in CUDA. The half-precision reciprocal in CUDA is hrcp(x), and a simple < on half floats becomes __hlt(a,b), which makes the code less legible.
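The contrast in legibility, as a small sketch (a, b and x are assumed half values):

```c
// CUDA device code: half constants and comparisons go through intrinsics.
__half one = __half(1.0);        // constructor-style literal
__half r   = hrcp(x);            // reciprocal: 1/x
if (__hlt(a, b)) { /* a < b */ }

// OpenCL with cl_khr_fp16 enabled: plain literals and operators work.
half one = 1.0h;
half r   = 1.0h / x;             // or native_recip(x)
if (a < b) { /* ... */ }
```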
A minor difference is that CUDA's atomicAdd() is atomic_add() in OpenCL.
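For example, incrementing a global counter (counter is a made-up buffer name):

```c
// CUDA: atomically add to a value in global memory.
atomicAdd(&counter[0], 1);

// OpenCL: same operation on a 32-bit integer, different name.
atomic_add(&counter[0], 1);
```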
CUDA has assert() in device code, but OpenCL does not.
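In CUDA a failed device-side assert aborts the kernel and reports the file and line. OpenCL offers nothing equivalent; one workaround (an assumption on my part, not a standard idiom) is to record failures in a debug buffer that the host inspects afterwards:

```c
// CUDA device code: assert() works after including assert.h.
#include <assert.h>
__global__ void check(const int* v) { assert(v[0] >= 0); }

// OpenCL: no assert; flag the failure in a buffer the host reads back.
__kernel void check(__global const int* v, __global int* failed)
{
    if (v[0] < 0) failed[0] = 1;
}
```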