CUDA terminology can overwhelm you. SM, Warp, Core, Thread, Block, Scheduler, Grid. I have been writing CUDA kernels for my RTX 3070, and I thought I would write down what I learned.
First things first: when you write a compute kernel, you write scalar code. So in my case, doing Photon Mapping, my kernel code handles a single ray, and one primitive at a time.
Compare this with writing SIMD CPU code: you explicitly work on arrays, writing vector code, not scalar code. In single-precision floating point operations, that means a vector of 8 (AVX) values at a time, or 16 (AVX512) values at a time.
This does not mean that the hardware executes this as scalar code, though. The GPU will execute your code in batches of 32. This batch is called a warp. Theoretically, warpsizes could be something different than 32, but for GeForces, they are always 32. So in a way, what the CPU does explicitly, 8- or 16-wide, the GPU does 32-wide, implicitly. Warps are executed (by 32 CUDA cores) in sync, much like the lanes of a SIMD register in a CPU. (Caveat emptor: only half the CUDA cores of Ampere GPUs can do integer operations, so throughput is halved for that.)
The workhorses inside an Ampere GPU, are the Steaming Multiprocessors, or SM for short. An Ampere GPU, like the RTX 3070, has SMs that can execute 4 of these warps at the same time. To do this, it has 4 schedulers per SM.
Let's go back to the software side: your kernel. If you have a million photons that you want to trace, your kernel will be executed a million times. In CUDA terminology, there are a million threads of execution. And threads are grouped in blocks. All the blocks together, is the grid. And every block will be assigned to an SM for computation.
When a warp is scheduled and running, the 32 threads in a warp could diverge. If one of the threads is blocked waiting on something, the whole warp is blocked. This is bad. But luckily, the scheduler will intervene, and switch out the warp for a non-blocked warp. Each of the 4 schedulers in a SM can keep up to 12 warps in flight. Often there will be at least one warp ready to run. The net effect is that the latencies are hidden.
When a scheduler has not a single warp that is ready to advance, compute throughput is lost. The NSIGHT Compute profiling tool can detect how often this happens for your kernel.
Even though your kernel is written in scalar form, each iteration of your kernel is still responsible for run-time selecting the right work! That is why nearly every CUDA program will contain the following code:
const uint32_t index = blockIdx.x * blockDim.x + threadIdx.x;
foo = bar[ index ];
The upshot of all this, is that if your kernel does a = b + c then, provided the memory bandwidth is there, then every SM executes 4x32 = 128 instances of this, and with the 48 SMs of the 3070, that means 6144 instances. So 6144 b values are added to 6144 c values and assigned to 6144 a values.
Note that the best I could hope for on my Xeon W2140B workstation with AVX512 is 8 cores each computing 16 floats, which is 128 instances (or possibly double that, if both execution units can be kept fed.)
Let me conclude by showing what you can do with all the compute power. Here is a scene that is globally illuminated by 2M photons that bounced once. I can compute this in a handfull of milliseconds on the rtx 3070. Not bad!