Why does the implementation not take the CUDA shared memory size into account?


I am studying the source code and came across this line:

Here the amount of shared memory allocated scales with the degree of parallelism. However, I have also learned that on most GPUs the maximum shared memory per CUDA SM is around 48 KB. If 48 blocks were resident on one SM, each block could use at most 1 KB of shared memory. Could you please tell me why the PyTorch code does not account for this in its operator implementations?

So the maximum number of threads in a block is 1024 (per SpatialSoftMax_getBlockSize), and accscalar will be 32 or 64 bits wide, so we're using 4-8 KB of shared memory per block (1024 threads × 4 or 8 bytes). SpatialSoftMax_getLaunchSizes now uses the occupancy API to determine the maximum number of active blocks.
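The arithmetic above can be sketched in plain C++. This is only a rough model, not the occupancy API and not PyTorch code: the function names `shmem_per_block` and `resident_blocks_per_sm` are made up for illustration, the 48 KB per-SM budget is the figure from the question, and the 2048 threads per SM is an assumed limit that differs between GPU architectures. Registers and the hardware cap on blocks per SM are ignored.

```cpp
#include <algorithm>
#include <cstddef>

// Shared memory a block needs when it keeps one accumulator per thread.
// acc_bytes is 4 for a 32-bit accscalar and 8 for a 64-bit one.
constexpr std::size_t shmem_per_block(std::size_t threads_per_block,
                                      std::size_t acc_bytes) {
    return threads_per_block * acc_bytes;
}

// Simplified model (hypothetical helper): how many blocks can be resident
// on one SM at the same time, limited only by the shared memory budget and
// the thread budget. Assumes threads_per_block > 0 and acc_bytes > 0.
constexpr std::size_t resident_blocks_per_sm(std::size_t shmem_per_sm,
                                             std::size_t max_threads_per_sm,
                                             std::size_t threads_per_block,
                                             std::size_t acc_bytes) {
    return std::min(
        shmem_per_sm / shmem_per_block(threads_per_block, acc_bytes),
        max_threads_per_sm / threads_per_block);
}

// 1024 threads with a 32-bit accumulator: 4 KB per block.
static_assert(shmem_per_block(1024, 4) == 4096, "4 KB per block");
// 1024 threads with a 64-bit accumulator: 8 KB per block.
static_assert(shmem_per_block(1024, 8) == 8192, "8 KB per block");

// Assumed budgets: 48 KB of shared memory and 2048 threads per SM.
// Shared memory alone would allow 12 blocks, the thread budget only 2.
static_assert(resident_blocks_per_sm(48 * 1024, 2048, 1024, 4) == 2,
              "thread budget is the binding limit");
```

Under these assumed figures the thread budget, not shared memory, limits how many 1024-thread blocks are resident on an SM at once; any further blocks in the grid simply wait until a resident block finishes.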
Which part is not considered?

Best regards


Thanks for replying!

Will I be OK if no single block exceeds 48 KB, but the blocks on an SM together exceed 48 KB?

By the way, is the number of active blocks determined by the CUDA cudaOccupancyMaxPotentialBlockSize function? It outputs both a recommended block size and a grid size, and only the grid size is used, as the number of active blocks, while the block size is not used. Is that how it works?