Would you tell me why you did not consider CUDA shared memory size?

Hi,

I am learning from the source code, and I came across this line:

Here we allocate shared memory in proportion to the degree of parallelism. However, I also learned that for most GPUs, the maximum shared memory per CUDA SM is around 48 KB. If there are 48 blocks on an SM, each block would get at most 1 KB of shared memory. Would you please tell me why the PyTorch code does not take this into account in the implementation of its operators?

So the maximum number of threads per block is 1024 (per SpatialSoftMax_getBlockSize), and the accscalar type is 32 or 64 bits, so we're using 4 KB–8 KB of shared memory per block. SpatialSoftMax_getLaunchSizes now uses the occupancy API to determine the maximum number of active blocks.
Which part is not considered?
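For context, here is a minimal sketch (not the actual PyTorch code; the kernel name and sizes are made up for illustration) of how the CUDA occupancy API takes per-block dynamic shared memory into account when reporting how many blocks can be resident on an SM:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel standing in for the softmax kernel; the real PyTorch
// kernel and its shared-memory layout differ.
__global__ void softmax_like_kernel(const float* in, float* out, int n) {
    extern __shared__ float smem[];             // dynamic shared memory, sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = (i < n) ? in[i] : 0.f;  // stage data in shared memory
    __syncthreads();
    if (i < n) out[i] = smem[threadIdx.x];      // placeholder for the real reduction
}

int main() {
    int block_size = 1024;                               // cf. SpatialSoftMax_getBlockSize
    size_t smem_per_block = block_size * sizeof(float);  // ~4 KB for a 32-bit accscalar

    // Ask the runtime how many blocks of this kernel can be resident on one SM,
    // given the block size and the dynamic shared memory each block requests.
    int max_active_blocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks, softmax_like_kernel, block_size, smem_per_block);

    std::printf("max active blocks per SM: %d\n", max_active_blocks);
    // Blocks beyond this count are simply not scheduled until resources
    // (including shared memory) free up, so the per-SM shared memory limit
    // is never oversubscribed.
    return 0;
}
```

In other words, the hardware schedules only as many blocks per SM as the shared-memory (and register/thread) budget allows; the remaining blocks of the grid wait their turn.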

Best regards

Thomas

Thanks for replying!!!

Will it be OK if each block's shared memory does not exceed 48 KB, but the total across all blocks on an SM exceeds 48 KB?

By the way, is the number of active blocks determined by the cudaOccupancyMaxPotentialBlockSize method? That method outputs both a recommended block size and a minimum grid size; is only the output grid size used as the number of active blocks, with the block size left unused? Is that the way it works?
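To make the question concrete, here is a small sketch of what cudaOccupancyMaxPotentialBlockSize returns (the dummy kernel is just for illustration); whether SpatialSoftMax_getLaunchSizes uses this call or a different occupancy function is exactly what I am asking about:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Dummy kernel so the occupancy calculator has something to inspect.
__global__ void dummy_kernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = static_cast<float>(i);
}

int main() {
    int min_grid_size = 0;  // grid size needed to reach full occupancy on this device
    int block_size = 0;     // block size that maximizes occupancy for this kernel

    // The API suggests BOTH values; the caller decides which (if either) to use.
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, dummy_kernel,
                                       /*dynamicSMemSize=*/0, /*blockSizeLimit=*/0);

    std::printf("suggested block size: %d, min grid size for full occupancy: %d\n",
                block_size, min_grid_size);
    return 0;
}
```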