Replacing torch.zeros internals with cudaMemset instead of fill kernel

Sorry for the late response. I have done some basic profiling with the PyTorch profiler, and it looks like switching the implementation to cudaMemset does improve the performance of torch.zeros.

I profiled the creation of an fp16 zeros tensor of shape (1024, 1024, 1024) on the GPU, as shown below:

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# CUDA timing is enabled via ProfilerActivity.CUDA; the deprecated
# use_cuda flag is not needed.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True,
             profile_memory=True) as prof:
    with record_function("torch_zeros"):
        zero_tensor = torch.zeros((1024, 1024, 1024),
                                  dtype=torch.float16,
                                  device='cuda')

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

I got the following results for the current implementation:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            torch_zeros         6.35%      19.267ms        99.19%     300.836ms     300.836ms       0.000us         0.00%       2.798ms       2.798ms           0 b           0 b       2.00 Gb           0 b             1  
                                            aten::zeros         0.04%     130.736us        92.18%     279.572ms     279.572ms       0.000us         0.00%       2.798ms       2.798ms           0 b           0 b       2.00 Gb           0 b             1  
                                            aten::zero_         0.01%      38.260us         2.17%       6.594ms       6.594ms       0.000us         0.00%       2.798ms       2.798ms           0 b           0 b           0 b           0 b             1  
                                            aten::fill_         0.02%      57.585us         2.16%       6.556ms       6.556ms       2.798ms       100.00%       2.798ms       2.798ms           0 b           0 b           0 b           0 b             1  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.798ms       100.00%       2.798ms       2.798ms           0 b           0 b           0 b           0 b             1  
                                            torch_zeros         0.00%       0.000us         0.00%       0.000us       0.000us       2.798ms       100.00%       2.798ms       2.798ms           0 b           0 b           0 b           0 b             1  
                                     cudaGetDeviceCount         0.00%       1.407us         0.00%       1.407us       0.703us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                             cudaGetDeviceProperties_v2         0.66%       1.996ms         0.66%       1.996ms       1.996ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                            aten::empty         0.04%     108.655us        89.96%     272.847ms     272.847ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b       2.00 Gb       2.00 Gb             1  
                       cudaDeviceGetStreamPriorityRange        89.84%     272.472ms        89.84%     272.472ms     272.472ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 303.297ms
Self CUDA time total: 2.798ms

And the following results after swapping the fill kernel for a cudaMemset call:

------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         torch_zeros         6.61%      19.356ms        99.57%     291.549ms     291.549ms       0.000us         0.00%       1.579ms       1.579ms           0 b           0 b       2.00 Gb           0 b             1  
                         aten::zeros         0.05%     133.432us        92.29%     270.246ms     270.246ms       0.000us         0.00%       1.579ms       1.579ms           0 b           0 b       2.00 Gb           0 b             1  
                         aten::zero_         0.01%      38.235us         0.04%     102.532us     102.532us       0.000us         0.00%       1.579ms       1.579ms           0 b           0 b           0 b           0 b             1  
                         aten::fill_         0.01%      39.738us         0.02%      64.297us      64.297us       1.579ms       100.00%       1.579ms       1.579ms           0 b           0 b           0 b           0 b             1  
                     Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.579ms       100.00%       1.579ms       1.579ms           0 b           0 b           0 b           0 b             1  
                         torch_zeros         0.00%       0.000us         0.00%       0.000us       0.000us       1.579ms       100.00%       1.579ms       1.579ms           0 b           0 b           0 b           0 b             1  
                  cudaGetDeviceCount         0.00%       1.285us         0.00%       1.285us       0.643us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
          cudaGetDeviceProperties_v2         0.66%       1.946ms         0.66%       1.946ms       1.946ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                         aten::empty         0.04%     120.588us        92.21%     270.010ms     270.010ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b       2.00 Gb       2.00 Gb             1  
    cudaDeviceGetStreamPriorityRange        92.08%     269.620ms        92.08%     269.620ms     269.620ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 292.815ms
Self CUDA time total: 1.579ms
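As a cross-check independent of the profiler tables, wall-clock timing with torch.cuda.Event points the same way. A minimal sketch (the helper name, shape, and iteration count are mine; it needs a CUDA device and returns None without one):

```python
import torch

def time_zeros_ms(shape=(1024, 1024, 1024), dtype=torch.float16, iters=10):
    """Median wall-clock time of torch.zeros on the GPU, in milliseconds.

    Returns None when no CUDA device is present.
    """
    if not torch.cuda.is_available():
        return None
    # Warm up so context creation and allocator growth are excluded.
    torch.zeros(shape, dtype=dtype, device="cuda")
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch.zeros(shape, dtype=dtype, device="cuda")
        end.record()
        # elapsed_time requires both events to have completed.
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# Smaller shape for a quick demo; the measurements above use (1024, 1024, 1024).
print(time_zeros_ms(shape=(1024, 1024, 16)))
```

CUDA events time only the device-side work between record() calls, so this isolates the fill/memset from the one-time context-setup cost that dominates the CPU columns in the tables above.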

The absolute time saved is modest, but the memset path cuts the device-side fill from 2.798 ms to 1.579 ms (roughly 1.8x faster) and also trims the host-side overhead (aten::fill_ drops from 6.556 ms to 64 us of CPU total). In my opinion, using cudaMemset is also a lot more straightforward and intuitive here.
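For what it's worth, lowering zero_() to a byte-wise memset is only valid because the all-zero byte pattern encodes numeric zero in the relevant dtypes: IEEE floats (including fp16) and two's-complement integers. A quick stdlib check of that assumption, no PyTorch needed:

```python
import struct

# cudaMemset writes one byte value to every byte of the buffer, so a
# memset-to-zero produces value-zero tensors exactly when all-zero bytes
# decode to zero in the target dtype.
assert struct.unpack('<e', b'\x00' * 2)[0] == 0.0   # float16: +0.0
assert struct.unpack('<f', b'\x00' * 4)[0] == 0.0   # float32: +0.0
assert struct.unpack('<i', b'\x00' * 4)[0] == 0     # int32 (two's complement)
assert struct.unpack('<?', b'\x00')[0] is False     # bool
```

The same argument would not hold for, say, ones_like-style fills or non-zero scalar fills, which is why those still need the elementwise kernel.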