Sorry for the late response. I have done some basic profiling with the PyTorch Profiler, and it looks like switching the implementation to cudaMemset does improve the performance of torch.zeros.
I profiled the creation of an fp16 zeros tensor of shape (1024, 1024, 1024) on the GPU as shown below:
import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True,
             profile_memory=True,
             use_cuda=True) as prof:
    with record_function("torch_zeros"):
        zero_tensor = torch.zeros((1024, 1024, 1024),
                                  dtype=torch.float16,
                                  pin_memory=False,
                                  device='cuda')

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
I got the following results for the current implementation:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
torch_zeros 6.35% 19.267ms 99.19% 300.836ms 300.836ms 0.000us 0.00% 2.798ms 2.798ms 0 b 0 b 2.00 Gb 0 b 1
aten::zeros 0.04% 130.736us 92.18% 279.572ms 279.572ms 0.000us 0.00% 2.798ms 2.798ms 0 b 0 b 2.00 Gb 0 b 1
aten::zero_ 0.01% 38.260us 2.17% 6.594ms 6.594ms 0.000us 0.00% 2.798ms 2.798ms 0 b 0 b 0 b 0 b 1
aten::fill_ 0.02% 57.585us 2.16% 6.556ms 6.556ms 2.798ms 100.00% 2.798ms 2.798ms 0 b 0 b 0 b 0 b 1
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 2.798ms 100.00% 2.798ms 2.798ms 0 b 0 b 0 b 0 b 1
torch_zeros 0.00% 0.000us 0.00% 0.000us 0.000us 2.798ms 100.00% 2.798ms 2.798ms 0 b 0 b 0 b 0 b 1
cudaGetDeviceCount 0.00% 1.407us 0.00% 1.407us 0.703us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 2
cudaGetDeviceProperties_v2 0.66% 1.996ms 0.66% 1.996ms 1.996ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
aten::empty 0.04% 108.655us 89.96% 272.847ms 272.847ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 2.00 Gb 2.00 Gb 1
cudaDeviceGetStreamPriorityRange 89.84% 272.472ms 89.84% 272.472ms 272.472ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 303.297ms
Self CUDA time total: 2.798ms
And the following results if I swap the fill kernel with a cudaMemset call:
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
torch_zeros 6.61% 19.356ms 99.57% 291.549ms 291.549ms 0.000us 0.00% 1.579ms 1.579ms 0 b 0 b 2.00 Gb 0 b 1
aten::zeros 0.05% 133.432us 92.29% 270.246ms 270.246ms 0.000us 0.00% 1.579ms 1.579ms 0 b 0 b 2.00 Gb 0 b 1
aten::zero_ 0.01% 38.235us 0.04% 102.532us 102.532us 0.000us 0.00% 1.579ms 1.579ms 0 b 0 b 0 b 0 b 1
aten::fill_ 0.01% 39.738us 0.02% 64.297us 64.297us 1.579ms 100.00% 1.579ms 1.579ms 0 b 0 b 0 b 0 b 1
Memset (Device) 0.00% 0.000us 0.00% 0.000us 0.000us 1.579ms 100.00% 1.579ms 1.579ms 0 b 0 b 0 b 0 b 1
torch_zeros 0.00% 0.000us 0.00% 0.000us 0.000us 1.579ms 100.00% 1.579ms 1.579ms 0 b 0 b 0 b 0 b 1
cudaGetDeviceCount 0.00% 1.285us 0.00% 1.285us 0.643us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 2
cudaGetDeviceProperties_v2 0.66% 1.946ms 0.66% 1.946ms 1.946ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
aten::empty 0.04% 120.588us 92.21% 270.010ms 270.010ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 2.00 Gb 2.00 Gb 1
cudaDeviceGetStreamPriorityRange 92.08% 269.620ms 92.08% 269.620ms 269.620ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 292.815ms
Self CUDA time total: 1.579ms
The absolute time saved is not huge, but the CUDA time drops from 2.798ms to 1.579ms (roughly 1.8x faster), and in my opinion using cudaMemset is also more straightforward and intuitive here.
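For what it's worth, the reason a byte-wise cudaMemset is valid for torch.zeros specifically is that IEEE-754 +0.0 is the all-zero bit pattern at every width, including fp16, so zeroing the raw bytes of the storage yields a correct zero tensor (this would not hold for a general fill_ with a nonzero value, whose bit pattern is not a single repeated byte). A quick stdlib-only check of the encodings:

```python
import struct

# IEEE-754 encodings of +0.0 at various widths. All of them are the
# all-zero byte pattern, which is why memset(ptr, 0, nbytes) on the raw
# storage produces a valid zero tensor for fp16/fp32/fp64 alike.
for fmt, name, width in [("<e", "float16", 2),
                         ("<f", "float32", 4),
                         ("<d", "float64", 8)]:
    encoded = struct.pack(fmt, 0.0)
    assert encoded == b"\x00" * width
    print(f"{name}: {encoded.hex()}")
# float16: 0000
# float32: 00000000
# float64: 0000000000000000
```

This is also why the change is limited to the zero_ path rather than generalizing to fill_.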