(Not sure about the category, sorry.)
I am running training on CUDA and have noticed that the clip_grad_norm function seems to be quite slow. I believe this is because it uses is_nonzero somewhere (I cannot find where), which synchronizes with the CPU. Is there a way to clip the gradient norms without synchronizing with the CPU, to save time?
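For reference, this is a rough sketch of what I am hoping for (assuming the clamp-on-device trick is valid): the clip coefficient is clamped on the GPU instead of being compared in Python, so no value should ever have to travel back to the host. clip_grad_norm_no_sync is just a name I made up, not a torch API.

```python
import torch

def clip_grad_norm_no_sync(parameters, max_norm):
    """Hypothetical helper: clip gradients without reading any value on the CPU."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return None
    # Total 2-norm over all gradients, computed entirely on the device.
    total_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    )
    # clamp(max=1.0) replaces a data-dependent `if clip_coef < 1:` branch,
    # so nothing has to be copied back to the host to make the decision.
    clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(clip_coef)  # in-place scale, still fully on the GPU
    return total_norm
```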
Hey @PhysicsGaunt, would you be able to share which tool was used to gather that flame graph?
I understand you are more interested in why the device synchronization took so long compared to the actual compute, but I would expect that to be a limitation common to all kernels: if there is enough data for the GPU to munch through, the behavior you are seeing should be just a slight overhead at best.
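One way to check whether a host sync actually happens inside the call (assuming a reasonably recent PyTorch build with CUDA) is the sync debug mode; this is just a sketch, not taken from your setup:

```python
import torch

# Warn on every operation that forces the CPU to wait for the GPU.
torch.cuda.set_sync_debug_mode("warn")

model = torch.nn.Linear(1024, 1024).cuda()
model(torch.randn(64, 1024, device="cuda")).sum().backward()

# If this call goes through is_nonzero()/_local_scalar_dense() on your
# version, a "synchronizing CUDA operation" warning should fire here.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

torch.cuda.set_sync_debug_mode("default")
```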
It would be great if the driver code could be shared/explained, as in what kind of params were passed to clip_grad_norm_() to get the above results.
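Something along these lines would already be enough (a made-up driver, just to illustrate the kind of detail that matters: how many parameter tensors there are, their sizes, and the exact arguments passed):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Made-up model; the real parameter count and tensor sizes matter a lot here.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(4)]).cuda()
model(torch.randn(32, 1024, device="cuda")).sum().backward()

# Profile just the clipping step, CPU and CUDA side, to reproduce the flame graph.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```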
It looks like this is the is_nonzero() that gets invoked: pytorch/aten/src/ATen/native/prim_native_functions.cpp at main · pytorch/pytorch · GitHub. It then calls _local_scalar_dense(), which waits for the sync to happen here: pytorch/aten/src/ATen/native/cuda/CUDAScalar.cu at main · pytorch/pytorch · GitHub.
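In other words, anything that turns a single-element CUDA tensor into a Python value ends up in _local_scalar_dense() and has to block until the GPU has caught up; a quick illustration (assuming a CUDA device is available):

```python
import torch

x = torch.randn(1024, device="cuda").sum()  # single-element CUDA tensor

flag = x.is_nonzero()   # explicit is_nonzero() -> _local_scalar_dense() -> sync
flag = bool(x > 0)      # Tensor.__bool__ goes through the same path
value = x.item()        # .item() likewise copies the scalar back to the host
```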