(Not sure about the category, sorry.)
I am running training on CUDA and have noticed that the clip_grad_norm function seems to be quite slow. I believe this is because it uses is_nonzero somewhere (I cannot find where), which synchronizes with the CPU. Is there a way to clip the gradient norms without synchronizing with the CPU, to save time?
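For reference, this is a rough sketch of what I am hoping for (assuming the clamp-on-device trick is valid): the clip coefficient is clamped on the GPU instead of being compared in Python, so no value should ever have to travel back to the host. clip_grad_norm_no_sync is just a name I made up, not a torch API.

```python
import torch

def clip_grad_norm_no_sync(parameters, max_norm):
    """Hypothetical helper: clip gradients without reading any value on the CPU."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return None
    # Total 2-norm over all gradients, computed entirely on the device.
    total_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    )
    # clamp(max=1.0) replaces a data-dependent `if clip_coef < 1:` branch,
    # so nothing has to be copied back to the host to make the decision.
    clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(clip_coef)  # in-place scale, still fully on the GPU
    return total_norm
```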
Hey @PhysicsGaunt, would you be able to share which tool was used to gather that flame graph?
I understand you are more interested in why the device synchronization took so long compared to the actual compute, but I would expect that to be a limitation common to all kernels: if there is enough data for the GPU to munch through, the behavior you are seeing should be just a slight overhead at best.
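One way to check whether a host sync actually happens inside the call (assuming a reasonably recent PyTorch build with CUDA) is the sync debug mode; this is just a sketch, not taken from your setup:

```python
import torch

# Warn on every operation that forces the CPU to wait for the GPU.
torch.cuda.set_sync_debug_mode("warn")

model = torch.nn.Linear(1024, 1024).cuda()
model(torch.randn(64, 1024, device="cuda")).sum().backward()

# If this call goes through is_nonzero()/_local_scalar_dense() on your
# version, a "synchronizing CUDA operation" warning should fire here.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

torch.cuda.set_sync_debug_mode("default")
```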
It would be great if the driver code could be shared/explained, as in what kind of params were passed to clip_grad_norm_() to get the above results.
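Something along these lines would already be enough (a made-up driver, just to illustrate the kind of detail that matters: how many parameter tensors there are, their sizes, and the exact arguments passed):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Made-up model; the real parameter count and tensor sizes matter a lot here.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(4)]).cuda()
model(torch.randn(32, 1024, device="cuda")).sum().backward()

# Profile just the clipping step, CPU and CUDA side, to reproduce the flame graph.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```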
It looks like this is the is_nonzero() that gets invoked: pytorch/aten/src/ATen/native/prim_native_functions.cpp at main · pytorch/pytorch · GitHub. It then calls _local_scalar_dense(), which waits for the sync to happen here: pytorch/aten/src/ATen/native/cuda/CUDAScalar.cu at main · pytorch/pytorch · GitHub.
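In other words, anything that turns a single-element CUDA tensor into a Python value ends up in _local_scalar_dense() and has to block until the GPU has caught up; a quick illustration (assuming a CUDA device is available):

```python
import torch

x = torch.randn(1024, device="cuda").sum()  # single-element CUDA tensor

flag = x.is_nonzero()   # explicit is_nonzero() -> _local_scalar_dense() -> sync
flag = bool(x > 0)      # Tensor.__bool__ goes through the same path
value = x.item()        # .item() likewise copies the scalar back to the host
```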