torch::empty taking a significant amount of CPU time

I have a model that uses a number of custom CUDA operations. When benchmarking and profiling one of the core modules, I see that in roughly one out of every ten runs, torch::empty takes as much as 200 µs of CPU time per call. Across the 19 calls that happen in one forward pass of my model, that makes up 85% of the CPU time. All of the operations in the module run on the GPU, but in the runs where this happens, these empty-tensor creation calls are the primary bottleneck.
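For reference, here is a minimal sketch of how the per-call host-side cost could be isolated outside the full model. It is not the profiler setup that produced the numbers above. The helper names `per_call_latency_us` and `summarize` are hypothetical, the default of 19 calls per run simply mirrors the count mentioned above, and a `bytearray` allocation stands in for the tensor creation so the snippet runs without a GPU.

```python
import statistics
import time


def per_call_latency_us(fn, calls=19, runs=50, sync=None):
    """Time each call of `fn` on the host; returns one list of latencies (us) per run."""
    results = []
    for _ in range(runs):
        if sync is not None:
            # Optional hook to drain pending device work before a run starts,
            # so earlier queued work is not billed to the first timed call.
            sync()
        latencies = []
        for _ in range(calls):
            start = time.perf_counter_ns()
            fn()
            latencies.append((time.perf_counter_ns() - start) / 1000.0)
        results.append(latencies)
    return results


def summarize(results):
    """Compare a typical run against the worst run and the worst single call."""
    run_totals = [sum(run) for run in results]
    return {
        "median_run_us": statistics.median(run_totals),
        "max_run_us": max(run_totals),
        "max_call_us": max(max(run) for run in results),
    }


# Stand-in workload (assumption): a plain host allocation instead of a tensor.
results = per_call_latency_us(lambda: bytearray(4096))
print(summarize(results))
```

With PyTorch available, `fn` could instead be something like `lambda: torch.empty((1024, 1024), device="cuda")` and `sync` could be `torch.cuda.synchronize`. A large gap between `median_run_us` and `max_run_us` would correspond to the occasional slow runs described above.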

Why is this happening? I expect these to be essentially free. Is there a workaround?

I’m using PyTorch 1.8 with CUDA 11.2 in the standard Megatron container (NGC’s PyTorch 20.12).