Large, time-consuming cudaMemcpyAsync between forward and backward pass in collected PyTorch profiler trace

A large amount of time is spent in cudaMemcpyAsync, related to a Memcpy DtoH (Device -> Pageable) operation, between the forward and backward passes, and I do not know where it comes from.

I noticed this when I used PyTorch profiler.

I am attaching the trace as viewed in Chrome tracing.

I am using the PyTorch data parallel example code available in the PyTorch documentation.

Model: ResNet-18 from the torchvision standard models.
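For context, here is a minimal sketch of how such a trace can be collected and exported for Chrome tracing; the schedule parameters, batch size, and log directory are assumptions for illustration, not taken from the original setup:

```python
import torch
import torchvision

model = torchvision.models.resnet18().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    # Writes trace files that can be opened in TensorBoard or chrome://tracing.
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
) as prof:
    for _ in range(5):
        x = torch.randn(32, 3, 224, 224, device="cuda")
        y = torch.randint(0, 1000, (32,), device="cuda")
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule each iteration
```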

As the profile shows, it comes from aten::item, which is triggered by tensor.item() in Python and synchronizes the host with the GPU.
Remove unneeded item() calls to avoid synchronizing the code: the memcpy itself doesn't take long, but it has to wait for the GPU to finish executing all queued kernels before it can copy the output tensor to the CPU.
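As a minimal sketch of the pattern (the loop and the running_loss accumulation are hypothetical, not taken from the original example): calling loss.item() every iteration forces a blocking DtoH copy, while accumulating the loss on the GPU and reading it once avoids the per-step sync.

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = torch.zeros(1, device="cuda")  # stays on the GPU
for step in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # loss.item() here would trigger a blocking DtoH copy every step;
    # accumulating on the device defers the sync to a single readout.
    running_loss += loss.detach()

print(running_loss.item() / 100)  # single synchronization at the end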


Sorry to piggyback on this topic, but I'm also experiencing a similar issue. However, in my case, I'm not the one calling item(); instead it seems (according to the trace) that linalg_check_errors is calling it after computing the Cholesky decomposition (all my tensors should be on the CUDA device):

This is for some custom code which computes the Cholesky decomposition at every layer, so (at least according to the flamegraph) it forms the major bottleneck in my code. Any insight would be appreciated!

This is also expected as described in the docs:

When inputs are on a CUDA device, this function synchronizes that device with the CPU.
See also torch.linalg.cholesky_ex() for a version of this operation that skips the (slow) error checking by default and instead returns the debug information. This makes it a faster way to check if a matrix is positive-definite.
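A minimal sketch of swapping in cholesky_ex (the batch of SPD matrices here is a made-up example): the error codes come back in a CUDA tensor, so no host sync is forced unless you choose to inspect them.

```python
import torch

A = torch.randn(32, 64, 64, device="cuda")
A = A @ A.transpose(-2, -1) + 1e-3 * torch.eye(64, device="cuda")  # batch of SPD matrices

# torch.linalg.cholesky would check for errors and synchronize with the CPU:
# L = torch.linalg.cholesky(A)

# cholesky_ex skips the check by default; `info` is a CUDA tensor holding
# per-matrix error codes (0 = success), so no host sync is forced.
L, info = torch.linalg.cholesky_ex(A)

# Only read `info` when you actually need to debug; this bool() conversion
# itself triggers a host sync.
if bool((info != 0).any()):
    print("some matrices were not positive-definite")
```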


I understand that the synchronization consumes the time, but why is the consumed time shown in the figure under "cudaMemcpyAsync", which is an async function? My expectation was that another function like cudaSync… would consume the time in the figure. Thanks.

cudaMemcpyAsync will be asynchronous if possible and will fall back to a synchronizing operation if needed, e.g. if pageable host memory is used.
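A minimal sketch of the distinction in PyTorch terms (sizes are arbitrary): a DtoH copy into a pageable host tensor blocks the host, while a copy into a pinned (page-locked) tensor with non_blocking=True can return immediately.

```python
import torch

src = torch.randn(1 << 20, device="cuda")

# Destination in pageable host memory: cudaMemcpyAsync falls back to a
# synchronizing copy, so the host waits for pending GPU work to finish.
pageable_dst = torch.empty(1 << 20)
pageable_dst.copy_(src)

# Destination in pinned host memory: the copy can be truly asynchronous
# and the host continues without waiting.
pinned_dst = torch.empty(1 << 20, pin_memory=True)
pinned_dst.copy_(src, non_blocking=True)

torch.cuda.synchronize()  # wait explicitly before reading pinned_dst
```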

Thanks. Are there other possible cases where cudaMemcpyAsync falls back to a synchronous operation? In this example, I think it is a D2D copy, not the pageable host memory case, so why does it still fall back to a synchronous operation?