Weird CUDA sync behavior involving MSELoss

I am trying to do some asynchronous processing, where part of an algorithm is executed on an accelerator and some further processing is then done on the GPU asynchronously.

I am seeing some weird behavior involving F.mse_loss: by commenting out lines one at a time, I narrowed it down to this call, which seems to cause a synchronization (defeating my goal of running the processing asynchronously).

This is solved simply by replacing the line

loss = F.mse_loss(y_hat, target)

with

loss = ((y_hat - target) ** 2).mean()

y_hat and target are CUDA tensors of shape (batch_size, 1).

I only call

loss.backward()
optimizer.step()
optimizer.zero_grad()

afterward and the sync issue goes away if I write the MSE explicitly.
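
For context, a stripped-down version of the training step looks roughly like this; the model, optimizer, and shapes below are placeholders, not the actual proprietary code:

import torch
import torch.nn.functional as F

# Placeholder model and optimizer; the real ones are proprietary.
model = torch.nn.Linear(50, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(x, target):
    y_hat = model(x)

    # Variant 1: seems to trigger a synchronization in my setup
    # loss = F.mse_loss(y_hat, target)

    # Variant 2: written out explicitly, no synchronization observed
    loss = ((y_hat - target) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

x = torch.randn(128, 50, device="cuda")
target = torch.randn(128, 1, device="cuda")
train_step(x, target)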

I have tried to reproduce the problem in an isolated way, but timing F.mse_loss alone does not show this behavior.

Is there any reason why a synchronization might be happening?

How are you detecting the synchronization?
Are you seeing some syncs in nvprof or Nsight?
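
If you are on a recent enough PyTorch build, torch.cuda.set_sync_debug_mode can also flag implicit synchronizations directly; a minimal sketch with made-up tensors:

import torch
import torch.nn.functional as F

# Warn (or raise, with "error") whenever an operation forces the host to
# synchronize with the GPU; available in recent PyTorch releases.
torch.cuda.set_sync_debug_mode("warn")

y_hat = torch.randn(128, 1, device="cuda")
target = torch.randn(128, 1, device="cuda")

loss = F.mse_loss(y_hat, target)  # should print a warning if this call syncs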

I have added a dummy operation that makes the running time increase noticeably when the work is not done asynchronously:

for i in range(3):
    torch.mm(input, input.t())

The input tensor has shape (3000, 50000). Unfortunately the code contains some proprietary stuff, so I can’t share it directly; I’ll see if I can make a version I can post here. And I’ll run nvprof.
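
Roughly, the timing check looks like this (a simplified sketch with random tensors standing in for the real data):

import time
import torch
import torch.nn.functional as F

input = torch.randn(3000, 50000, device="cuda")
y_hat = torch.randn(128, 1, device="cuda")
target = torch.randn(128, 1, device="cuda")

t0 = time.time()
loss = F.mse_loss(y_hat, target)   # the suspected sync point
for i in range(3):
    torch.mm(input, input.t())     # dummy work queued behind it
host = (time.time() - t0) * 1e3
# If everything is queued asynchronously, the host-side time should be tiny;
# a large value here suggests something forced a synchronization.
print("host time: %.1f ms" % host)

torch.cuda.synchronize()           # now actually wait for the GPU
print("total time: %.1f ms" % ((time.time() - t0) * 1e3))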

I have issues with nvprof and I’m not a sysadmin, so it will take some time to fix that, but I managed to reproduce the issue on a dual-GPU machine (2 x V100).

Hope this helps in figuring this out!

I have tried to reproduce this in an even simpler setting, but apparently it happens only when using two GPUs. Doing all the operations on a single GPU gives the same timing for both F.mse_loss and the explicit version written in terms of difference, power, and mean.
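
For reference, the kind of two-GPU comparison I have been timing looks roughly like this; the shapes and device placement are only illustrative, not the actual code:

import time
import torch
import torch.nn.functional as F

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

# Loss tensors on the second GPU, dummy work on the first, mimicking the
# dual-GPU setup where the problem shows up.
y_hat = torch.randn(128, 1, device=dev1)
target = torch.randn(128, 1, device=dev1)
dummy = torch.randn(3000, 50000, device=dev0)

def timed(loss_fn):
    torch.cuda.synchronize(dev0)   # make sure both devices start idle
    torch.cuda.synchronize(dev1)
    t0 = time.time()
    loss = loss_fn(y_hat, target)  # loss on cuda:1
    for i in range(3):
        torch.mm(dummy, dummy.t()) # dummy work on cuda:0
    host = time.time() - t0        # host-side launch time
    torch.cuda.synchronize(dev0)
    torch.cuda.synchronize(dev1)
    total = time.time() - t0       # time including GPU execution
    return host, total

print("mse_loss:", timed(F.mse_loss))
print("explicit:", timed(lambda a, b: ((a - b) ** 2).mean()))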