Weird CUDA sync behavior involving MSELoss

I am trying to do some asynchronous processing, where part of an algorithm is executed on an accelerator and some further processing is then done on the GPU asynchronously.

I am seeing some weird behavior involving F.mse_loss: by commenting out lines one at a time, I narrowed it down to this call, which seems to cause a synchronization (defeating my goal of running the processing asynchronously).

This is solved simply by replacing the line

loss = F.mse_loss(y_hat, target)

with

loss = ((y_hat - target) ** 2).mean()

y_hat and target are CUDA tensors of shape (batch_size, 1).

I only call

loss.backward()
optimizer.step()
optimizer.zero_grad()

afterward and the sync issue goes away if I write the MSE explicitly.
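
For context, a stripped-down version of the training step looks roughly like this; the model, optimizer, and shapes below are placeholders, not the actual proprietary code:

import torch
import torch.nn.functional as F

# Placeholder model and optimizer; the real ones are proprietary.
model = torch.nn.Linear(50, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(x, target):
    y_hat = model(x)

    # Variant 1: seems to trigger a synchronization in my setup
    # loss = F.mse_loss(y_hat, target)

    # Variant 2: written out explicitly, no synchronization observed
    loss = ((y_hat - target) ** 2).mean()

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

x = torch.randn(128, 50, device="cuda")
target = torch.randn(128, 1, device="cuda")
train_step(x, target)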

I have tried to reproduce the problem in an isolated way, but timing F.mse_loss alone does not show this behavior.

Is there any reason why a synchronization might be happening?

How are you detecting the synchronization?
Are you seeing some syncs in nvprof or Nsight?
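
If you are on a recent enough PyTorch build, torch.cuda.set_sync_debug_mode can also flag implicit synchronizations directly; a minimal sketch with made-up tensors:

import torch
import torch.nn.functional as F

# Warn (or raise, with "error") whenever an operation forces the host to
# synchronize with the GPU; available in recent PyTorch releases.
torch.cuda.set_sync_debug_mode("warn")

y_hat = torch.randn(128, 1, device="cuda")
target = torch.randn(128, 1, device="cuda")

loss = F.mse_loss(y_hat, target)  # should print a warning if this call syncs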

I have added a dummy operation that makes the running time increase noticeably when the work is not done asynchronously:

for i in range(3):
    torch.mm(input, input.t())

The input tensor has shape (3000, 50000). Unfortunately the code contains some proprietary stuff, so I can’t share it directly; I’ll see if I can make a version I can post here. And I’ll run nvprof.
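
Roughly, the timing check looks like this (a simplified sketch with random tensors standing in for the real data):

import time
import torch
import torch.nn.functional as F

input = torch.randn(3000, 50000, device="cuda")
y_hat = torch.randn(128, 1, device="cuda")
target = torch.randn(128, 1, device="cuda")

t0 = time.time()
loss = F.mse_loss(y_hat, target)   # the suspected sync point
for i in range(3):
    torch.mm(input, input.t())     # dummy work queued behind it
host = (time.time() - t0) * 1e3
# If everything is queued asynchronously, the host-side time should be tiny;
# a large value here suggests something forced a synchronization.
print("host time: %.1f ms" % host)

torch.cuda.synchronize()           # now actually wait for the GPU
print("total time: %.1f ms" % ((time.time() - t0) * 1e3))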

I have issues with nvprof and I’m not a sysadmin, so it will take some time to fix that, but I managed to reproduce the issue on a dual-GPU machine (2 x V100).

Hope this helps in figuring this out!

I have tried to reproduce this in an even simpler setting, but apparently it happens only when using two GPUs. Doing all the operations on a single GPU gives the same timing for both F.mse_loss and the explicit version written in terms of difference, power, and mean.
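
For reference, the kind of two-GPU comparison I have been timing looks roughly like this; the shapes and device placement are only illustrative, not the actual code:

import time
import torch
import torch.nn.functional as F

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

# Loss tensors on the second GPU, dummy work on the first, mimicking the
# dual-GPU setup where the problem shows up.
y_hat = torch.randn(128, 1, device=dev1)
target = torch.randn(128, 1, device=dev1)
dummy = torch.randn(3000, 50000, device=dev0)

def timed(loss_fn):
    torch.cuda.synchronize(dev0)   # make sure both devices start idle
    torch.cuda.synchronize(dev1)
    t0 = time.time()
    loss = loss_fn(y_hat, target)  # loss on cuda:1
    for i in range(3):
        torch.mm(dummy, dummy.t()) # dummy work on cuda:0
    host = time.time() - t0        # host-side launch time
    torch.cuda.synchronize(dev0)
    torch.cuda.synchronize(dev1)
    total = time.time() - t0       # time including GPU execution
    return host, total

print("mse_loss:", timed(F.mse_loss))
print("explicit:", timed(lambda a, b: ((a - b) ** 2).mean()))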