Model parallel issue that disappears with CUDA_LAUNCH_BLOCKING=1


I’m trying to parallelize a somewhat large encoder-decoder model: the input data sits on GPU 0 and is fed into the encoder, then I transfer the latent code to GPU 1, feed it to the decoder, and compute the losses on GPU 1.
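For context, the setup looks roughly like this (module names and sizes are made up; with fewer than two GPUs the sketch falls back to a single device):

```python
import torch
import torch.nn as nn

# Minimal model-parallel sketch: encoder on one device, decoder on another.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

encoder = nn.Linear(16, 8).to(dev0)   # stands in for the real encoder
decoder = nn.Linear(8, 16).to(dev1)   # stands in for the real decoder

x = torch.randn(4, 16, device=dev0)
latent = encoder(x)
recon = decoder(latent.to(dev1))      # transfer the latent code across devices
loss = (recon - x.to(dev1)).pow(2).mean()  # loss lives on dev1
loss.backward()                       # autograd routes gradients back to dev0
```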

One particular loss is implemented as a torch.autograd.Function, and something in the following snippet triggers a device-side assert about out-of-bounds indices:

batchV = V.view((-1, 3))

# Compute half cotangents and double the triangle areas
C, TwoA = half_cotangent(V, faces)

batchC = C.view((-1, 3))

# Adjust face indices to stack:
offset = torch.arange(V.shape[0], device=V.device).view((-1, 1, 1)) * V.shape[1]
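(The offset turns per-mesh vertex indices into indices into the flattened (B*Nv, 3) vertex tensor; a toy example with made-up values:)

```python
import torch

# Toy batch: B=2 meshes, Nv=4 vertices each, one triangular face per mesh
V = torch.randn(2, 4, 3)                          # (B, Nv, 3)
faces = torch.tensor([[[0, 1, 2]], [[1, 2, 3]]])  # (B, F, 3), per-mesh indices

batchV = V.view(-1, 3)                            # (B*Nv, 3)
offset = torch.arange(V.shape[0]).view(-1, 1, 1) * V.shape[1]
batchF = (faces + offset).view(-1, 3)             # now indexes into batchV
# mesh 1's face [1, 2, 3] becomes [5, 6, 7] after the +4 offset
```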

# Add the offset to the faces passed as parameters and save in a different tensor
F = faces + offset
batchF = F.view((-1, 3))

# import ipdb; ipdb.set_trace()
# Fails here if not run with CUDA_LAUNCH_BLOCKING=1
rows = batchF[:, [1, 2, 0]].view(
    1, -1
)  # 1,2,0 i.e. to vertex 2-3 associate cot(23)
cols = batchF[:, [2, 0, 1]].view(
    1, -1
)
The code runs fine for two samples, then on the third I get the device-side assert. The dataloader shuffles the samples, yet it consistently fails on the third one.

Debugging in pdb gives me the following traceback:

Traceback (most recent call last):
  File "", line 593, in <module>
  File "", line 524, in compute_losses_real
  File "<thefile>.py", line 23, in __call__
    Lx = self.laplacian.apply(V, self.F[mask])
  File "<thefile>.py", line 64, in forward
    rows = batchF[:, [1, 2, 0]].view(
RuntimeError: size is inconsistent with indices: for dim 0, size is 7380 but found index 4684893058448109737

I then ran the exact same code with CUDA_LAUNCH_BLOCKING=1: it doesn’t crash, and the loss decreases.
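A cheaper way to localize a bad index than a global CUDA_LAUNCH_BLOCKING=1 run is a host-side bounds check right before the fancy indexing; .item() copies to the host and implicitly synchronizes the device, so a failure there points at this op rather than at some later kernel launch. A sketch (the helper name is made up):

```python
import torch

def check_faces_in_bounds(batchF, num_verts):
    # .item() forces a device synchronization, so an assertion failure
    # here localizes the bad indexing op instead of surfacing
    # asynchronously at an unrelated later launch.
    lo, hi = batchF.min().item(), batchF.max().item()
    assert 0 <= lo and hi < num_verts, \
        f"face index range [{lo}, {hi}] out of bounds for {num_verts} vertices"

batchF = torch.tensor([[1, 2, 0], [4, 5, 3]])
check_faces_in_bounds(batchF, 6)  # all indices in [0, 6), so this passes
```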

What I already tried:

  • In case this might be related, I disabled pinned memory and non-blocking data transfers from host to GPU, but the problem persists.

  • I added a torch.cuda.synchronize() right above the rows = batchF[:, [1, 2, 0]] line, with no success.

  • This code works fine when the model is on a single GPU.
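One thing I’m unsure about: torch.cuda.synchronize() with no argument only waits on the *current* device, so with the model split across two GPUs it may be worth synchronizing every device explicitly, e.g.:

```python
import torch

def synchronize_all_devices():
    # torch.cuda.synchronize() defaults to the current device only;
    # with the encoder on cuda:0 and the decoder on cuda:1, wait on
    # both explicitly. This is a no-op on CPU-only machines.
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(torch.device(f"cuda:{d}"))
```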

Any help would be much appreciated!


It does look like we’re missing a synchronization point…
Could you provide a small code sample that triggers this issue so that we can reproduce locally please? :slight_smile:

Hi, would you mind providing your complete code? From the snippet you provided, it is hard to tell the root cause.