Hi
I’m trying to parallelize a fairly large encoder-decoder model: the input data lives on GPU 0 and is fed to the encoder, then I transfer the latent code to GPU 1, feed it to the decoder, and compute the losses on GPU 1.
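To make the setup concrete, here is a simplified sketch of the pipeline (the Linear modules are just stand-ins for my real encoder/decoder; it falls back to CPU when two GPUs aren’t available):

```python
import torch

# Simplified sketch of the two-GPU pipeline; the Linear modules below
# are placeholders for the actual encoder and decoder.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() > 1 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

encoder = torch.nn.Linear(16, 8).to(dev0)   # encoder lives on GPU 0
decoder = torch.nn.Linear(8, 16).to(dev1)   # decoder lives on GPU 1

x = torch.randn(4, 16, device=dev0)         # input batch on GPU 0
z = encoder(x)                              # latent code, still on GPU 0
z = z.to(dev1)                              # transfer latent code to GPU 1
out = decoder(z)                            # decode on GPU 1
loss = out.pow(2).mean()                    # losses computed on GPU 1
```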
One particular loss is implemented as a torch.autograd.Function, and something triggers a device-side assert (out-of-bounds indices) in this snippet:
batchV = V.view((-1, 3))
# Compute half cotangents and double the triangle areas
C, TwoA = half_cotangent(V, faces)
batchC = C.view((-1, 3))
# Adjust face indices to stack:
offset = torch.arange(V.shape[0], device=V.device).view((-1, 1, 1)) * V.shape[1]
# Add the offset to the faces passed as parameters and save in a different tensor
F = faces + offset
batchF = F.view((-1, 3))
# import ipdb; ipdb.set_trace()
# Fails here if not run with CUDA_LAUNCH_BLOCKING=1
rows = batchF[:, [1, 2, 0]].view(1, -1)  # 1, 2, 0 i.e. to vertex 2-3 associate cot(23)
cols = batchF[:, [2, 0, 1]].view(1, -1)
The code runs fine for 2 samples, then the device-side assert fires on the third. The dataloader shuffles the samples, yet it consistently fails on the third one.
Debugging in pdb gives me the following traceback:
Traceback (most recent call last):
File "train.py", line 593, in <module>
exp_flag,
File "train.py", line 524, in compute_losses_real
exp_real_and,
File "<thefile>.py", line 23, in __call__
Lx = self.laplacian.apply(V, self.F[mask])
File "<thefile>.py", line 64, in forward
rows = batchF[:, [1, 2, 0]].view(
RuntimeError: size is inconsistent with indices: for dim 0, size is 7380 but found index 4684893058448109737
I then ran the exact same code with CUDA_LAUNCH_BLOCKING=1, and it doesn’t crash; the loss decreases normally.
What I already tried:

- In case this might be related, I disabled pinned memory and non-blocking data transfers from host to GPU, but the problem persists.
- I added torch.cuda.synchronize() right above rows = batchF[:, [1, 2, 0]].view(1, -1), with no success.
- This code works fine when the model is on a single GPU.
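For the first point, this is roughly what I changed (simplified; TensorDataset here is just a stand-in for my real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 16))   # stand-in for the real dataset
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Pinned memory disabled on the loader:
loader = DataLoader(dataset, batch_size=4, pin_memory=False)

for (x,) in loader:
    # Plain synchronous host-to-device copy (non_blocking turned off):
    x = x.to(device, non_blocking=False)
```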
Any help would be much appreciated!