Multi-GPU training hangs due to an `if`

Hi,

I recently discovered that my 8-GPU training will hang if I have this `if` (using DDP; all GPUs saturate at 100%, and it happens randomly at some epoch in the middle of a job):

(a modification of `backproject()` from Atlas/model.py at master · magicleap/Atlas · GitHub)

    ...
    volume = torch.zeros(
        batch, channels, nx * ny * nz, dtype=features.dtype, device=device
    )
    # `valid` shape: [b, nx*ny*nz]
    if valid.any():
        for b in range(batch):
            volume[b, :, valid[b]] = features[b, :, py[b, valid[b]], px[b, valid[b]]]

    volume = volume.view(batch, channels, nx, ny, nz)
    valid = valid.view(batch, 1, nx, ny, nz)

    return volume, valid

After removing the `if`, my model trains well. The purpose of the `if` was to skip unnecessary indexing to save time. I do know this might cause the GPUs to execute different graphs and diverge between samples, but is that the main reason for the hang? Is data-dependent branching like this generally discouraged in training code? (I found a post like this that obviously uses an `if`, which I assume could lead to even more divergence.)
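
For concreteness, one way to keep a branch like this without letting ranks diverge is to make the decision rank-consistent, e.g. by all-reducing the `valid.any()` flag so that every rank takes the same path in a given iteration. A rough, untested sketch, reusing the variable names from my snippet above (not the actual Atlas code) and assuming the default process group is initialized:

    import torch
    import torch.distributed as dist

    # `valid`, `volume`, `features`, `px`, `py`, `batch` are assumed to come
    # from the snippet above.
    any_valid = valid.any().float()
    if dist.is_available() and dist.is_initialized():
        # Take the max across ranks so every rank sees the same flag.
        dist.all_reduce(any_valid, op=dist.ReduceOp.MAX)
    if any_valid.item() > 0:
        # Ranks with no valid voxels still run the (empty) indexing, so all
        # ranks build the same autograd graph.
        for b in range(batch):
            volume[b, :, valid[b]] = features[b, :, py[b, valid[b]], px[b, valid[b]]]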

Any insights are appreciated!

Do you want to try passing `find_unused_parameters=True` in the DDP wrapper? It tries to support training different graphs across ranks. But it is possible your case is an edge case that cannot be supported by DDP.
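
For reference, outside of a framework the flag is passed directly to the DDP constructor; a minimal sketch (assuming `model` and `local_rank` already exist and the process group is initialized):

    from torch.nn.parallel import DistributedDataParallel as DDP

    # Sketch: `model` and `local_rank` are assumed to exist, and
    # torch.distributed.init_process_group() to have been called already.
    model = model.to(local_rank)
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        find_unused_parameters=True,  # tolerate parameters that get no gradient on some iterations
    )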

Thanks Yanli. Let me give it a try.

In fact, that flag defaults to True in pytorch-lightning (which I'm using), and it only outputs a warning:

find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters.

So it looks like there are no unused parameters.
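
Since the warning suggests turning the flag off when nothing is actually unused, this is roughly how it can be disabled in Lightning; a sketch only, and the exact import has moved between Lightning versions (in recent releases it lives under `strategies`):

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import DDPStrategy

    # Sketch: pass the kwarg through Lightning's DDP strategy, which forwards
    # it to the underlying DistributedDataParallel wrapper.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        strategy=DDPStrategy(find_unused_parameters=False),
    )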