Multi-GPU training hangs due to an `if`

TigerYan86 · August 4, 2022, 7:23pm

Hi,

I recently discovered that my 8-GPU training hangs when this `if` is present. I am using DDP; when it hangs, all GPUs sit at 100% utilization, and it happens at a random epoch in the middle of a job:

(modified from backproject() in Atlas/model.py on the master branch of magicleap/Atlas on GitHub)

    ...
    volume = torch.zeros(
        batch, channels, nx * ny * nz, dtype=features.dtype, device=device
    )
    # `valid` shape: [b, nx*ny*nz]
    if valid.any():
        for b in range(batch):
            volume[b, :, valid[b]] = features[b, :, py[b, valid[b]], px[b, valid[b]]]

    volume = volume.view(batch, channels, nx, ny, nz)
    valid = valid.view(batch, 1, nx, ny, nz)

    return volume, valid

After removing the `if`, my model trains fine. The purpose of the `if` was to skip unnecessary indexing and save time. I know this can make the GPUs execute different graphs and diverge between samples, but is that the main reason for the hang? Is data-dependent branching like this generally discouraged in training code? (I found a post like this that clearly uses an `if`, which I assume could lead to even more divergence.)

Any insights are appreciated!
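
For reference, here is a minimal CPU-only sketch of the guarded and unguarded variants of the indexing step described above. It is not the Atlas code itself: `fill_volume`, its `use_guard` argument, and the toy shapes are made up for illustration. It only shows that the masked assignment already does nothing when no voxel is valid, so dropping the guard should not change the result. Note also that a Python `if` on a CUDA tensor needs the value on the host, so the guard itself is not free.

```python
import torch


def fill_volume(features, px, py, valid, use_guard):
    """Scatter per-voxel features into a flat volume (hypothetical helper).

    features: [b, c, h, w]; px, py: [b, n] integer pixel coordinates;
    valid: [b, n] boolean mask of voxels that project inside the image.
    """
    batch, channels = features.shape[:2]
    n = valid.shape[1]
    volume = torch.zeros(
        batch, channels, n, dtype=features.dtype, device=features.device
    )
    # Guarded variant: Python needs the value of valid.any() on the host,
    # so on CUDA this waits for the GPU before continuing.
    if use_guard and not valid.any():
        return volume
    for b in range(batch):
        # With an all-False mask this assigns zero elements, i.e. a no-op.
        volume[b, :, valid[b]] = features[b, :, py[b, valid[b]], px[b, valid[b]]]
    return volume


torch.manual_seed(0)
b, c, h, w, n = 2, 3, 4, 5, 6  # toy sizes, chosen arbitrarily
features = torch.randn(b, c, h, w)
py = torch.randint(0, h, (b, n))
px = torch.randint(0, w, (b, n))
valid = torch.rand(b, n) > 0.5
none_valid = torch.zeros(b, n, dtype=torch.bool)

# Both variants should agree when some voxels are valid ...
same_when_some_valid = torch.equal(
    fill_volume(features, px, py, valid, use_guard=True),
    fill_volume(features, px, py, valid, use_guard=False),
)
# ... and the unguarded variant should leave the volume at zero when none are.
unguarded_empty = fill_volume(features, px, py, none_valid, use_guard=False)
zeros_when_none_valid = bool((unguarded_empty == 0).all())
print(same_when_some_valid, zeros_when_none_valid)
```

This says nothing about why the guarded version hangs under DDP; it only illustrates that the guard is an optimization rather than something the result depends on.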

Yanli_Zhao · August 9, 2022, 11:47am

Do you want to try passing `find_unused_parameters=True` to the DDP wrapper? It tries to support training with different graphs across ranks. But it is possible that your case is an edge case that DDP cannot support.
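
A minimal sketch of what passing that flag looks like in plain PyTorch, assuming a single-process CPU setup with the gloo backend so the snippet can run on its own. The `nn.Linear` model and the temporary file store are placeholders; a real multi-GPU job gets its rank and world size from the launcher and would also pass `device_ids`.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU process group (world_size=1), only so the example is
# self-contained. The file-based store path is an arbitrary temporary file.
if not dist.is_initialized():
    store_path = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{store_path}",
        rank=0,
        world_size=1,
    )

model = nn.Linear(4, 2)  # placeholder for the real network
# The flag makes DDP traverse the autograd graph each iteration to find
# parameters that did not take part in the forward pass.
ddp_model = DDP(model, find_unused_parameters=True)

loss = ddp_model(torch.randn(8, 4)).sum()
loss.backward()
grad_ready = model.weight.grad is not None

dist.destroy_process_group()
```

Frameworks that wrap the model for you set this flag through their own configuration rather than through a direct `DDP(...)` call.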

TigerYan86 · August 9, 2022, 4:32pm

Thanks, Yanli. Let me give it a try.

TigerYan86 · August 9, 2022, 10:09pm

In fact, that flag is True by default in pytorch-lightning (which I’m using), and it only outputs this warning:

find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters.

So it looks like there are no unused parameters.