Initialization Failure of nn.DataParallel on NVIDIA A6000

PyTorch version: 1.10.0
GPU: NVIDIA RTX A6000

I have recently run into an issue when using DataParallel on A6000 GPUs. When a module is wrapped with DataParallel, the weights on the auxiliary GPUs (everything other than cuda:0) are all zero; the correct initialization is never copied over. As a result, the splits of the input batch evaluated on those replicas produce incorrect outputs.

Strangely, the weights stay at zero, yet the gathered output is sometimes unpredictable: it is non-zero on later iterations even though the output printed inside forward, just before being returned, is all zeros. I've attached a script that reproduces the issue and prints the incorrect weights.

import os
import torch
import torch.nn as nn

os.environ['NCCL_P2P_DISABLE'] = '1'
os.environ['NCCL_IB_DISABLE'] = '1'
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'

class Test(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(10, 10)
        self.a.weight.data *= 100.

    def forward(self, batch):
        print(batch['tensor'])    # Input split is scattered properly
        print(self.a.weight)      # Weight is not replicated properly (all zeros beyond cuda:0)

        out = self.a(batch['tensor'])

        print(out)                # Output is all zeros for GPUs beyond 0
        print(out.sum())          # Sum is zero for GPUs beyond 0
        return out


t = Test().cuda()
opt = torch.optim.Adam(t.parameters())
tt = nn.DataParallel(t)

tensor = {"tensor": torch.randn(2, 10)}

for i in range(10000):
    print(f'Iteration: {i}')
    opt.zero_grad()
    a = tt(tensor)

    # The gathered output is wrong for the split dispatched to GPU 1 and beyond.
    # Oddly, after the first iteration it is not always zero either;
    # it is unclear where that value comes from.
    print(a)

    print(a.sum())                 # Not equal to the sum of the individual per-GPU outputs
    a.sum().backward()
    opt.step()
    a.sum().backward()
    opt.step()

I have tested the same code on Quadro RTX 6000 GPUs with the same PyTorch version and it behaves correctly, so the issue appears to be specific to the A6000s.
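To narrow this down further, it may help to check whether plain device-to-device copies, which is essentially what DataParallel relies on when replicating the parameters, are already corrupted on this machine. The sketch below is only a minimal diagnostic, assuming torch.cuda.can_device_access_peer is available in this PyTorch version:

import torch

src = torch.full((10, 10), 3.0, device='cuda:0')
for dst_idx in range(1, torch.cuda.device_count()):
    # Copy the tensor directly from cuda:0 to each auxiliary GPU
    dst = src.to(f'cuda:{dst_idx}')
    torch.cuda.synchronize(dst_idx)
    # Compare on the host so a broken copy cannot mask the mismatch
    print(f'cuda:0 -> cuda:{dst_idx} copy intact:', torch.equal(dst.cpu(), src.cpu()))
    # Report whether peer-to-peer access is enabled between the two devices
    print(f'P2P access 0 -> {dst_idx}:', torch.cuda.can_device_access_peer(0, dst_idx))

If these copies already come back as zeros, the problem would sit below DataParallel (at the driver/P2P level) rather than in the DataParallel wrapper itself.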

It sounds as if your system cannot communicate properly between the GPUs, so I would recommend running the NCCL tests and checking the numerical errors as well as the bandwidth.
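If the standalone nccl-tests binaries are not at hand, a rough stand-in would be a small all_reduce correctness check via torch.distributed with the NCCL backend. This is only a minimal sketch (the master address and port are arbitrary local values), not a replacement for the full nccl-tests suite:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Arbitrary local rendezvous settings for this single-node check
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1024, device='cuda') * (rank + 1)
    dist.all_reduce(x)    # default op is SUM across all ranks
    expected = world_size * (world_size + 1) / 2
    print(f'rank {rank}: all_reduce correct = {bool((x == expected).all())}')
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

A wrong or inconsistent result here would point at the same inter-GPU communication problem independently of DataParallel.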