Hi, I have a model as such:
    model = MyNetwork()

    if self.hparams.nb_gpus > 1:
        model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

    device = torch.device("cuda:0")
    model.to(device)

    for data in loader:  # loader is a DataLoader, so it's iterated, not called
        gpu_data = data.to(device)
        # ------------> HANGS IN THE MODEL CALL...
        out = model(gpu_data)
Q1: If I don't wrap the model in nn.DataParallel, this works just fine on a single GPU. Any ideas? I'm running this on an HPC compute cluster managed by SLURM, reserving a full node with 4 GPUs (1080 Tis).
Q2: Do I need a separate torch.device call for each GPU? i.e.:

    device_a = torch.device("cuda:0")
    device_b = torch.device("cuda:1")
    device_c = torch.device("cuda:2")
    device_d = torch.device("cuda:3")
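For context, here is a minimal, self-contained version of the pattern I'm using. The tiny MyNetwork and the list-based "loader" are stand-ins for my real model and DataLoader; the snippet falls back to CPU so it also runs on a machine without CUDA:

```python
import torch
import torch.nn as nn

# Stand-in for my real model (hypothetical, just for illustration).
class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)

model = MyNetwork()

# Wrap only when more than one GPU is visible; DataParallel replicates
# the module across device_ids and scatters each batch across them.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))

# A single device object pointing at the first GPU is what I use;
# CPU fallback here only so the snippet runs anywhere.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

# Stand-in loader: two batches of shape (batch, features).
loader = [torch.randn(4, 8) for _ in range(2)]

for data in loader:
    gpu_data = data.to(device)  # inputs go to the first device only
    out = model(gpu_data)       # <-- this is the call that hangs on the cluster
```

On a single GPU (or CPU) this loop completes and `out` has shape (4, 2); with the DataParallel wrapper on the 4-GPU node it hangs at the model call.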