DataParallel freezes

eric01300 · February 2, 2019, 5:52am

I just got a new machine with 2 gtx1080ti, so I wanted to try using nn.DataParallel for faster training. I have created a test code to make sure nn.DataParallel works, but it seems to get stuck in the forward().


import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
class RandomDataset(Dataset):
	def __init__(self):
		self.a = torch.randn(123, 15)
		pass
	def __getitem__(self, index):
		return self.a[index]
	
	def __len__(self):
		return 123

class Model(nn.Module):
	def __init__(self, i_s, o_s):
		super(Model, self).__init__()
		self.fc = nn.Linear(i_s, o_s)
	def forward(self, x):
		output = self.fc(x)
		return output

dl = DataLoader(dataset=RandomDataset(), batch_size=16, shuffle=True)
model = Model(15, 3)
print(torch.cuda.device_count())
model = nn.DataParallel(model)


model.to(device)
print('it works!!')
for data in dl:
	x = data.to(device)
	y = model(x)
	print('!!it works')

Training works perfectly if I don’t use nn.DataParallel.

I did not know anything about setting up cuda because I was using cloud service previously, so I just followed the steps here.

JuanFMontesinos · February 2, 2019, 11:15pm

Does it throw any error? or just get frozen

JuanFMontesinos · February 3, 2019, 12:33pm

Such a strange thing. Can you train in both gpus independently? Have you tried both gpus? Nvidia-smi works?

sh0416 · December 3, 2019, 2:09pm

I encountered the same issue above, and I resolve this issue by chaning the following way.

model = ResNet().to(device)
model = nn.DataParallel(model)

to

model = ResNet()
model = nn.DataParallel(model).to(device)

In my case, it works.

BTW, the model should be in the output_device as the official document mentioned.

Thanks,