DataParallel freezes

I just got a new machine with two GTX 1080 Ti GPUs, so I wanted to try nn.DataParallel for faster training. I wrote a small test script to make sure nn.DataParallel works, but it seems to get stuck in forward().


import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

class RandomDataset(Dataset):
	def __init__(self):
		# 123 random samples with 15 features each
		self.a = torch.randn(123, 15)

	def __getitem__(self, index):
		return self.a[index]

	def __len__(self):
		return 123

class Model(nn.Module):
	def __init__(self, i_s, o_s):
		super(Model, self).__init__()
		self.fc = nn.Linear(i_s, o_s)

	def forward(self, x):
		output = self.fc(x)
		return output

dl = DataLoader(dataset=RandomDataset(), batch_size=16, shuffle=True)
model = Model(15, 3)
print(torch.cuda.device_count())
model = nn.DataParallel(model)
model.to(device)
print('it works!!')

for data in dl:
	x = data.to(device)
	y = model(x)  # gets stuck here when the model is wrapped in nn.DataParallel
	print('!!it works')

Training works perfectly if I don’t use nn.DataParallel.
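For reference, the only difference in the version that works is that I skip the wrapper:

model = Model(15, 3)
# model = nn.DataParallel(model)  # with this line commented out, the loop finishes
model.to(device)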

I did not know anything about setting up CUDA because I was using a cloud service before, so I just followed the steps here.

Does it throw any error, or does it just freeze?

It just freezes without any error, so I have to restart the computer every time.

That's strange. Can you train on each GPU independently? Have you tried both GPUs? Does nvidia-smi work?

I changed the line below

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

to this, which enabled me to train successfully on the second GPU:

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
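A more compact way to smoke-test both cards from one script (just a minimal sketch, separate from my training code) is to run a small matmul on each device in turn:

import torch

for idx in range(torch.cuda.device_count()):
	dev = torch.device(f'cuda:{idx}')
	a = torch.randn(1024, 1024, device=dev)
	b = torch.randn(1024, 1024, device=dev)
	c = a @ b  # small kernel that only touches this device
	print(dev, 'ok, checksum', c.sum().item())  # .item() waits for the kernel to finish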

nvidia-smi shows the result below while I am training on both GPUs at the same time with two separate scripts (one per GPU). This shows that each of my GPUs can train on its own, so is the problem in DataParallel?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:0A:00.0  On |                  N/A |
| 41%   71C    P2    71W / 250W |   1854MiB / 11175MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:0B:00.0 Off |                  N/A |
| 23%   52C    P2   324W / 250W |   1481MiB / 11178MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1216      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1254      G   /usr/bin/gnome-shell                          50MiB |
|    0      1456      G   /usr/lib/xorg/Xorg                           170MiB |
|    0      1587      G   /usr/bin/gnome-shell                         141MiB |
|    0      9307      C   python3                                     1469MiB |
|    1      9575      C   python3                                     1469MiB |
+-----------------------------------------------------------------------------+
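Since each card trains fine on its own, what DataParallel adds on top is the cross-GPU traffic (replicating the module and scattering/gathering tensors between devices). Below is a minimal sketch (plain PyTorch, nothing from my training code) to check whether a direct cuda:0 → cuda:1 copy completes at all; if this also hangs, the freeze is probably in the GPU-to-GPU path rather than in DataParallel itself:

import torch

x = torch.randn(1000, 1000, device='cuda:0')
y = x.to('cuda:1')  # device-to-device copy, the kind of transfer DataParallel relies on
print('copy finished:', y.device, y.sum().item())  # .item() forces a sync, so a hang would show up here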

I encountered the same issue described above, and I resolved it by changing

model = ResNet().to(device)
model = nn.DataParallel(model)

to

model = ResNet()
model = nn.DataParallel(model).to(device)

In my case, it works.

BTW, the model's parameters and buffers need to be on device_ids[0] (which is also the default output_device), as the official documentation mentions.
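Applied to the toy script from the first post, the ordering that worked for me looks like this (just a sketch; Model, dl, and device are the objects defined above):

model = Model(15, 3)
model = nn.DataParallel(model)  # wrap first ...
model = model.to(device)  # ... then move the parameters onto cuda:0 (device_ids[0])

for data in dl:
	x = data.to(device)  # inputs go to the source device; DataParallel scatters them across the GPUs
	y = model(x)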

Thanks,