Hi guys,
I tested a simple example with nn.DataParallel() to use multiple GPUs, but it hangs.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

class MyNet(nn.Module):
    def __init__(self):
        super(MyNet, self).__init__()
        self.linear = nn.Linear(2, 1)

    def forward(self, x):
        h = self.linear(x)
        return h

epochs = 2000
lr = 1e-3
momentum = 0
w_decay = 1e-5

# dummy data: 288 samples, 2 features each, all labeled 0
train_data = torch.randn(288, 2)
train_label = torch.zeros([288], dtype=torch.long)

num_gpu = list(range(torch.cuda.device_count()))
model = nn.DataParallel(MyNet().cuda(0), device_ids=num_gpu)  # .cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=w_decay)

print("Starting training")
model.train()
for epoch in range(epochs):
    optimizer.zero_grad()
    inputs = Variable(train_data.cuda(0))
    labels = Variable(train_label.cuda(0))
    outputs = model(inputs)  # <-- hangs here
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    print("epoch {}, loss: {}".format(epoch, loss.data.item()))
It hangs when I try to forward the data through the model (the line marked above). nvidia-smi gives:
Thu Aug 16 09:56:56 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.67 Driver Version: 390.67 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN V Off | 00000000:1B:00.0 Off | N/A |
| 28% 39C P8 25W / 250W | 1087MiB / 12066MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN V Off | 00000000:1C:00.0 Off | N/A |
| 28% 41C P2 39W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN V Off | 00000000:1D:00.0 Off | N/A |
| 31% 45C P2 41W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN V Off | 00000000:1E:00.0 Off | N/A |
| 31% 45C P2 40W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 TITAN V Off | 00000000:3D:00.0 Off | N/A |
| 28% 39C P2 38W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 TITAN V Off | 00000000:3E:00.0 Off | N/A |
| 28% 41C P2 40W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 TITAN V Off | 00000000:3F:00.0 Off | N/A |
| 28% 40C P2 38W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 TITAN V Off | 00000000:40:00.0 Off | N/A |
| 31% 45C P2 40W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 8 TITAN V Off | 00000000:41:00.0 Off | N/A |
| 29% 43C P2 41W / 250W | 1087MiB / 12066MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 131331 C python 1076MiB |
| 1 131331 C python 1076MiB |
| 2 131331 C python 1076MiB |
| 3 131331 C python 1076MiB |
| 4 131331 C python 1076MiB |
| 5 131331 C python 1076MiB |
| 6 131331 C python 1076MiB |
| 7 131331 C python 1076MiB |
| 8 131331 C python 1076MiB |
+-----------------------------------------------------------------------------+
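To narrow things down, here is a minimal check I put together (separate from the training script above; the tensor size and the loop over devices are arbitrary) to see whether plain GPU-to-GPU copies already hang, independent of DataParallel:

import torch

# Sanity check: copy a tensor from GPU 0 to every other GPU.
# If this already hangs, the problem is the raw device-to-device
# transfer, not DataParallel itself.
x = torch.randn(1024, 1024).cuda(0)
for i in range(1, torch.cuda.device_count()):
    y = x.to('cuda:{}'.format(i))  # device-to-device copy
    torch.cuda.synchronize()       # wait for the copy to finish
    print("copy from GPU 0 to GPU {} OK".format(i))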
I have tried the solution in this thread, but it didn’t work.
I use
- CUDA 9.1.85
- PyTorch 0.4.1 (installed via pip)
- Python 2.7.13
- Debian 4.9.110-3+deb9u1 (2018-08-03) x86_64 GNU/Linux
- 9 TITAN V cards
Any ideas on how to solve this? Or should I report the issue to NVIDIA?
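For reference, the next thing I plan to try (just an idea on my part; the device pairs below are arbitrary) is restricting DataParallel to two GPUs at a time, to check whether the hang depends on which cards are combined:

# Bisecting idea (not tested yet): run with only a pair of GPUs and vary
# the pair, e.g. cards that sit on different PCIe root complexes.
model = nn.DataParallel(MyNet().cuda(0), device_ids=[0, 1])  # also try [0, 4], [0, 8], ...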