PyTorch DataParallel Not Using Second GPU

Hi,

I am using nn.DataParallel across two GPUs.

However, my second GPU is not being used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88       Driver Version: 418.88       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:26:00.0 Off |                  N/A |
| 27%   62C    P2    76W / 280W |   9523MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:27:00.0 Off |                  N/A |
|  0%   31C    P8    11W / 280W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24     C   python                                      9513MiB |
+-----------------------------------------------------------------------------+

How can I make sure the second GPU is also used? I am wrapping my model in nn.DataParallel.
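
A quick way to confirm what the process can actually see (a minimal sketch using standard PyTorch calls, not my training script):

import torch

# If this prints 1 instead of 2, the process only sees one device
# (e.g. because of CUDA_VISIBLE_DEVICES) and nn.DataParallel has
# nothing to split the batch over.
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))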

What model and data are you using?
Could you post your code (stripped down) so that we could have a look?

Thanks for the reply,

My model is very simple:

class Net(nn.Module):
    def __init__(self, vocab_size):
        super(Net, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = 300
        self.hidden_size = 300
        self.linear = nn.Linear(1024, 300)
        self.dropout_rate = 0.5
        self.embedding = nn.Embedding(self.vocab_size, self.embed_size)
        self.dropout = nn.Dropout(self.dropout_rate)
        self.LSTM = nn.LSTM(self.embed_size, self.hidden_size, bidirectional=True)
        self.multimodal_linear = nn.Linear(600, 2)

    def forward(self, s, c):
        r = self.linear(c)
        self.LSTM.flatten_parameters()
        embedded = self.embedding(s)
        embedded = self.dropout(embedded)
        # Each batch has the same maxlen, how to make data loader with custom maxlen?
        input_lengths = [sent.shape[1]]* sent.shape[0]
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True)
        output, hidden = self.LSTM(packed, None)
        output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
        bi_text = hidden[0][0,:,:].squeeze()
        o = self.m_linear(torch.cat((bi_text, r), dim=1)) 
        return o

This is wrapped by another class, which has some other functions as well.

In that class (say testNet) I have:

self.net = Net(..)

Then I wrap it with testNet = nn.DataParallel(testNet)

and pass testNet.module.parameters() to the optimizer as its parameters.
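
Put together, my setup looks roughly like this (a stripped-down sketch: the class/instance names, vocab size, optimizer, and learning rate are placeholders, and Net is the class posted above):

import torch
import torch.nn as nn

class TestNet(nn.Module):
    # Placeholder for my wrapper class; it holds the actual Net
    # (defined above) plus some other helper functions.
    def __init__(self, vocab_size):
        super().__init__()
        self.net = Net(vocab_size)

    def forward(self, s, c):
        return self.net(s, c)

testNet = TestNet(vocab_size=100).to('cuda')
testNet = nn.DataParallel(testNet)

# nn.DataParallel scatters the batch inside its own forward(), so the
# model is called through the wrapper, e.g. out = testNet(s, c).
# The parameters are shared with the wrapped module, so the optimizer
# below sees the same tensors as testNet.parameters().
optimizer = torch.optim.SGD(testNet.module.parameters(), lr=0.01)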

@ptrblck Hi, should I add any other details? Could this be related to the DataLoader? I also tried a very large batch size (2048 * 4), but the second GPU is still not utilized.

Thanks for the code!
I’ve just tried to run it on our machine and can see that all GPUs are used:

import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self, vocab_size):
        super(Net, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = 300
        self.hidden_size = 300
        self.linear = nn.Linear(1024, 300)
        self.dropout_rate = 0.5
        self.embedding = nn.Embedding(self.vocab_size, self.embed_size)
        self.dropout = nn.Dropout(self.dropout_rate)
        self.LSTM = nn.LSTM(self.embed_size, self.hidden_size, bidirectional=True)
        self.multimodal_linear = nn.Linear(600, 2)

    def forward(self, s, c):
        r = self.linear(c)
        self.LSTM.flatten_parameters()
        embedded = self.embedding(s)
        embedded = self.dropout(embedded)
        # Each batch has the same maxlen, how to make data loader with custom maxlen?
        input_lengths = torch.tensor([10]*s.size(0)).long() #[sent.shape[1]]* sent.shape[0]
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True)
        output, hidden = self.LSTM(packed, None)
        output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
        bi_text = hidden[0][0,:,:].squeeze()
        o = self.multimodal_linear(torch.cat((bi_text, r), dim=1))
        return o


device = 'cuda'

N = 64*64
model = Net(100).to(device)
model = nn.DataParallel(model)
s = torch.randint(0, 100, (N, 10)).to(device)
c = torch.randn(N, 1024).to(device)

for _ in range(100):
    out = model(s, c)
print(out)

I had to fix some minor issues to run the code (self.m_linear should probably be self.multimodal_linear in the forward).
Could you run my code and check if all GPUs are utilized?
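
Besides nvidia-smi, you could also print the peak memory usage per device directly from Python (a quick sketch); after a few forward passes through the DataParallel model, both devices should report a non-zero peak if the batch is being split:

import torch

# Peak allocated memory per visible device since the start of the process;
# a device that was never touched will report 0.
for i in range(torch.cuda.device_count()):
    peak_mb = torch.cuda.max_memory_allocated(i) / 1024**2
    print(f'cuda:{i}: peak {peak_mb:.1f} MiB allocated')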

Thanks @ptrblck,

I tried the code, but I still get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88       Driver Version: 418.88       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:0C:00.0 Off |                  N/A |
| 44%   69C    P2   247W / 250W |   2579MiB / 11176MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:0D:00.0 Off |                  N/A |
| 28%   32C    P8     8W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The second GPU still shows only 10MiB of memory used and 0% utilization…

Also, my model is not training very well. When training on multiple GPUs with DataParallel, how should I change my learning rate?

That’s a bit strange. Could you post exactly the code you executed, so that I can run it on a machine? As explained, I had to modify your last code snippet a bit in order to run it, and I’m afraid we might not be comparing the same code.