PyTorch DataParallel Not Using Second GPU

Hi,

I am training with data parallelism (nn.DataParallel) across two GPUs.

However, my second GPU is not used:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88       Driver Version: 418.88       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:26:00.0 Off |                  N/A |
| 27%   62C    P2    76W / 280W |   9523MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:27:00.0 Off |                  N/A |
|  0%   31C    P8    11W / 280W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24     C   python                                      9513MiB |
+-----------------------------------------------------------------------------+

How can I make sure the second GPU is also used? I am wrapping my model in nn.DataParallel.

What model and data are you using?
Could you post your code (stripped down) so that we could have a look?

Thanks for the reply,

My model is very simple:

class Net(nn.Module):
    def __init__(self, vocab_size):
        super(Net, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = 300
        self.hidden_size = 300
        self.linear = nn.Linear(1024, 300)
        self.dropout_rate = 0.5
        self.embedding = nn.Embedding(self.vocab_size, self.embed_size)
        self.dropout = nn.Dropout(self.dropout_rate)
        self.LSTM = nn.LSTM(self.embed_size, self.hidden_size, bidirectional=True)
        self.multimodal_linear = nn.Linear(600, 2)

    def forward(self, s, c):
        r = self.linear(c)
        self.LSTM.flatten_parameters()
        embedded = self.embedding(s)
        embedded = self.dropout(embedded)
        # Each batch has the same maxlen, how to make data loader with custom maxlen?
        input_lengths = [sent.shape[1]]* sent.shape[0]
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True)
        output, hidden = self.LSTM(packed, None)
        output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
        bi_text = hidden[0][0,:,:].squeeze()
        o = self.m_linear(torch.cat((bi_text, r), dim=1)) 
        return o

This Net is wrapped by another class, which has some other functions as well.

In that class, say testNet, I have

self.net = Net(..)

and I use testNet = nn.DataParallel(testNet)

and use testNet.module.parameters as the optimizer parameters.
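
To make the structure a bit more concrete, it is roughly like the following sketch (heavily simplified; the class name and details are only illustrative, and the exact place where I apply nn.DataParallel differs slightly in my real code):

import torch
import torch.nn as nn

class TestNet:  # plain Python class, not an nn.Module subclass
    def __init__(self, vocab_size):
        self.net = Net(vocab_size).to('cuda')   # the Net posted above
        self.net = nn.DataParallel(self.net)    # data parallel wrapping
        # the optimizer gets the parameters of the underlying module
        self.optimizer = torch.optim.Adam(self.net.module.parameters(), lr=1e-3)

    def train_step(self, s, c):
        return self.net(s, c)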

Hi @ptrblck, should I add any other details… Could this be related to the DataLoader? I also tried a very large batch size, e.g. 2048 * 4, but the second GPU is still not utilized.

Thanks for the code!
I’ve just tried to run it on our machine and can see that all GPUs are used:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, vocab_size):
        super(Net, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = 300
        self.hidden_size = 300
        self.linear = nn.Linear(1024, 300)
        self.dropout_rate = 0.5
        self.embedding = nn.Embedding(self.vocab_size, self.embed_size)
        self.dropout = nn.Dropout(self.dropout_rate)
        self.LSTM = nn.LSTM(self.embed_size, self.hidden_size, bidirectional=True)
        self.multimodal_linear = nn.Linear(600, 2)

    def forward(self, s, c):
        r = self.linear(c)
        self.LSTM.flatten_parameters()
        embedded = self.embedding(s)
        embedded = self.dropout(embedded)
        # Each batch has the same maxlen, how to make data loader with custom maxlen?
        input_lengths = torch.tensor([10]*s.size(0)).long() #[sent.shape[1]]* sent.shape[0]
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lengths, batch_first=True)
        output, hidden = self.LSTM(packed, None)
        output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
        bi_text = hidden[0][0,:,:].squeeze()
        o = self.multimodal_linear(torch.cat((bi_text, r), dim=1))
        return o


device = 'cuda'

N = 64*64
model = Net(100).to(device)
model = nn.DataParallel(model)
s = torch.randint(0, 100, (N, 10)).to(device)
c = torch.randn(N, 1024).to(device)

for _ in range(100):
    out = model(s, c)
print(out)

I had to fix some minor issues to run the code (self.m_linear should probably be self.multimodal_linear in the forward).
Could you run my code and check if all GPUs are utilized?
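
As an additional quick check (just a suggestion), you could also print the memory allocated on each device right after the forward pass to confirm that both replicas actually received data:

for i in range(torch.cuda.device_count()):
    print('GPU {}: {:.1f} MiB allocated'.format(i, torch.cuda.memory_allocated(i) / 1024**2))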

Thanks @ptrblck,

I tried the code, but I still get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.88       Driver Version: 418.88       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:0C:00.0 Off |                  N/A |
| 44%   69C    P2   247W / 250W |   2579MiB / 11176MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:0D:00.0 Off |                  N/A |
| 28%   32C    P8     8W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The second GPU still shows 0% utilization and only 10MiB of memory usage…

Also, my model is not training very well. When training on multiple GPUs with data parallel, how should I change my learning rate?

That’s a bit strange. Could you post exactly the code you’ve executed so that I could run it on a machine? As explained, I had to modify your last code snippet a bit in order to run it, and I’m afraid we might not be comparing the same code.

Thanks @ptrblck:

The difference is that my wrapper (the model object) is not a subclass of nn.Module. So my model is model.net.module, which I wrap in the data parallel API.

If you don’t want to derive from nn.Module, you might have to implement the parallel calls manually using the functional API, as the vanilla nn.DataParallel uses the nn.Module.forward method to chunk the data etc.
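
As a rough, untested sketch of what that could look like using the primitives in torch.nn.parallel (treat it as a starting point rather than a drop-in solution; the shapes follow my test script above):

import torch
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

def manual_data_parallel(module, s, c, device_ids=(0, 1), output_device=0):
    # split both inputs along the batch dimension and move the chunks to the devices
    s_chunks = scatter(s, device_ids)
    c_chunks = scatter(c, device_ids)
    # create one replica of the module per used device
    replicas = replicate(module, device_ids[:len(s_chunks)])
    # run each replica on its input chunk in parallel
    outputs = parallel_apply(replicas, list(zip(s_chunks, c_chunks)))
    # collect the outputs on the output device
    return gather(outputs, output_device)

net = Net(100).to('cuda:0')   # the fixed Net from my test script above
s = torch.randint(0, 100, (128, 10), device='cuda:0')
c = torch.randn(128, 1024, device='cuda:0')
out = manual_data_parallel(net, s, c)

Alternatively, torch.nn.parallel.data_parallel (the functional version of nn.DataParallel) performs the replicate/scatter/gather steps for you in a single call, as long as you pass it the underlying nn.Module rather than the wrapper.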