How do I run on multiple GPUs when I have 2 optimizers?

I think I fully understood the tutorial on using the following line of code for data parallelism.

model = nn.DataParallel(model) 

I tested it on my device with 2 GPUs and it worked. However, for a complicated model I recently wrote, it doesn't actually train on both GPUs, judging by the "Inside"/"Outside" debugging messages I printed.

I managed to figure out that it was because of the optimizers. For my purpose, I need two optimizers, each containing some shared parameters and some unique parameters.

model = model.module
optimizer_1 = optim.Adam([{'params': model.shared.parameters()},
                          {'params': model.unique_a.parameters()}],
                         lr=5e-5)
optimizer_2 = optim.Adam([{'params': model.encoder.parameters()},
                          {'params': model.unique_b.parameters()}],
                         lr=5e-5)

I deleted everything related to the optimizers and loss criteria, and it worked just fine on 2 GPUs according to the printed debugging messages. But whenever the optimizers and loss criteria are added back, it trains on only a single GPU.

Does anyone have suggestions on how to make this work?

Thanks!

Could you please post an executable code snippet? The optimizers shouldn't change the behavior of nn.DataParallel in any way.

I'll post the lines relevant to this question below.

device = "cuda"

model = nn.DataParallel(model)  # For multi-gpu
model = model.module  # For multi-gpu
model.to(device)

optimizer_1 = optim.Adam([{'params': model.shared.parameters()},
                          {'params': model.unique_a.parameters()}],
                         lr=5e-5)
optimizer_2 = optim.Adam([{'params': model.encoder.parameters()},
                          {'params': model.unique_b.parameters()}],
                         lr=5e-5)
criterion = nn.L1Loss()
epochs = 30
for epoch in range(epochs):
    output = model(...)
    loss = criterion(output, target)
    loss.backward()
    optimizer_1.step()
    print("A Outside: output size", output.shape)

Within the model I have a print statement that prints "Inside", and I expect to see "Inside" printed twice and "Outside" printed once, since I have 2 GPUs.
Please note: when I comment out everything related to the optimizers and loss, I get the output I want. However, when the optimizers and loss are included, it runs, but the output clearly shows that only one GPU is being utilized.

Please let me know if you need me to post any other piece of code.

Thanks!

Yes, an executable code snippet would be needed to debug this issue further. :slight_smile:

Here it is.

import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

class Model(nn.Module):
    # Our model
    
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc_a = nn.Linear(input_size, output_size)  # Unique A
        self.fc_b = nn.Linear(input_size, output_size)  # Unique B
        self.fc_out = nn.Linear(output_size, output_size)  # Shared layer
    
    def forward(self, input, select='A'):
        if select == 'A':
            input = self.fc_a(input)
        else:
            input = self.fc_b(input)
        output = self.fc_out(input)
        print("\tIn Model: output size", output.size())
        return output

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs

device = "cuda"
model = Model(input_size, output_size)
model = nn.DataParallel(model)  # For multi-gpu
model = model.module  # For multi-gpu
model.to(device)

optimizer_1 = optim.Adam([{'params': model.fc_a.parameters()},
                          {'params': model.fc_out.parameters()}],
                         lr=5e-5
                         )
optimizer_2 = optim.Adam([{'params': model.fc_b.parameters()},
                          {'params': model.fc_out.parameters()}],
                         lr=5e-5
                         )
criterion = nn.L1Loss()

batch_size = 16

print(model)

epochs = 10
for epoch in range(epochs):
    data_a, target_a = torch.randn(batch_size, data_size, input_size), torch.randn(batch_size, data_size, output_size)
    data_b, target_b = torch.randn(batch_size, data_size, input_size), torch.randn(batch_size, data_size, output_size)
    input_a, target_a = Variable(data_a.float()).to(device), Variable(target_a.float()).to(device)
    input_b, target_b = Variable(data_b.float()).to(device), Variable(target_b.float()).to(device)
    
    output_a = model(input_a, select='A')
    loss1 = criterion(output_a, target_a)
    loss1.backward()
    optimizer_1.step()
    print("Outside: output size", output_a.size())
    
    output_b = model(input_b, select='B')
    loss2 = criterion(output_b, target_b)
    loss2.backward()
    optimizer_2.step()
    print("Outside: output size", output_b.size())

Sorry for the lack of comments; please let me know if any explanation is needed.

Running this piece of code, I get the following kind of output:

        In Model: output size torch.Size([16, 100, 2])
Outside: output size torch.Size([16, 100, 2])
        In Model: output size torch.Size([16, 100, 2])
Outside: output size torch.Size([16, 100, 2])
        In Model: output size torch.Size([16, 100, 2])
Outside: output size torch.Size([16, 100, 2])

But I was expecting the following:

        In Model: output size torch.Size([8, 100, 2])
        In Model: output size torch.Size([8, 100, 2])
Outside: output size torch.Size([16, 100, 2])

Oddly, I just commented out everything related to the optimizers and loss criteria, and I still get the same output. (I swear that in my real, larger project the output looks normal when the optimizers and loss are disabled.)
In any case, the current output doesn't look like 2 GPUs are being utilized. Any suggestions?

FYI, I do have 2 GPUs available.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:05:00.0 Off |                  N/A |
| 27%   26C    P8     9W / 250W |      1MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:06:00.0 Off |                  N/A |
| 27%   25C    P8     8W / 250W |      1MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The issue with your code is that you are removing the nn.DataParallel wrapper by calling:

model = model.module  # For multi-gpu

so remove this line of code.
After removing it, you would also have to change how the optimizers are created: either access the internal layers via model.module.fc_x.parameters(), or create the optimizers before wrapping the model in nn.DataParallel, as in the sketch below.
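
For clarity, here is a minimal sketch of both options, reusing the toy Model, input_size, output_size, and learning rate from the snippet above:

# Option 1: keep the nn.DataParallel wrapper and reach the wrapped layers via .module
model = nn.DataParallel(Model(input_size, output_size)).to(device)
optimizer_1 = optim.Adam([{'params': model.module.fc_a.parameters()},
                          {'params': model.module.fc_out.parameters()}],
                         lr=5e-5)
optimizer_2 = optim.Adam([{'params': model.module.fc_b.parameters()},
                          {'params': model.module.fc_out.parameters()}],
                         lr=5e-5)

# Option 2: create the optimizers on the plain model first, then wrap it
model = Model(input_size, output_size)
optimizer_1 = optim.Adam([{'params': model.fc_a.parameters()},
                          {'params': model.fc_out.parameters()}],
                         lr=5e-5)
optimizer_2 = optim.Adam([{'params': model.fc_b.parameters()},
                          {'params': model.fc_out.parameters()}],
                         lr=5e-5)
model = nn.DataParallel(model).to(device)

In both cases the forward pass should go through the wrapped model object, so the batch gets split across both GPUs; only the optimizers need a handle on the underlying parameters.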


Thanks a lot! That solves the whole problem.

A quick follow-up question: why does that line of code remove the nn.DataParallel wrapper? I wasn't expecting that to happen.

nn.DataParallel assigns the passed model to its .module attribute and uses it internally to push the data to the different devices, etc.
Accessing this attribute via model.module gives you back the original model and is used, e.g., to store the state_dict without the nn.DataParallel wrapper.
However, if you access this attribute and assign it back to model, you are effectively just removing the nn.DataParallel wrapper.
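
As a small illustration (the checkpoint filename below is just a placeholder): keep calling the wrapper for the forward pass, and only dip into .module where you really need the plain model, e.g. when saving:

model = nn.DataParallel(Model(input_size, output_size)).to(device)

output = model(input_a, select='A')   # forward through the wrapper: the batch is scattered across GPUs

# access .module only where the plain model is needed, e.g. to save weights without the wrapper
torch.save(model.module.state_dict(), "checkpoint.pt")

# model = model.module  # <- this reassignment would discard the wrapper and disable the multi-GPU splitting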
