I think I fully understood the tutorial on using the following line of code for data parallelism.
model = nn.DataParallel(model)
I tested it on my machine with 2 GPUs and it worked. However, for a complicated model I recently wrote, the 'Inside'/'Outside' debugging messages show that it doesn't actually train on both GPUs.
I managed to narrow it down to the optimizers. For my purposes, I need two optimizers, each of which contains some shared parameters and some unique parameters.
When I delete everything related to the optimizers and loss criteria, it works just fine on 2 GPUs according to the printed debugging messages. But as soon as the optimizers and loss criteria are added back, it only trains on a single GPU.
Are there any suggestions for making this work?
I'll post the lines relevant to this question below.
import torch
import torch.nn as nn
import torch.optim as optim

device = "cuda"
model = nn.DataParallel(model) # For multi-gpu
model = model.module # For multi-gpu
model.to(device)
optimizer_1 = optim.Adam([{'params': model.shared.parameters()},
                          {'params': model.unique_a.parameters()}],
                         lr=5e-5)
optimizer_2 = optim.Adam([{'params': model.encoder.parameters()},
                          {'params': model.unique_b.parameters()}],
                         lr=5e-5)
criterion = nn.L1Loss()
epochs = 30
for epoch in range(epochs):
    output = model(...)
    loss = criterion(output, target)
    loss.backward()
    optimizer_1.step()
    print("A Outside: output size", output.shape)
Within the model I have a print statement that prints "Inside", and I expect to see "Inside" printed twice and "Outside" printed once, since I have 2 GPUs.
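The print sits in the model's forward, roughly like this (the real model is more complex, but the idea is the same):

def forward(self, x):
    out = self.shared(x)  # simplified; the real forward uses more submodules
    print("Inside: input size", x.size(), "output size", out.size())
    return out

With nn.DataParallel splitting the batch across 2 GPUs, each replica should print "Inside" with roughly half the batch size.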
Please note: when I comment out everything related to the optimizers and loss, I get the output I want. When the optimizers and loss are included, it still runs, but the output clearly shows that only one GPU is being utilized.
Please let me know if you need me to post any other piece of code.
Update: I just commented out everything related to the optimizers and loss criteria, but I still got the same output. (I swear that in my real, larger project the output looks normal when the optimizers and loss are disabled.)
In any case, the current output doesn't look like both GPUs are being utilized. Any suggestions?
The issue with your code is that you are removing the nn.DataParallel wrapper by calling:
model = model.module # For multi-gpu
so remove this line of code.
After removing it, you would also have to change how the optimizers are created, as you need to either access the internal layers via model.module.fc_x.parameters() or create the optimizers before wrapping the model into nn.DataParallel.
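E.g., using the submodule names from your snippet, the fixed setup could look roughly like this:

model = nn.DataParallel(model)  # keep the wrapper
model.to(device)
optimizer_1 = optim.Adam([{'params': model.module.shared.parameters()},
                          {'params': model.module.unique_a.parameters()}],
                         lr=5e-5)
optimizer_2 = optim.Adam([{'params': model.module.encoder.parameters()},
                          {'params': model.module.unique_b.parameters()}],
                         lr=5e-5)

Alternatively, build optimizer_1 and optimizer_2 from model.shared etc. first and wrap the model in nn.DataParallel afterwards; the optimizers keep referencing the same parameter tensors either way.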
nn.DataParallel assigns the passed model to the .module attribute and uses it internally to push the replicas to the different devices, etc.
Accessing this attribute via model.module gives you the original model back and is used, e.g., to store the state_dict without the nn.DataParallel wrapper.
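E.g., a typical checkpointing pattern (the file name is just a placeholder):

torch.save(model.module.state_dict(), "checkpoint.pth")  # keys without the nn.DataParallel prefix
# later, load into a fresh, unwrapped instance of the same architecture:
plain_model.load_state_dict(torch.load("checkpoint.pth"))

Here plain_model stands for a new, unwrapped instance of your model class.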
However, if you access this attribute and assign it back to model, you are basically just removing the nn.DataParallel usage.