Is there a typo in the DataParallel tutorial? The device is only set to "cuda:0".

At the beginning:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

So all the models and data are still just going to the first GPU. Is this a typo, or is something deeper going on?

This is expected: nn.DataParallel uses this GPU (the default device) to store the model and to create the replicas before copying them to the other specified devices.
This blog post explains the workflow in detail.
Note that this communication overhead (scattering the data from and gathering the outputs back to the default device) is also the reason why we recommend using DistributedDataParallel with a single process per GPU for the best performance.
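To make the workflow concrete, here is a minimal sketch (with a toy nn.Linear model chosen purely for illustration): the model lives on the default device cuda:0, DataParallel replicates it to the other GPUs on each forward pass, scatters the input batch, and gathers the outputs back to cuda:0. On a CPU-only machine the same code just runs on the single device.

```python
import torch
import torch.nn as nn

# The default device: nn.DataParallel stores the "master" copy of the
# model here, scatters inputs from it, and gathers outputs back to it.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)      # toy model, for illustration only

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # replicas are created from cuda:0

x = torch.randn(8, 10, device=device)    # full batch on the default device
out = model(x)                           # scattered, run, gathered back
print(out.shape)                         # torch.Size([8, 2]), on cuda:0
```

The scatter/gather on every iteration is exactly the overhead that DistributedDataParallel avoids by giving each GPU its own process and model copy.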


Thanks. From the blog post, it sounds like with nn.DataParallel we should be using loss.mean().backward() instead of loss.backward()?
The official tutorial does not indicate a need to change the backward pass code.

The loss dimension and size won't be increased, so no changes are necessary compared to a single-GPU run (assuming your model returns an output and you calculate the loss "outside" of the model).
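A small sketch of the case described above (criterion and model names are my own, chosen for illustration): the outputs are gathered back to the default device before the loss is computed, so the loss is an ordinary zero-dimensional scalar and loss.backward() works unchanged. A .mean() would only be needed if the loss were computed inside the model's forward, in which case DataParallel would gather one loss value per GPU into a vector.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

criterion = nn.CrossEntropyLoss()        # loss computed "outside" the model

x = torch.randn(8, 10, device=device)
target = torch.randint(0, 2, (8,), device=device)

out = model(x)                           # already gathered on the default device
loss = criterion(out, target)            # scalar, regardless of GPU count
loss.backward()                          # no loss.mean() needed
print(loss.dim())                        # 0 -> a zero-dim (scalar) tensor
```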
