Is there a typo in the DataParallel tutorial? The device is only set to "cuda:0".

At the beginning:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

So all the models and data are still just going to the first GPU. Is this a typo, or is something deeper going on?

This is expected: nn.DataParallel uses this GPU (the default device) to store the model and to create the replicas before copying them to the other specified devices.
This blog post explains the workflow in detail.
Note that this communication overhead (scattering the data from and gathering the outputs back to the default device) is also the reason why we recommend using DistributedDataParallel with a single process per GPU for the best performance.
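To make the workflow concrete, here is a minimal sketch (with a toy nn.Linear model chosen purely for illustration): the model lives on the default device cuda:0, DataParallel replicates it to the other GPUs on each forward pass, scatters the input batch, and gathers the outputs back to cuda:0. On a CPU-only machine the same code just runs on the single device.

```python
import torch
import torch.nn as nn

# The default device: nn.DataParallel stores the "master" copy of the
# model here, scatters inputs from it, and gathers outputs back to it.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)      # toy model, for illustration only

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # replicas are created from cuda:0

x = torch.randn(8, 10, device=device)    # full batch on the default device
out = model(x)                           # scattered, run, gathered back
print(out.shape)                         # torch.Size([8, 2]), on cuda:0
```

The scatter/gather on every iteration is exactly the overhead that DistributedDataParallel avoids by giving each GPU its own process and model copy.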


Thanks. From the blog post, it sounds like with nn.DataParallel we should be using loss.mean().backward() instead of loss.backward()?
The official tutorial does not indicate a need to change the backward pass code.

The loss dimension and size won't be increased, so no changes are necessary compared to a single-GPU run (assuming your model returns an output and you calculate the loss "outside" of the model).
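A small sketch of the case described above (criterion and model names are my own, chosen for illustration): the outputs are gathered back to the default device before the loss is computed, so the loss is an ordinary zero-dimensional scalar and loss.backward() works unchanged. A .mean() would only be needed if the loss were computed inside the model's forward, in which case DataParallel would gather one loss value per GPU into a vector.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

criterion = nn.CrossEntropyLoss()        # loss computed "outside" the model

x = torch.randn(8, 10, device=device)
target = torch.randint(0, 2, (8,), device=device)

out = model(x)                           # already gathered on the default device
loss = criterion(out, target)            # scalar, regardless of GPU count
loss.backward()                          # no loss.mean() needed
print(loss.dim())                        # 0 -> a zero-dim (scalar) tensor
```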
