- `torch.device('cuda')` will use the default CUDA device, which should be the same as `cuda:0` in the default setup. However, if you are using a context manager as described in this example (e.g. `with torch.cuda.device(1):`), `'cuda'` will refer to the specified device.
- In the default context they will be the same. However, I think `input.cuda()` will also use the default device as in point 1. I would recommend sticking to the `.to()` operator, as it makes it quite easy to write the code in a device-agnostic way (see the first sketch after this list).
- I’m unfortunately not familiar with `torchtext`, but based on the docs your suggestion makes sense. Let’s wait for other answers on this point.
- Yes, that’s right. You’ll see an unbalanced GPU usage, as beautifully explained by @Thomas_Wolf in his blog post.
- Regarding `nn.DistributedDataParallel`, I try to stick to the NVIDIA apex examples (see the second sketch after this list for the basic setup). I’m currently not sure if there is still a difference between the `apex` and PyTorch implementations of `DistributedDataParallel` or if they are on par now. Maybe @mcarilli or @ngimel might have an answer for this point.
- I’m not sure and would guess not. However, I’ve seen some papers explaining that the `momentum` might need to be adapted for large batch sizes. Take this info with a grain of salt and let’s hear other opinions.
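A rough sketch of the native PyTorch `DistributedDataParallel` setup mentioned in point 5 (the apex examples follow a very similar overall structure). It assumes a single node launched via `python -m torch.distributed.launch --nproc_per_node=NUM_GPUS script.py`, and the model/data here are just placeholders:

```python
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn

# --local_rank is passed in by torch.distributed.launch, one process per GPU.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

model = nn.Linear(10, 2).cuda()
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Each process works on its own shard of the data (e.g. via DistributedSampler);
# gradients are averaged across processes during backward().
data = torch.randn(8, 10).cuda()
target = torch.randn(8, 2).cuda()
loss = nn.functional.mse_loss(model(data), target)
loss.backward()
optimizer.step()
```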