I have been running this PyTorch example on an EC2 p2.8xlarge instance, which has 8 GPUs.
When I run the code as is (with DataParallel), I get the following benchmark:

    real 7m19.136s
    user 1m39.732s
    sys  3m19.564s

And, when I change the code from this:

    if not args.distributed:
        if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
            model.features = torch.nn.DataParallel(model.features)
            model.cuda()
        else:
            model = torch.nn.DataParallel(model).cuda()

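For context, here is a minimal sketch (using a toy model of my own, not the example script's AlexNet) of what wrapping a model in DataParallel does: on every forward pass, the input batch is scattered across the available GPUs, the model is replicated, the chunks run in parallel, and the outputs are gathered back onto the primary device. For a comparatively small model, that per-batch scatter/replicate/gather overhead can be significant.

```python
import torch
import torch.nn as nn

# Toy stand-in model (an assumption for illustration, not the example's AlexNet).
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))

if torch.cuda.is_available():
    # Each forward pass splits the batch across all visible GPUs and
    # gathers the per-GPU outputs back into a single batch.
    model = nn.DataParallel(model).cuda()
    x = torch.randn(64, 256).cuda()
else:
    # CPU fallback so the sketch still runs without GPUs.
    x = torch.randn(64, 256)

out = model(x)
print(tuple(out.shape))  # (64, 10): one gathered output batch either way
```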
to this (note: the original edit had `model = torch.cuda()`, which is a typo; it should be `model = model.cuda()`):

    if not args.distributed:
        if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
            model.features = model.features
            model.cuda()
        else:
            model = model.cuda()
    else:
        model.cuda()
        model = model

along with adding .cuda() to all the Variables, as instructed here.

Now, without DataParallel, these are the benchmarks:
    real 6m21.803s
    user 1m22.624s
    sys  1m26.140s

Model used: alexnet.

What is the reason for this? Why is the run without DataParallel faster, even though the DataParallel version has 8 GPUs available?