Why is DataParallel running slower than normal CUDA execution?

I have been running this PyTorch example on an EC2 p2.8xlarge instance, which has 8 GPUs.

When I run the code as is (with DataParallel), I get the following benchmark:

real	7m19.136s
user	1m39.732s
sys	3m19.564s

And, when I change the code from this:

if not args.distributed:
    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.cuda()
    else:
        model = torch.nn.DataParallel(model).cuda()

to this:

if not args.distributed:
    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.cuda()
    else:
        model = model.cuda()
else:
    model.cuda()

along with adding .cuda() to all the Variables, as instructed here.
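Concretely, the Variable change looks like this (a minimal sketch using the old, pre-0.4 Variable API from that era of PyTorch; the input and target tensors here are placeholders standing in for a batch from the data loader, not the actual training code):

```python
import torch
from torch.autograd import Variable

# Placeholder batch: 4 images, 3 channels, 8x8 (stand-in for real loader output).
input = torch.randn(4, 3, 8, 8)
target = torch.zeros(4, dtype=torch.long)

# Move tensors to the GPU before wrapping, when one is available.
if torch.cuda.is_available():
    input, target = input.cuda(), target.cuda()

# Wrap in Variables, as the pre-0.4 example code does.
input_var = Variable(input)
target_var = Variable(target)
print(input_var.shape)
```

On a GPU machine the guard always fires; the is_available() check just lets the same snippet run on CPU too.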

Now, without DataParallel, these are the benchmarks:

real	6m21.803s
user	1m22.624s
sys	1m26.140s

Model used: alexnet.

Why is DataParallel slower here rather than faster?

I also passed device_ids=[0, 1, 2, 3, 4, 5, 6, 7] to DataParallel (as the instance has 8 GPUs), but it still takes the same time. Can someone help me with this?
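For reference, this is how I passed device_ids (a minimal sketch; the Linear model and random batch are placeholders, not the actual alexnet training code, and the GPU branch is guarded so the snippet also runs on CPU):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the real network.
model = nn.Linear(16, 4)

if torch.cuda.is_available():
    # Replicate the model on all 8 GPUs; each forward pass scatters
    # the batch along dim 0 and gathers outputs back on device_ids[0].
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3, 4, 5, 6, 7]).cuda()

batch = torch.randn(32, 16)
if torch.cuda.is_available():
    batch = batch.cuda()

out = model(batch)
print(out.shape)
```

With a batch of 32 and 8 GPUs, each replica only sees 4 samples per step, so the per-GPU work may be too small to amortize the scatter/replicate/gather overhead.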