Loss not decreasing using nn.DataParallel

I am using nn.DataParallel on 4 GTX 1080 GPUs, with the following setup:

net = Net().cuda()
net = nn.DataParallel(net)
optimizer = torch.optim.Adam(net.parameters())
criterion = nn.CrossEntropyLoss().cuda() 

The training step is:

pred = net(data)
loss = criterion(pred, label)
optimizer.zero_grad()  # clear old gradients (was mistyped as "optimizer = zero_grad()")
loss.backward()
optimizer.step()

In this case, the training and validation losses both stay around 0.4, whereas training on a single 1080 Ti GPU brings both losses down to 0.05.

Does anyone know what’s wrong? Any suggestions are appreciated.
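
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the setup described above. The `Net` definition, the tensor shapes, the `train_step` helper, and the random data are all placeholders made up for illustration, not the real model or dataset. The script falls back to a single device when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


# Hypothetical placeholder model; the real Net is not shown in this post.
class Net(nn.Module):
    def __init__(self, in_features=20, num_classes=3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.layers(x)


def train_step(net, optimizer, criterion, data, label):
    """Run one optimization step and return the loss as a Python float."""
    net.train()
    optimizer.zero_grad()           # clear gradients from the previous step
    pred = net(data)                # DataParallel splits `data` along dim 0
    loss = criterion(pred, label)   # loss is computed on the gathered output
    loss.backward()
    optimizer.step()
    return loss.item()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = Net().to(device)
if torch.cuda.device_count() > 1:
    # Wrap only when several GPUs are visible, as in the setup above.
    net = nn.DataParallel(net)

optimizer = torch.optim.Adam(net.parameters())
criterion = nn.CrossEntropyLoss()

# Random stand-in batch (assumed shapes: 64 samples, 20 features, 3 classes).
data = torch.randn(64, 20, device=device)
label = torch.randint(0, 3, (64,), device=device)

losses = [train_step(net, optimizer, criterion, data, label) for _ in range(5)]
print(losses)
```

Running the same script once with all four GPUs visible and once with `CUDA_VISIBLE_DEVICES=0` should help show whether the gap comes from nn.DataParallel itself or from something else that differs between the two runs, such as batch size or learning rate.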