1 GPU with nn.DataParallel vs 1 GPU with simple net

Hi there! I have a net for regression that I would like to train on multiple GPUs. The problem is that the Spearman correlation I get with the plain net on a single GPU differs from what I get with torch.nn.DataParallel(self.net, device_ids=[0]) using the same batch size b and learning rate lr. Moreover, when I train the plain net with batch size k*b and scale the learning rate linearly, I get the same correlation, so the linear scaling rule holds; but it does not hold in the torch.nn.DataParallel(self.net, device_ids=[0]) setup. Here are the figures:
| setup | batch size | lr | correlation |
|---|---|---|---|
| simple | 53 | 2e-5 | 0.9190 |
| dataparallel | 53 | 2e-5 | 0.9064 |
| simple | 206 | 7.7735e-05 | 0.9151 |
| dataparallel | 206 | 7.7735e-05 | 0.8690 |

It seems logical to expect the same behaviour in both setups. What could the problem be?
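
For reference, here is a minimal sketch of the two setups I am comparing; the model, optimizer, and dummy data below are placeholders, not my actual regression net:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder regression model; my real net is the one described above.
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()

# Setup 1: the plain net on a single GPU.
# Setup 2: the same net wrapped in DataParallel, restricted to GPU 0.
net_dp = nn.DataParallel(net, device_ids=[0])

# Dummy batch (batch size b = 53, as in the table above).
x = torch.randn(53, 128, device="cuda")
y = torch.randn(53, 1, device="cuda")

criterion = nn.MSELoss()
print(criterion(net(x), y).item())     # forward through the plain net
print(criterion(net_dp(x), y).item())  # forward through the DataParallel wrapper
# With a single device I expected both paths to behave identically during training.
```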