DataParallel fails in eval() mode, a bug?

Hi,
I'm training a CNN with batch normalization layers, and my loss function is based on the L2 distance.
When I train the network with DataParallel, its results degrade dramatically in eval() mode compared to train() mode.
I understand the difference between eval() and train(), and that performance may differ somewhat between them. However, I still think this issue is related to DataParallel: when I train the model without DataParallel, i.e., on a single GPU, the evaluation results in eval() and train() modes remain almost identical.
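
For what it's worth, I suspect the cause is that DataParallel computes BatchNorm batch statistics per replica, so the running estimates (which are what eval() uses) are updated from only batch_size / num_gpus samples at a time. A minimal sketch of that effect, simulated on CPU without any DataParallel:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A full batch of 64 samples with non-trivial mean and variance.
x = torch.randn(64, 16) * 3 + 5

bn_single = nn.BatchNorm1d(16)   # "single GPU": sees the full batch
bn_single.train()
bn_single(x)

bn_replica = nn.BatchNorm1d(16)  # "DataParallel replica": sees only half the batch
bn_replica.train()
bn_replica(x[:32])

# The running estimates already diverge after one update; over many
# training steps they drift further apart, which only shows up in eval().
diff = (bn_single.running_var - bn_replica.running_var).abs().max().item()
print(f"max running_var difference after one step: {diff:.4f}")
```

The per-GPU means average out, but the per-GPU variances do not equal the full-batch variance, and on top of that only one replica's buffer updates survive, so the smaller effective batch makes the running statistics noisier.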


Maybe try doubling the batch size?

Thanks for the response. In my case, I cannot increase the batch size due to memory limitations. However, I've noticed that when I evaluate the model trained with DataParallel, every change I make to, e.g., the batch size or the number of GPUs significantly degrades the performance, even when I evaluate in train() mode.
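
If per-replica BatchNorm statistics are indeed the culprit, one option (assuming you can switch from DataParallel to DistributedDataParallel, which SyncBatchNorm requires) is to convert the BatchNorm layers so statistics are computed across all GPUs. A sketch of just the conversion step, using a small stand-in model:

```python
import torch.nn as nn

# Hypothetical small model standing in for the CNN in this thread.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm layer with SyncBatchNorm. Under
# DistributedDataParallel, these compute batch statistics across all
# processes instead of per replica, so eval() and train() stay consistent.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm
```

The converted model then needs to be wrapped in DistributedDataParallel inside an initialized process group; plain DataParallel does not support synchronized statistics.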

I'm running into the same problem.
Here is the traceback:
File "/home/wen/PycharmProjects/Attention-Echino/train.py", line 141, in evaluate
output = model(input, target)
File "/home/wen/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/wen/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.gather(outputs, self.output_device)
File "/home/wen/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/wen/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/wen/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/wen/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/wen/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: (gather at torch/csrc/cuda/comm.cpp:177)
frame #0: + 0xc48aea (0x7fc338656aea in /home/wen/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #1: + 0x39124b (0x7fc337d9f24b in /home/wen/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: _PyCFunction_FastCallDict + 0x154 (0x563b9f076744 in /home/wen/anaconda3/bin/python)
frame #3: + 0x19842c (0x563b9f0fd42c in /home/wen/anaconda3/bin/python)
frame #4: _PyEval_EvalFrameDefault + 0x30a (0x563b9f12238a in /home/wen/anaconda3/bin/python)
frame #5: + 0x1918e4 (0x563b9f0f68e4 in /home/wen/anaconda3/bin/python)
frame #6: + 0x192771 (0x563b9f0f7771 in /home/wen/anaconda3/bin/python)
frame #7: + 0x198505 (0x563b9f0fd505 in /home/wen/anaconda3/bin/python)
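
The RuntimeError message is cut off, so this is just a guess, but a common trigger for gather failures in DataParallel is a final batch that is smaller than the number of GPUs or that produces mismatched per-replica output shapes. A quick way to rule that out is to drop the incomplete last batch in the DataLoader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real one: 100 samples, batch size 16.
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# drop_last=True discards the 4-sample remainder, so every batch that
# reaches DataParallel can be split evenly across the replicas.
loader = DataLoader(dataset, batch_size=16, drop_last=True)
print(len(loader))  # 6 full batches; the incomplete last batch is dropped
```

If the error persists with drop_last=True, it would be worth checking whether the model returns outputs whose shapes differ across replicas (e.g., 0-dim scalars or shapes that depend on the input batch size).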