Memory error in batch norm during testing

I am getting a CUDA memory error in a batch norm layer during the testing phase (model.eval()); changing the batch size (even as low as 1) doesn't fix it, and training works fine. Can anyone kindly take a look? The test set size is 64k. Be safe!

File "/home/saby2k13/projects/ctb-ilie/saby2k13/pyDPPI/tune/HyperTuneTwoStreamRPv1/", line 514, in <module>
  File "/home/saby2k13/projects/ctb-ilie/saby2k13/pyDPPI/tune/HyperTuneTwoStreamRPv1/", line 393, in main
    results = test(model.eval(),tst_loader,device)
  File "/home/saby2k13/projects/ctb-ilie/saby2k13/pyDPPI/tune/HyperTuneTwoStreamRPv1/", line 439, in test
    output = net(crop0, crop1).cuda()
  File "/project/6034601/saby2k13/pyDPPI/pyDPPIv2/lib/python3.6/site-packages/torch/nn/modules/", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/saby2k13/projects/ctb-ilie/saby2k13/pyDPPI/tune/HyperTuneTwoStreamRPv1/", line 267, in forward
    output2 = self.forward_once(input2)
  File "/home/saby2k13/projects/ctb-ilie/saby2k13/pyDPPI/tune/HyperTuneTwoStreamRPv1/", line 214, in forward_once
    output = self.bn2(output)
  File "/project/6034601/saby2k13/pyDPPI/pyDPPIv2/lib/python3.6/site-packages/torch/nn/modules/", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/6034601/saby2k13/pyDPPI/pyDPPIv2/lib/python3.6/site-packages/torch/nn/modules/", line 81, in forward
    exponential_average_factor, self.eps)
  File "/project/6034601/saby2k13/pyDPPI/pyDPPIv2/lib/python3.6/site-packages/torch/nn/", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 13.97 GiB already allocated; 3.50 MiB free; 733.67 MiB cached)

Could you check the memory usage on your GPU via nvidia-smi and make sure no other processes are using memory?
Did you change anything else besides calling model.eval(), e.g. the spatial input shapes?