Same code meets out-of-memory problems in PyTorch 0.4

The problem occurs during validation: the input image is 1024 × 2048, with 4 images on 4 GTX 1080 Ti GPUs (one image per GPU). I get this error:
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train_cityscape.py", line 229, in <module>
    train(cfg)
  File "train_cityscape.py", line 141, in train
    outputs = model(images_val)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lxt/project/seg-pytorch/model/denseaspp121.py", line 114, in forward
    feature = self.features(_input)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lxt/project/seg-pytorch/model/denseaspp121.py", line 237, in forward
    new_features = super(_DenseLayer, self).forward(x)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 49, in forward
    self.training or not self.track_running_stats, self.momentum, self.eps)
  File "/home/lxt/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1194, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

However, in PyTorch 0.3.1, the same code runs fine.

It is very confusing: I found that a GTX 1080 Ti (11 GB memory) cannot handle even a single image of size 3 × 1024 × 2048 with an FCN-like CNN model based on DenseNet in PyTorch 0.4.0. However, PyTorch 0.3.1 can handle it.

Have you wrapped your validation code in a torch.no_grad() block?
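For reference, a minimal sketch of what that looks like; the model, loader, and metric names here are placeholders, not the poster's actual code:

```python
import torch

def validate(model, val_loader, device="cuda"):
    model.eval()  # put BatchNorm/Dropout layers into eval mode
    with torch.no_grad():  # autograd records no graph, so intermediate
                           # activations are freed instead of kept for backward
        for images_val, labels_val in val_loader:
            images_val = images_val.to(device)
            outputs = model(images_val)
            # ... compute your validation metrics here ...
```

Without the context manager, every forward pass during validation stores activations for a backward pass that never happens, which is why memory runs out on large inputs.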


Thank you very much! I have only been using PyTorch 0.4 for a day. The problem was solved by that context manager. So does every tensor require gradient computation by default?

The default settings weren't changed; only the volatile flag was deprecated.
In its place, some context managers were introduced.
Have a look at the Migration Guide for more examples.
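A short sketch of the change, assuming the old code used the volatile flag the way the 0.3.x docs described:

```python
import torch

# PyTorch 0.3.x style (deprecated in 0.4):
#   images_val = Variable(images_val, volatile=True)
#   outputs = model(images_val)

# PyTorch 0.4 style: use the context manager instead.
x = torch.randn(1, 3, 4, 4)   # plain tensors do NOT require grad by default
print(x.requires_grad)        # False

with torch.no_grad():
    y = x * 2                 # no autograd graph is recorded inside this block
print(y.requires_grad)        # False
```

So the default is still requires_grad=False for newly created tensors; the OOM came from model parameters requiring grad, which makes every forward pass build a graph unless it runs under torch.no_grad().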