I use GPU masking on Ubuntu to switch between training on a Titan X (Pascal, 12 GB of memory) and a GeForce GTX 1080 Ti (11 GB of memory) with the syntax below.
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0
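For reference, the equivalent setup in a plain Python script (rather than the `%env` Jupyter magics) looks like the sketch below; the key point is that CUDA reads these variables once at initialization, so they must be set before the first CUDA call.

```python
import os

# CUDA caches these on first initialization, so set them before the first
# CUDA call -- in practice, at the very top of the script, before `import torch`.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # order GPUs by PCI bus, not by compute capability
os.environ["CUDA_VISIBLE_DEVICES"] = "0"         # expose only the first GPU by bus order

# import torch  # must come after the lines above for the mask to take effect
```

With the mask in place, the single exposed GPU appears to PyTorch as `cuda:0` regardless of its physical index.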
Recently, PyTorch code that previously ran without problems under this GPU masking has been constantly throwing CUDA out-of-memory errors, even though the exposed GPU has plenty of free memory.
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66
I even tested several of the official tutorials from the PyTorch website, and they trigger the same error, so it does not appear to be an issue with my code.
I’m not sure what the issue is, as nothing else has changed on my end. Has there been an update to PyTorch or CUDA that might be behind this?