I use GPU masking on Ubuntu to switch between training on a Titan X (Pascal, 12 GB of memory) and a GeForce GTX 1080 Ti (11 GB of memory) with the syntax below.
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0
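For reference, the equivalent setup in a plain Python script (rather than the `%env` Jupyter magics) looks like the sketch below; the key point is that CUDA reads these variables once at initialization, so they must be set before the first CUDA call.

```python
import os

# CUDA caches these on first initialization, so set them before the first
# CUDA call -- in practice, at the very top of the script, before `import torch`.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # order GPUs by PCI bus, not by compute capability
os.environ["CUDA_VISIBLE_DEVICES"] = "0"         # expose only the first GPU by bus order

# import torch  # must come after the lines above for the mask to take effect
```

With the mask in place, the single exposed GPU appears to PyTorch as `cuda:0` regardless of its physical index.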
Recently, PyTorch code that previously ran without problems under this GPU masking has been constantly throwing CUDA out-of-memory errors, even though the exposed GPU has plenty of free memory.
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66
I even tested several of the official tutorials from the PyTorch website, and they trigger the same error, so it does not appear to be an issue with my code.
I’m not sure what the issue is, as nothing else has changed on my end. Has there been an update to PyTorch or CUDA that might be behind this?