Getting CUDA out of memory errors with GPU masking

I use GPU masking on Ubuntu to switch between training on a Titan X Pascal (12 GB memory) and a GeForce GTX 1080 Ti (11 GB memory) with the syntax below.


Recently, PyTorch code that used to run without problems under GPU masking has started throwing CUDA out of memory errors constantly, even when the exposed GPU has plenty of free memory.

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/

I even tested several of the official tutorials from the PyTorch website, and they trigger the same error, so it does not appear to be a problem with my code.

I’m not sure what the issue is, as nothing else has changed on my end. Has there been an update to PyTorch or CUDA that may be behind this?

This is super weird. Can you check whether the same issue occurs with a source install?

I’m interested in hunting this down.
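
While narrowing it down, it may help to confirm which physical card the mask actually exposes and how much memory is already in use on it. The sketch below assumes the masking is done through the `CUDA_VISIBLE_DEVICES` environment variable (an assumption about your setup). By default CUDA may number devices differently from `nvidia-smi`, so the sketch also sets `CUDA_DEVICE_ORDER=PCI_BUS_ID` to make the two agree. The helper names `parse_gpu_rows` and `gpu_memory_report` are made up for illustration.

```python
import os
import shutil
import subprocess

# Assumption: masking is done via CUDA_VISIBLE_DEVICES. Both variables must be
# set before the first CUDA call in the process (e.g. before any .cuda() call).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs the way nvidia-smi does
os.environ["CUDA_VISIBLE_DEVICES"] = "0"        # hypothetical choice: expose GPU 0 only

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse nvidia-smi CSV rows into dicts; memory values are in MiB."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 4:
            continue
        try:
            rows.append({
                "index": int(parts[0]),
                "name": ",".join(parts[1:-2]),
                "used_mib": int(parts[-2]),
                "total_mib": int(parts[-1]),
            })
        except ValueError:
            # Skip rows where a field is reported as unavailable.
            continue
    return rows


def gpu_memory_report():
    """Return per-GPU memory usage, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    # nvidia-smi ignores the mask and lists every GPU in PCI bus order.
    for gpu in gpu_memory_report():
        free = gpu["total_mib"] - gpu["used_mib"]
        print(f"GPU {gpu['index']} ({gpu['name']}): {free} MiB free of {gpu['total_mib']} MiB")
```

If the card you meant to expose shows most of its memory in use before your script starts, the error is probably coming from something else holding that memory rather than from the script itself.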

In my case the code used all of the GPU memory. The GPU in my server has 11 GB, which is often not enough for all the users in my group, even though 11 GB is a standard size.
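
On a shared machine like that, a quick way to see who is holding the memory is to ask `nvidia-smi` for the list of compute processes. This is only a rough sketch: `parse_compute_apps` and `compute_apps_report` are names made up for this example, and the memory column can come back as unavailable on some setups, which the parser records as `None`.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(text):
    """Parse per-process rows into dicts, largest memory user first (MiB)."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue
        try:
            pid = int(parts[0])
        except ValueError:
            continue
        try:
            used = int(parts[-1])
        except ValueError:
            used = None  # memory usage not reported for this process
        rows.append({"pid": pid, "name": ",".join(parts[1:-1]), "used_mib": used})
    # Processes with unknown usage sort last.
    rows.sort(key=lambda r: -1 if r["used_mib"] is None else r["used_mib"], reverse=True)
    return rows


def compute_apps_report():
    """Return processes using GPU memory, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for proc in compute_apps_report():
        print(f"pid {proc['pid']} ({proc['name']}): {proc['used_mib']} MiB")
```

A leftover Python process from an earlier run that never exited is a common culprit; it will show up in this list still holding its allocation.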