Neural style transfer tutorial running out of CUDA memory on V100

Hi,

I’m trying to run the neural style transfer example but am running out of memory:

~/.conda/envs/ONE/lib/python3.7/site-packages/torch/cuda/__init__.py in _lazy_new(cls, *args, **kwargs)
    493         # We need this method only for lazy init, so we can remove it
    494         del _CudaBase.__new__
--> 495         return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
    496
    497

RuntimeError: CUDA error: out of memory

I have four V100s (16 GB each) available, and am using device #2:

print(device)
device(type='cuda', index=2)

nvidia-smi shows that I’m nowhere close to exhausting the memory of device #2 when the error happens:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:61:00.0 Off |                    0 |
| N/A   66C    P0   262W / 300W |  16139MiB / 16160MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   30C    P0    40W / 300W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   31C    P0    54W / 300W |   1063MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     74738      C   python                                     16128MiB |
|    2     60794      C   ...0/u62/ivoliv/.conda/envs/ONE/bin/python  1052MiB |
+-----------------------------------------------------------------------------+

Is this to be expected?

Could you try to run your code with CUDA_VISIBLE_DEVICES=2 python your_script.py, or alternatively set torch.cuda.set_device(2) right after import torch?
It might be that some CUDA context is still initialized on the default GPU (GPU 0) when you try to push your model to another device.
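For example, both approaches would look something like this (a minimal sketch; your_script.py and the tensor shape are just placeholders):

# Option 1: make only GPU 2 visible to the process (inside the process it
# then shows up as cuda:0):
#   CUDA_VISIBLE_DEVICES=2 python your_script.py

# Option 2: select the device before any CUDA work happens:
import torch
torch.cuda.set_device(2)

device = torch.device('cuda:2')
x = torch.randn(3, 3, device=device)  # should allocate on GPU 2 only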

Just to add to @ptrblck's answer: which version of PyTorch are you using? In versions <= 0.4.1 there was a bug where PyTorch allocated a small amount of memory on GPU 0 even when you did not use torch.cuda.set_device() or CUDA_VISIBLE_DEVICES=x. If you are using a version >= 0.5.0, I believe this bug is fixed.

Hi. Thanks for the replies. I wasn't having much luck with GPU 2, so I killed the process running on GPU 0 to see if that would free up memory for the job. Got this error:

RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:272

So I finally just let the process use the default GPU:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

and that uses the default GPU 0, and it works. BTW, the torch version is 0.4.1.post2.
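For completeness, the pattern that avoids the "different GPUs" error is to move both the model and every input tensor to the same device. A minimal sketch mirroring the style-transfer tutorial's setup (the random tensor stands in for a real image):

import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tutorial uses the VGG19 feature extractor; move it to the chosen device.
model = models.vgg19(pretrained=True).features.to(device).eval()

# Inputs must be created on (or moved to) the same device as the model.
content_img = torch.rand(1, 3, 256, 256, device=device)
output = model(content_img)  # all arguments now live on one GPU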

Thanks.

Do you push some parameters to a specific GPU?
Would it be possible to post some code so that we could have a look at where the device mix-up occurs?

If you have learnable parameters/dynamic buffers specific to your classes, or some tensors initialized statically inside your classes (e.g., self.tensor = torch.arange(10)), make sure to use register_buffer() or register_parameter() (depending on your need), so that DataParallel knows what to copy to the other GPUs.
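A minimal sketch of the buffer registration (the module and tensor here are hypothetical):

import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        # Bad: a plain attribute stays on its original device, so
        # DataParallel will not replicate it to the other GPUs:
        #   self.tensor = torch.arange(10.)
        # Good: a registered buffer is moved by .to(device) and copied
        # to each replica by DataParallel:
        self.register_buffer('tensor', torch.arange(10.))

    def forward(self, x):
        return x + self.tensor

model = nn.DataParallel(MyModule().to('cuda'))
out = model(torch.ones(4, 10, device='cuda'))  # buffer follows the replicas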