Suddenly I noticed that the virtual memory usage is huge during my training. The size grows as soon as the first tensor is moved to the GPU. Also, I noticed that using more GPUs is now much slower than training on a single GPU, and I am almost certain that, before this, training on 2 GPUs was almost 2 times faster than using one GPU.
I am using Ubuntu, PyTorch 1.0.1 and CUDA 10.
Here is the htop output while training on one GPU:
PyTorch's caching allocator claims GPU memory and holds on to it even when it is not actively in use, so if you run nvidia-smi it can show that no memory is free.
Refer to the memory management docs for more details.
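As a minimal sketch of what the caching allocator does (assuming a CUDA-capable machine; note that memory_cached() was later renamed memory_reserved() in newer PyTorch releases), you can compare what is actually allocated with what is merely cached:

import torch

cuda = torch.device('cuda')
x = torch.empty((1000, 1000), device=cuda)

# Bytes occupied by live tensors vs. bytes held by the caching allocator.
print(torch.cuda.memory_allocated(cuda))
print(torch.cuda.memory_cached(cuda))

del x
# The tensor is freed, but the cached memory stays reserved,
# so nvidia-smi keeps reporting it as used.
print(torch.cuda.memory_allocated(cuda))
print(torch.cuda.memory_cached(cuda))

# Return cached blocks to the driver (does not free memory held by live tensors).
torch.cuda.empty_cache()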
Excuse me, I am not able to understand the answer. nvidia-smi behaves normally: the GPUs I selected show the same memory usage as before.
The simplest code that produces three lines in htop, each showing more than 20 GB of virtual memory, is here:
import torch
import time

cuda = torch.device('cuda')
# The first tensor created on the GPU initializes the CUDA context,
# which is when the large virtual memory mapping shows up in htop.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
# Keep the process alive long enough to inspect it in htop.
time.sleep(5)