CUDA 10.2 Out of memory

I run a model on a GTX 1080 Ti with CUDA 10.2 and PyTorch 1.5. When I synthesize audio output, I use with torch.no_grad(), torch.backends.cudnn.deterministic = False, torch.backends.cudnn.benchmark = False, torch.cuda.set_device(0), torch.cuda.empty_cache(), and os.system("sudo rm -rf ~/.nv"), but GPU memory still increases. It grows by about 10 MiB each time until it runs out of memory.
Can you help me solve this problem? Thank you very much.

If I understand the issue correctly, your memory usage is increasing in each iteration.
This might happen if you are storing tensors that are still attached to the computation graph, e.g. in a list.
Often you would like to append the loss to a list in order to calculate the mean for the epoch.
Since the loss tensor is attached to the computation graph, you would also store the complete graph in each iteration, which might eventually yield the OOM issue.

To detach the tensor properly, you could use:

losses.append(loss.cpu().detach().item())
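
For illustration, here is a minimal sketch of the pattern using a hypothetical toy model and random data (not your setup): storing the loss tensor itself keeps its graph alive across iterations, while storing the Python float does not.

import torch
import torch.nn as nn

# Hypothetical toy model and random data, only to illustrate the pattern.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

losses = []
for _ in range(100):
    data = torch.randn(32, 10, device='cuda')
    target = torch.randn(32, 1, device='cuda')

    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()

    # losses.append(loss)        # keeps the loss attached to its graph -> memory grows
    losses.append(loss.item())   # stores a plain Python float instead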

The memory increase should not come from the options you’ve posted.

The problem occurs during synthesis, not training, so I don't think it is related to the loss.

Could you post your code, so that we can have a look?
None of the mentioned settings should change the memory behavior and of course shouldn’t create / avoid a memory leak.

Here is my code to synthesize audio from mel spectrogram
pad_fn = torch.nn.ReplicationPad1d(
    self.config_gan["generator_params"].get("aux_context_window", 0)).to(torch.device("cuda"))
# Generate
with torch.no_grad():
    c = self.scaler.transform(mels[0])
    x = ()
    z = torch.randn(1, 1, len(c) * self.config_gan["hop_size"]).to(torch.device("cuda"))
    x += (z,)
    b = torch.from_numpy(c).unsqueeze(0).transpose(2, 1).to(torch.device("cuda"))
    c = pad_fn(b)
    x += (c,)
    y = self.model_gan(*x).view(-1).cpu().numpy()
y = y[:len(y) - 3000]
del z
torch.cuda.empty_cache()
del b
torch.cuda.empty_cache()
del c
torch.cuda.empty_cache()
out = io.BytesIO()
audio.save_wav(y, out, sr=hparams.sample_rate)
del y
torch.cuda.empty_cache()
del pad_fn
torch.cuda.empty_cache()
gc.collect()
with open('fmemory.txt', 'a') as f:
    f.writelines('del all ' + str(torch.cuda.memory_allocated() / 1024**2) + '\n')  # 5.27490234375
    f.writelines(str(torch.cuda.memory_cached() / 1024**2) + '\n')  # 6.0
self.model_gan.remove_weight_norm()
return out.getvalue()

Are you able to reproduce the memory increase using random data? If so, could you post the input data shape as well as all other shapes that would be necessary to reproduce this issue?

Here is the memory usage (in MiB) that I logged while synthesizing a text with 3 sentences:
Sentence 1:
before inference: allocated 5.10791015625, cached 6.0
inference: allocated 225.40185546875, cached 970.0
after inference: allocated 6.48193359375, cached 26.0
del all: allocated 6.48193359375, cached 26.0

Sentence 2:
before inference: allocated 5.10791015625, cached 6.0
inference: allocated 95.818359375, cached 398.0
after inference: allocated 5.6748046875, cached 6.0
del all: allocated 5.6748046875, cached 6.0

Sentence 3:
before inference: allocated 5.10791015625, cached 6.0
inference: allocated 31.6435546875, cached 122.0
after inference: allocated 5.27490234375, cached 6.0
del all: allocated 5.27490234375, cached 6.0

The memory doesn’t seem to grow in each iteration or am I missing something?
Depending on the size of your input, the current iteration might need more memory than the previous one, but the memory footprint seems to go down as expected after deleting the tensors.

Yes, the allocated and cached memory do not change, but the GPU memory usage still increases until it runs out of memory.

How can the GPU yield an OOM, if the allocated and cached memory is reduced in each iteration?
Are you hitting an “extremely large” input sample, which might be too big for your device?

Maybe it is a bug in PyTorch 1.5. I switched to PyTorch 1.0.1 and it works fine, without leaking GPU RAM as in PyTorch 1.5.

I’m still unsure how to interpret this statement:

How does the memory increase, if the allocated and cached memory is not changed?
Are you seeing the increase only via e.g. nvidia-smi?
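
If it helps, here is a minimal sketch for logging both views side by side around each synthesis call (assuming nvidia-smi is on the PATH and the PyTorch device index matches the nvidia-smi ordering):

import subprocess
import torch

def log_gpu_memory(tag, device=0):
    # PyTorch's view: memory used by tensors and held by the caching allocator.
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    cached = torch.cuda.memory_cached(device) / 1024**2  # memory_reserved() in newer releases
    # Driver's view: total used memory on the device as reported by nvidia-smi,
    # which includes the CUDA context and cuDNN workspaces on top of the
    # caching allocator (plus any other processes using this GPU).
    smi = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'])
    used = int(smi.decode().splitlines()[device])
    print(f'{tag}: allocated {allocated:.2f} MiB, cached {cached:.2f} MiB, nvidia-smi {used} MiB')

If the nvidia-smi number keeps growing while the allocated and cached values stay flat, the growth is happening outside the caching allocator.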

Yes, I see the increase in nvidia-smi.

Could you post a minimal, executable code snippet, which would show this behavior, so that we could debug it, please?

You can see here: https://github.com/kan-bayashi/ParallelWaveGAN/issues/160#issuecomment-639850145

The code is unfortunately not executable, as you are using private data.
Could you post an executable code snippet using random input data?
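Something along these lines would already be enough; the generator below is just a placeholder standing in for your actual ParallelWaveGAN model:

import torch
import torch.nn as nn

# Placeholder network standing in for the real generator; swap in the actual
# model to reproduce the reported behavior.
model = nn.Sequential(
    nn.Conv1d(80, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=3, padding=1),
).cuda().eval()

for i in range(10):
    # Random "mel" input with a varying number of frames, mimicking different sentences.
    frames = torch.randint(200, 1000, (1,)).item()
    c = torch.randn(1, 80, frames, device='cuda')

    with torch.no_grad():
        y = model(c).view(-1).cpu().numpy()

    print(f'iter {i}: allocated {torch.cuda.memory_allocated() / 1024**2:.2f} MiB, '
          f'cached {torch.cuda.memory_cached() / 1024**2:.2f} MiB')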