CUDA out of memory error with recurrent in-place slicing

Issue description

It seems that PyTorch has a memory leak when doing in-place slicing with tensors that require gradients. In the code below I sample some data, multiply the sample by a single parameter, and then keep only the largest values (as in a k-beam search). Autograd does not seem to drop the references properly, and the process eventually runs out of memory.

Code example

This code runs out of memory on a K80 in GCP

import torch as tr
import os

device = tr.device("cuda:0" if tr.cuda.is_available() else "cpu")
print ('device:', device)

pars = tr.tensor([0.5],requires_grad=True,device=device)

data = tr.arange(1e9,device=device)

def sample(x,sample_size=int(1e5)):
	# draw sample_size random elements from x (indices are created on x's device)
	sample_idx = tr.randint(high=x.size()[0],size=(sample_size,),dtype=tr.long,device=x.device)
	return x[sample_idx]

def leaveTopK(x,k):
	_,idx = tr.sort(x,descending=True)
	x = x[idx]
	return x[:k]

#memory leak loop
out = tr.tensor([],device=device)
for i in range(int(1e4)):
	# append the newly scaled samples; the result depends on the previous out
	out = tr.cat([out,pars*sample(data)])
	out = leaveTopK(out,int(1e5))
	if i%1000 == 0:
		os.system('nvidia-smi -q --display=MEMORY')

It should not run out of memory, since the size of out never exceeds 1e5 in the example, but it seems that autograd keeps a reference to the sliced-out part of the out tensor. With requires_grad=False it does not run out of memory.

System Info

Collecting environment information…
PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Debian GNU/Linux 9.5 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 390.46
cuDNN version: Probably one of the following:

Versions of relevant libraries:
[pip3] numpy (1.12.1)
[pip3] torch (0.4.0)
[pip3] torchvision (0.2.1)
[conda] pytorch 0.4.1 py37_cuda9.0.176_cudnn7.1.2_1 pytorch

I installed pytorch by installing anaconda.

I think this issue might be affecting many people, and fixing it would improve the capabilities of PyTorch. Does anybody have an idea of how to solve it?

This is your setup doing it. I don’t think there is much that PyTorch can do for you here, because your operation is not in-place.
The result of pars*sample requires grad. In turn, the cat produces a tensor that requires grad and refers to pars*sample and the previous out as inputs. Then leaveTopK indexes into that result and has it as its input, and so on. So the graph, and whatever it saved for backward, grows with every iteration.
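To make the chaining visible, here is a minimal sketch (not from the original post) that counts the autograd nodes reachable from out after a number of iterations of the same cat / sort / slice pattern. The helper names count_graph_nodes and chained are hypothetical, the sizes are shrunk so it runs on CPU, and the exact node counts may differ between PyTorch versions; the point is only that the count keeps growing with the number of steps.

```python
import torch

pars = torch.tensor([0.5], requires_grad=True)

def count_graph_nodes(t):
    """Count autograd nodes reachable from t.grad_fn (hypothetical helper)."""
    seen = set()
    stack = [t.grad_fn]
    while stack:
        node = stack.pop()
        if node is None or node in seen:
            continue
        seen.add(node)
        # next_functions holds (node, input_index) pairs for the node's inputs
        stack.extend(fn for fn, _ in node.next_functions)
    return len(seen)

def chained(steps, n=100, k=100):
    """Same pattern as the question: each new out is computed from the old out."""
    out = torch.tensor([])
    for _ in range(steps):
        out = torch.cat([out, pars * torch.rand(n)])
        _, idx = torch.sort(out, descending=True)
        out = out[idx][:k]
    return out

if __name__ == "__main__":
    for steps in (1, 10, 100):
        print(steps, "steps ->", count_graph_nodes(chained(steps)), "graph nodes")
```

The tensor out itself stays at k elements, but the graph hanging off it does not shrink, which matches the memory growth seen in the question.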
One way around that is to keep the indices into pars and work with those and the samples (which do not require grad). Then, in each iteration, rebuild out = cat(pars[idxes-of-samples-you-kept] * samples-you-kept, pars * new_samples). It’s a bit more elaborate, but it allows PyTorch autograd to forget about the previous step (which it can only do if you restart from pars rather than from something computed from pars).
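A rough sketch of that idea, under some assumptions: since pars has a single element here, pars[idxes] is just pars, so it is enough to keep the raw samples (which do not require grad) and multiply them by pars again in every iteration. The function name run_top_k, the use of torch.topk instead of sort plus slice, and the reduced sizes are illustrative choices, not part of the original code.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
pars = torch.tensor([0.5], requires_grad=True, device=device)
data = torch.arange(1e6, device=device)  # much smaller than in the question

def sample(x, sample_size):
    idx = torch.randint(high=x.size(0), size=(sample_size,), device=x.device)
    return x[idx]

def run_top_k(steps=20, k=1000, sample_size=1000):
    kept = torch.empty(0, device=device)  # raw samples only, no grad history
    out = None
    for _ in range(steps):
        candidates = torch.cat([kept, sample(data, sample_size)])
        # Rebuild out directly from pars, so the graph is only pars -> mul -> index
        out = pars * candidates
        # Choose the top-k positions on a detached view; the indices need no grad
        _, top = torch.topk(out.detach(), min(k, out.numel()))
        kept = candidates[top]  # remember which raw samples survived
        out = out[top]          # the previous out (and its graph) is dropped here
    return kept, out

if __name__ == "__main__":
    kept, out = run_top_k()
    out.sum().backward()
    print(out.shape, pars.grad)
```

Because each iteration starts again from pars and the grad-free kept tensor, the previous iteration's graph has nothing referring to it once out is reassigned, so it can be freed.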

Best regards


P.S.: You didn’t ask, but I think it’s canonical to just call the torch module torch.

Thank you Thomas, this is very useful. I thought that autograd would keep track of each row separately, but it actually keeps the whole input tensor in memory, which is what I was intending to avoid.