Memory overflowing with a larger dataset even though the training batch size is 1

Hi Pytorch Team,

I have written down an over-simplified form of my framework.

When I call the “framework_eval_func” function, a single input is moved to the GPU in each iteration. The network then performs its computations on the GPU, and the evaluation loss is calculated from the output. What I expected was that the amount of GPU memory consumed would stay the same regardless of the dataset size, since I only ever work with one example at a time on the GPU. However, when I increased the size of the dataset, I ran out of memory.

A few things I have already checked are:

  1. Use “with torch.no_grad():” to avoid gradient computation
  2. Use “loss_vect.append(loss.cpu().item())” to avoid keeping references to tensors in GPU memory
  3. Use “x.detach()” to avoid building a computation graph

I am now out of ideas. I wanted to know if this is an expected behavior.

class framework():

    def __init__(self, net):

        self.net = net.cuda()
        .
        .

    def framework_eval_func(self, x, y):

        # Number of samples in the data
        N = x.shape[0]
        Ids = np.arange(N)

        # Shuffle the set of indices
        np.random.shuffle(Ids)

        # Evaluate the entire validation set
        loss_vect = []
        with torch.no_grad():
            for index in Ids:
                bx = x[index, :]  # the input
                by = y[index, :]  # the labels

                # move it to the GPU
                bx = bx.cuda()  # <~~~ a single example is moved
                by = by.cuda()  # <~~~ to the GPU at each iteration

                loss = eval_loss(bx.detach(), by.detach(), self.net)
                loss_vect.append(loss.cpu().item())

        loss_avg = np.mean(loss_vect)
        return loss_avg

    def framework_train_func(self, x, y):
        .
        .
.
.

You are right that the GPU memory usage would depend on the data on the GPU only and should not be influenced by a dataset in the host RAM.
Could you post a minimal, executable code snippet showing this behavior with a random dataset, so that we could debug it, please?
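For reference, a minimal template along these lines would already be enough; the model, the tensor shapes, and the eval_loss below are only placeholders for your actual code:

import numpy as np
import torch
import torch.nn as nn

# placeholder model and loss; swap in your actual net and eval_loss
net = nn.Linear(128, 10).cuda()

def eval_loss(bx, by, net):
    return nn.functional.mse_loss(net(bx), by)

# random dataset kept in host RAM; increase N to reproduce the growth you see
N = 10000
x = torch.randn(N, 128)
y = torch.randn(N, 10)

loss_vect = []
with torch.no_grad():
    for i, index in enumerate(np.random.permutation(N)):
        bx = x[index, :].cuda()  # move a single example to the GPU
        by = y[index, :].cuda()
        loss_vect.append(eval_loss(bx, by, net).cpu().item())
        if i % 1000 == 0:
            # the allocated memory should stay flat across iterations
            print(torch.cuda.memory_allocated() / 1024**2)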

Dear ptrblck,

Thank you for replying. After you said that the GPU usage should not change, I commented out the part marked below as problematic, and indeed the memory usage no longer changed.

Due to the nature of the code and its dependence on the dataset, I feel it would be very difficult to provide you with an executable snippet. Part of the reason is that the net relies on various external dependencies.

The code first applies a CNN to the input and then runs conjugate gradient steps on the CNN output. This conjugate gradient step needs an operator, “EncObj”. Building this operator requires csm and toep, which depend on the input data.

class net(nn.Module):

    def __init__(self, EncObj, CNN, ..other parameters..):
        super().__init__()
        self.cnn = CNN
        self.EncObj = EncObj
        self.opcomp = None

    def ConjGrad(self, x, ..other parameters..):
        .
        x = self.EncObj.apply_AdagA_Toeplitz(x)
        .
        .

    def forward(self, x):

        # apply the CNN
        xm = self.cnn(x)

        #**********************  PROBLEMATIC PART **************************************
        # perform conjugate gradient on the cnn output
        csm = self.opcomp[0].cuda()
        toep = self.opcomp[1]  # a list of CPU tensors which need to be put on the GPU
        for j in range(len(toep)):
            toep[j] = toep[j].cuda()
        self.EncObj.NUFFT = MriSenseNufft(smap=csm, ..other parameters..)        # an imported function
        self.EncObj.AdjNUFFT = AdjMriSenseNufft(smap=csm, ..other parameters..)  # an imported function
        self.EncObj.ToepNUFFT = ToepSenseNufft(csm)                              # an imported function
        self.EncObj.AdagA_toep_kernel_list = toep
        xm = self.ConjGrad(xm, ..other parameters..)
        #*******************************************************************************

        return xm

# Calling the net from some other evaluating code
.
with torch.no_grad():
    for index in Ids:
        bx = x[index,:]    # the input
        by = y[index,:]    # the labels
        bop = op[index]    # data dependent operator
        net.opcomp = bop
        l = loss(net(bx.detach()), by)
        .
        .
    .
    .
.
.

I tried deleting the variables csm and toep after performing the conjugate gradient steps, because we do not need them anymore until the next iteration, where a new csm and toep are initialized anyway. It did not help, and this problematic snippet kept accumulating GPU memory. If there is a blatant mistake I am making, please let me know.

Update:

I found the part of the above code that was causing the problem, as marked. “toep” is a list of CPU tensors. In order to use it in my code, I had to move its elements to the GPU. I could not find an elegant way to do this, so I just looped over the list. For some reason, this variable kept accumulating GPU memory and not releasing it. I tried deleting it, but that did not help.

.
.
        #**********************  PROBLEMATIC PART **************************************
        .
        .
        .
        for j in range(len(toep)):      # <~~ this part was accumulating
            toep[j] = toep[j].cuda()    # <~~ GPU memory
        .
        .
        .
        .
        .
        #*******************************************************************************
.
.

I could not find an elegant solution, but what worked was moving this variable back to the CPU:

 .
 .
        #**********************  PROBLEMATIC PART **************************************
        .
        .
        .
        for j in range(len(toep)):      # <~~ this part was accumulating
            toep[j] = toep[j].cuda()    # <~~ GPU memory
        .
        .
        for j in range(len(toep)):      # <~~ I solved it by
            toep[j] = toep[j].cpu()     # <~~ taking it back to the CPU
        .
        .
        #*******************************************************************************
.
.

However, I still have questions. Why was this variable accumulating GPU memory? And please let me know if there is a better alternative to achieve the same task.

I’m not sure if you were seeing the increased GPU memory usage in each training iteration, or where exactly toep is initialized. An increase in GPU memory usage would be expected from this loop alone:

for j in range(len(toep)):
    toep[j] = toep[j].cuda()

since you are moving data to the device.
However, removing toep should release the memory again as seen here:

toep = [torch.randn(1024) for _ in range(1024)]

print(torch.cuda.memory_allocated()/1024**2)
# > 0.0

for j in range(len(toep)):
    toep[j] = toep[j].cuda()
    
# 1024 * 1024 * 4 = 4MB expected
print(torch.cuda.memory_allocated()/1024**2)
# > 4.0

# free it
del toep
print(torch.cuda.memory_allocated()/1024**2)
# > 0.0

Are you increasing the length of toep somehow, or storing references to the tensors somewhere else, which could prevent them from being freed?
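For illustration, here is a small sketch (with made-up sizes) of the second case, where another object holds a reference to the same list and therefore keeps the CUDA tensors alive:

import torch

toep = [torch.randn(1024) for _ in range(1024)]
storage = {"op": toep}  # another container referencing the same list object

for j in range(len(toep)):
    toep[j] = toep[j].cuda()

print(torch.cuda.memory_allocated() / 1024**2)
# > 4.0

# deleting the name `toep` does not free the memory, since storage["op"]
# is the same list and still references the CUDA tensors
del toep
print(torch.cuda.memory_allocated() / 1024**2)
# > 4.0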

As far as I know, I am not changing the dimensions of toep. The dimensions of this list in my case are shown below, and I observed them to be uniform across all input data.

len(toep): 30
toep[0].shape: torch.Size([12, 2, 320, 320])

I am not aware of consciously storing references to this variable that would prevent its deletion. However, I am fairly confident that moving this variable back to the CPU, as shown above, really solves the problem (or works around it).

This is my workflow:

The operator we apply in the CG iterations depends on the data. It is computed just before we start the training process. But because we are limited by GPU memory, we store the components of this operator in CPU memory and only move them to the GPU for the respective batch of data when we run our network.

# Creating our operator
for i in range(total_data):
    csm = load_coil_map(csmsfolder, csmfname)
    csm = torch.tensor(csm).cuda()
    enc_obj = Dyn2DRadEncObj(csm, .. other parameters ..)
    for j in range(len(enc_obj.AdagA_toep_kernel_list)):
        enc_obj.AdagA_toep_kernel_list[j] = enc_obj.AdagA_toep_kernel_list[j].cpu()
    op[i] = [csm.cpu(), enc_obj.AdagA_toep_kernel_list]

Next, we call our net during our training process. Notice we set the opcomp attribute of the net with the respective op index.

with torch.no_grad():
    for index in Ids:
        bx = x[index,:]    # the input
        by = y[index,:]    # the labels
        bop = op[index]    # data dependent operator
        net.opcomp = bop
        l = loss(net(bx.detach()), by)
        .
        .
    .
    .
.

Inside the network we assign it to csm and toep just before doing the conjugate gradient. After that, we backpropagate to learn the network weights.

for mb_indx in range(mini_batch_size):
    csm = self.opcomp[0].cuda()
    toep = self.opcomp[1]
    .
    xm = self.ConjGrad(xm, ..other parameters..)

.

I am sorry if it sounds like I am complicating things, but our problem is that the CG function at the moment only works with a mini-batch size of 1. For a mini-batch size greater than one, we loop over the conjugate gradient calls, and self.opcomp is the list of operator components corresponding to the required batch.

This would mean that you would only free the GPU memory used by all tensors stored in toep, no?
Could you check the allocated memory before and after pushing these tensors to the CPU and compare it to the expected memory release?
If it fits the tensor shapes, I don’t see any unexpected issues.
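A sketch of such a check, assuming float32 tensors with the shapes you posted (the stand-in list below mimics 30 tensors of shape [12, 2, 320, 320], i.e. roughly 281 MB):

import torch

# stand-in for your toep list right after the .cuda() loop
toep = [torch.randn(12, 2, 320, 320, device='cuda') for _ in range(30)]

before = torch.cuda.memory_allocated()

for j in range(len(toep)):
    toep[j] = toep[j].cpu()

after = torch.cuda.memory_allocated()

# expected release: 30 * 12 * 2 * 320 * 320 * 4 bytes ~= 281.25 MB
expected = 30 * 12 * 2 * 320 * 320 * 4 / 1024**2
print('freed {:.2f} MB, expected ~{:.2f} MB'.format((before - after) / 1024**2, expected))

As a side note: if toep is the same list object that is also stored in op, then replacing its elements in-place with CUDA tensors keeps them referenced from op across iterations, which would explain the accumulation; building a new local list instead (e.g. toep_gpu = [t.cuda() for t in toep]) and assigning that to EncObj might avoid the problem without the extra copy back to the CPU.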