How to avoid CUDA out of memory

I use item() and del out1 to avoid CUDA out of memory, but it still occurs.
Are there any tips for avoiding out of memory?
Here is my training flow.

optimizer.zero_grad()
out1 = network(input)
loss1 = criterion(out1, target1)
loss += loss1.item()
del out1

out2 = network(input)
loss2 = criterion(out2, target2)
loss += loss2.item()
del out2

loss.backward()

Did you already reduce the batch size?

Sure, but I don’t want to decrease the batch size any further.

What is your input? Images?

If yes, you can resize them.
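
For example (a hedged sketch assuming a torchvision image pipeline; the target size is made up), shrinking the spatial size of the inputs shrinks every activation map and can cut memory substantially:

from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((128, 128)),  # assumed target size; smaller inputs -> smaller activations
    transforms.ToTensor(),
])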

I printed memory_allocated.
The memory for loss1 and loss2 accumulates until the end.
After getting loss1, can I free this memory before getting loss2?

optimizer.zero_grad()
out1 = network(input)
loss1 = criterion(out1, target1)
loss += loss1.item()
del out1
print(torch.cuda.memory_allocated(0)) # 2713555968

out2 = network(input)
loss2 = criterion(out2, target2)
loss += loss2.item()
del out2
print(torch.cuda.memory_allocated(0)) # 5277173248

loss.backward()

The memory from the del operation doesn’t return to the device.
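
A side note, assuming the standard PyTorch caching allocator: del only drops the Python reference, and even truly freed blocks stay cached by PyTorch for reuse rather than being handed back to the device. You can observe the difference with memory_reserved, and release the unused cache with empty_cache (a sketch, not from the original posts):

import torch

print(torch.cuda.memory_allocated(0))  # bytes held by live tensors
print(torch.cuda.memory_reserved(0))   # bytes held by the caching allocator
torch.cuda.empty_cache()               # return unused cached blocks to the device
print(torch.cuda.memory_reserved(0))   # usually lower now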

You need to use the same variable for both outputs and del that variable.

I think that you can do this:

optimizer.zero_grad()
out = network(input)
loss1 = criterion(out, target1)
del out
out = network(input)
loss2 = criterion(out, target2)
del out
loss = loss1 + loss2
loss.backward()
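
A variant worth noting (my own sketch, not from the reply above): calling .backward() on each loss right away frees that loss’s graph immediately, and the gradients simply accumulate in the parameters’ .grad fields, so peak memory can be lower than when summing the losses first:

optimizer.zero_grad()
out = network(input)
loss1 = criterion(out, target1)
loss1.backward()                 # frees the graph of this forward pass
running_loss = loss1.item()
del out, loss1

out = network(input)
loss2 = criterion(out, target2)
loss2.backward()                 # gradients accumulate into .grad
running_loss += loss2.item()
del out, loss2

optimizer.step()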

You have to reduce your batch size. But reducing the batch size further might affect model training performance, so in this case you should use the gradient accumulation technique, with which your model’s loss with a small batch size converges as if the batch size were larger.

Let’s say you want to set your batch size to original_batch_size, but you only get stable training without any “cuda out of memory” at a smaller batch size new_batch_size. Then you can train your model in the following way (for example, with original_batch_size = 64 and new_batch_size = 16, accumulations = 4):

# clear gradients from the last step
optimizer.zero_grad()
accumulations = original_batch_size // new_batch_size  # integer number of accumulation steps

scaled_loss = 0
for accumulated_step_i in range(accumulations):
    x, y = next(data_iterator)   # one mini-batch of size new_batch_size (loading is assumed)
    out = model(x)
    loss = some_loss(out, y)
    loss.backward()              # gradients accumulate across the accumulation steps
    scaled_loss += loss.item()

# now update the model... so the effective batch size will be new_batch_size * accumulations
optimizer.step()

# scaled_loss is now summed over the accumulated batches
actual_loss = scaled_loss / accumulations

More about it here


@braindotai
Thanks for the answer. Is this the correct way?
Does actual_loss need .backward()?

optimizer.zero_grad()
accumulations = 2

scaled_loss = 0
for accumulated_step_i in range(accumulations):
    total_loss = 0
    for idx, (data, target) in enumerate(train_loader):
        out1 = network(data)
        loss1 = criterion(out1, target1)
        total_loss += loss1

        out2 = network(data)
        loss2 = criterion(out2, target2)
        total_loss += loss2
        total_loss.backward() # backward here!

        scaled_loss += total_loss.item()

optimizer.step()

# actual_loss = scaled_loss / accumulations ?
# actual_loss.backward() ?

No need to call .backward() for actual_loss; in fact, there is no need to calculate it at all, as it isn’t affecting training.
Though in your case I think the training loop should look like this:

accumulations = 2
scaled_loss = 0
optimizer.zero_grad()
epochs = 20
training_steps_losses = []
for epoch in range(epochs):
    for idx, (data, target) in enumerate(train_loader):
        out1 = network(data)
        loss1 = criterion(out1, target1)

        out2 = network(data)
        loss2 = criterion(out2, target2)

        # Keep the losses as tensors here; calling .item() would detach them
        # from the graph and total_loss.backward() below would fail.
        total_loss = loss1 + loss2

        total_loss = total_loss / accumulations
        total_loss.backward()
        # Here you will calculate gradients.
        # In the usual case we would call optimizer.step() right after this. But not in this case.
        # We are dividing total_loss by accumulations in order to have the same scale of gradients
        # before calling optimizer.step()
        
        scaled_loss += total_loss.item() # not required for training, it's only used to monitor loss as we update the parameters

        # In this case we will only call optimizer.step() when batch index (idx) + 1
        # is divisible by accumulations.
        # The main idea is that we call .backward() accumulations number of times;
        # doing this adds up the gradients for all the parameters
        # (since we are not calling optimizer.zero_grad() every time we call total_loss.backward()).
        # And after that we call optimizer.step() followed by optimizer.zero_grad().

        if (idx + 1) % accumulations == 0:
            optimizer.step()
            optimizer.zero_grad()
            training_steps_losses.append(scaled_loss) # no need to divide scaled_loss here since we are already scaling the total_loss via dividing it by accumulations.
            scaled_loss = 0.0

# And after training is done you can plot a graph between training iterations and losses:
import matplotlib.pyplot as plt

plt.plot(training_steps_losses, label = 'Training Loss') # this is the only use of scaled_loss
plt.xlabel('Training Iteration')
plt.ylabel('Loss')
plt.legend()
plt.show()

Once again: calculate the loss and parameter gradients for accumulations consecutive batches, adding up the parameters’ gradients each time, and only then update the parameters.
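
To see why dividing by accumulations gives the same gradients as one large batch, here is a minimal self-contained check (a sketch with an assumed toy model, not from the original posts):

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()
x = torch.randn(8, 4)
y = torch.randn(8, 1)

# One large batch: a single backward over all 8 samples.
model.zero_grad()
criterion(model(x), y).backward()
big_grad = model.weight.grad.clone()

# Gradient accumulation: two micro-batches of 4, each loss divided by accumulations.
accumulations = 2
model.zero_grad()
for xb, yb in zip(x.chunk(accumulations), y.chunk(accumulations)):
    (criterion(model(xb), yb) / accumulations).backward()

print(torch.allclose(big_grad, model.weight.grad))  # True (up to float error)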

And sorry for not including these explanations in the first place.
