Unknown source of out-of-memory error

Hi,

I’m implementing a network that, in the forward pass, does some computation with intermediate variables, which I delete afterwards using del. However, during the training phase, even with a batch size of 1, I get an out-of-memory error. I don’t know the cause, but I’m guessing it’s related to the gradients, because the validation phase runs without any error.

My code looks like:

C = my_subnetwork1(input_data)
# allocated memory: 9 GB | used memory: 0.1 GB
for i in range(5):
    D = some_arithmetic_computation(C, i)
    C = torch.softmax(D, 0)
    del D

return C

However, if I inspect the allocated and used memory at the end of each loop iteration, the used memory keeps increasing while in training mode, growing like this: 1 GB -> 1.97 GB -> 3.02 GB -> 4.2 GB.
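
For reference, the per-iteration check I mention above looks roughly like this (just a sketch; some_arithmetic_computation is the placeholder from my snippet, and I’m assuming the tensors live on a CUDA device):

for i in range(5):
    D = some_arithmetic_computation(C, i)
    C = torch.softmax(D, 0)
    del D
    # report CUDA memory at the end of each step, converted from bytes to GB
    used = torch.cuda.memory_allocated() / 1024**3      # memory occupied by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**3   # memory held by the caching allocator
    print(f"step {i}: used {used:.2f} GB | reserved {reserved:.2f} GB")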

Can you point me in the right direction to solve this?

Thank you in advance!

Depending on the operations used, the intermediate tensors might be needed to calculate the gradients, so you won’t be able to delete them unless you disable gradient calculation, e.g. by wrapping the code in a with torch.no_grad() block.
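
For the validation/inference path, that could look roughly like this (a sketch based on your snippet; my_subnetwork1, some_arithmetic_computation, and input_data are your placeholders):

with torch.no_grad():  # no autograd graph is recorded inside this block
    C = my_subnetwork1(input_data)
    for i in range(5):
        D = some_arithmetic_computation(C, i)
        C = torch.softmax(D, 0)
        del D  # D can actually be freed here, since autograd keeps no reference to it

During training, on the other hand, these intermediates are required for the backward pass, so the growth you observe inside the loop is expected there.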