Hi Everyone,
I have the following line of code in PyTorch:

W = torch.matmul(mask, V_scaled_diag)

Let us assume that the mask’s dimension is 2304x768; similarly, V_scaled_diag’s dimension is 768x2304 and W is also of dimension 2304x768. The memory required to do this computation should be:

Mem = 3x2304x768x4 Bytes (assuming Float32) + x Bytes

where x Bytes are required to store the gradients, intermediate tensors etc.

Given the size of matrices, is there an easy way to estimate what this extra usage (value of x) might be?

Also, is it possible to check the occupied memory by each tensor in the computation above?
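One way to check this is sketched below. It multiplies each tensor's element size by its element count. The helper name `tensor_bytes` is made up for illustration, and the random inputs are stand-ins for the real `mask` and `V_scaled_diag`. Note that with the shapes as stated, `torch.matmul` returns a 2304x2304 result rather than 2304x768, so `W` is larger than the two inputs.

```python
import torch


def tensor_bytes(t: torch.Tensor) -> int:
    """Bytes taken by the tensor's elements (hypothetical helper, ignores allocator padding)."""
    return t.element_size() * t.nelement()


# Stand-in data with the shapes from the question; float32 is the default dtype.
mask = torch.rand(2304, 768)
V_scaled_diag = torch.rand(768, 2304)
W = torch.matmul(mask, V_scaled_diag)

for name, t in [("mask", mask), ("V_scaled_diag", V_scaled_diag), ("W", W)]:
    mib = tensor_bytes(t) / 2**20
    print(f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}, {mib:.2f} MiB")
```

On a GPU, `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` report what the caching allocator currently holds and its peak, which can be compared before and after the operation.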

I would like to understand better how PyTorch allocates memory for intermediate and backprop tensors so that I can justify one implementation over another. Do you have any recommendations?

Intermediate activations are stored only if a differentiable computation graph is created (at least one input requires gradients and gradient mode is enabled) and these activations are needed to compute gradients during the backward call. The derivatives.yaml file in the PyTorch source (tools/autograd/derivatives.yaml) defines which tensors each operation needs.
So the extra usage depends on your actual use case, and a small example would be great to have.
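As a rough illustration of which tensors autograd keeps for the matmul in the question, here is a minimal sketch using `torch.autograd.graph.saved_tensors_hooks` (available in recent PyTorch releases). It assumes both inputs require gradients and uses random stand-in data; the list `saved_bytes` and the hook names are made up for this example.

```python
import torch

saved_bytes = []  # sizes of the tensors autograd saves for backward


def pack_hook(t: torch.Tensor) -> torch.Tensor:
    # Called each time an operation saves a tensor for the backward pass.
    saved_bytes.append(t.element_size() * t.nelement())
    return t


def unpack_hook(t: torch.Tensor) -> torch.Tensor:
    # Called when backward retrieves the saved tensor; returned unchanged here.
    return t


# Assumption: both inputs require gradients.
mask = torch.rand(2304, 768, requires_grad=True)
V_scaled_diag = torch.rand(768, 2304, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    W = torch.matmul(mask, V_scaled_diag)

print(W.grad_fn)
print(f"saved tensors: {len(saved_bytes)}, total {sum(saved_bytes) / 2**20:.2f} MiB")
```

For a plain 2D matmul the saved tensors should be the two inputs themselves, so they are references to memory that already exists rather than new copies. The extra memory then comes mainly from the output `W` and, after `backward()`, from one gradient buffer per input that requires gradients. If only one input requires gradients, fewer tensors are saved.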