Hi,
In my case, I calculate the loss with the following code:

loss = 0
for A, B, C in sequence_of_tensors:
    loss += ((A - B)**2 * k1 * (A > C).float() +
             (A - B)**2 * k2 * (A < C).float() +
             (A - B)**2 * k3 * (A == C).float()).sum()

A, B, and C are tensors. When I put all the tensors on the GPU, memory usage is about 5GB. But as soon as the computation begins, I get a CUDA out-of-memory error. My GPU has 12GB of memory, which I would think is enough, and the intermediate results of the computation should not be larger than the original tensors.
I can't understand why this computation is so memory-expensive. I'd appreciate any help.

I'd say it depends on the length of sequence_of_tensors.
Are all the tensors in sequence_of_tensors the same shape? If so, you could try computing the loss batch-wise rather than list-wise.
I would also store A - B in a variable: since you perform that op three times, I think it creates three separate graph nodes (each holding its own intermediate result for the backward pass).
So, in short, I would try something like:

A, B, C = stacked_sequence_of_tensors
tmp = (A - B)**2
loss = (tmp * k1 * (A > C).float() +
        tmp * k2 * (A < C).float() +
        tmp * k3 * (A == C).float()).sum()
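For concreteness, here is a runnable sketch of the stacked version (shapes, the list length, and the weights k1/k2/k3 are made up for illustration; I'm assuming all triples share the same shape so they can be stacked):

```python
import torch

k1, k2, k3 = 1.0, 2.0, 3.0  # example scalar weights (assumed)
# Stand-in for your sequence_of_tensors: 8 triples of same-shape tensors.
seq = [(torch.randn(4), torch.randn(4), torch.randn(4)) for _ in range(8)]

# Stack the triples into three batched tensors of shape (8, 4).
A = torch.stack([a for a, _, _ in seq])
B = torch.stack([b for _, b, _ in seq])
C = torch.stack([c for _, _, c in seq])

tmp = (A - B)**2  # computed once, reused three times
loss = (tmp * k1 * (A > C).float() +
        tmp * k2 * (A < C).float() +
        tmp * k3 * (A == C).float()).sum()
```

This gives the same scalar as the Python loop, but builds one graph over the stacked tensors instead of one per list element.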

Thanks! Using tmp alleviates the problem. I'm still surprised by the memory usage of the calculation: even the first iteration of the loop runs out of memory. It seems to be more complicated than just storing the intermediate results.
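For what it's worth, since exactly one of the three conditions (>, <, ==) holds per element, you can also select the weight per element with torch.where instead of summing three masked products, which creates fewer intermediate tensors for autograd to keep around (a sketch, assuming k1/k2/k3 are scalars):

```python
import torch

k1, k2, k3 = 1.0, 2.0, 3.0  # example scalar weights (assumed)
A = torch.randn(8, 4, requires_grad=True)
B = torch.randn(8, 4)
C = torch.randn(8, 4)

# Pick one weight per element: k1 where A > C, k2 where A < C, else k3.
w = torch.where(A > C, torch.full_like(A, k1),
                torch.where(A < C, torch.full_like(A, k2),
                            torch.full_like(A, k3)))
loss = ((A - B)**2 * w).sum()
```

Same value as the three-term sum, with a single squared-difference term in the graph.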