# Memory requirements for two similar computations

Something I always wondered about regarding how autograd works behind the scenes, and is relevant for me now. Assume I have a model that does

```python
# targets is a [batch, 10] matrix with the correct class ids, for 10 "codebooks".
logits = model(inputs)  # logits of shape [batch, 10, n_classes]
loss = 0
for i in range(10):
    loss += ce_loss(logits[:, i, :], targets[:, i], reduction="sum")
```

Is the memory requirement during forward / backward pass very different than doing:

```python
logits = model(inputs)
loss = ce_loss(logits, targets, reduction="sum")
```

Note that in this case there is obviously no reason not to use the latter version. In my actual situation it's a little more involved to get the logits and targets each into one tensor that would allow computing the loss in a single call, hence the question.
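For reference, one way to get the single-call form is to flatten the batch and codebook axes together. A minimal sketch, assuming `ce_loss` is `torch.nn.functional.cross_entropy` and using random stand-in tensors instead of a real model:

```python
import torch
import torch.nn.functional as F

batch, n_codebooks, n_classes = 4, 10, 32
logits = torch.randn(batch, n_codebooks, n_classes)
targets = torch.randint(0, n_classes, (batch, n_codebooks))

# Loop version: one cross-entropy call per codebook.
loss_loop = sum(
    F.cross_entropy(logits[:, i, :], targets[:, i], reduction="sum")
    for i in range(n_codebooks)
)

# Single-call version: merge (batch, codebook) into one axis.
loss_flat = F.cross_entropy(
    logits.reshape(-1, n_classes), targets.reshape(-1), reduction="sum"
)

assert torch.allclose(loss_loop, loss_flat)
```

With `reduction="sum"` the two are numerically identical, since the sum over codebooks and the sum over the flattened axis add up the same per-element losses.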

I may be mistaken, of course. Mathematically it is equivalent, but computationally it is not.

When you have a batched tensor and you backprop, you create a single graph and aggregate the gradients over the batch.

When you have a for loop, you are creating N copies of the graph and then aggregating over them. While the result is the same (the gradients are summed together), with the for loop you still need to do the exercise of backpropagating through the N graphs and storing all the necessary info.

In practice it’s like calling a siamese network.

However, d(logits)/d(params) is shared among those graphs, so I assume it is computed only once in both cases?
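A quick sanity check on that intuition: when `logits` is computed once and only sliced inside the loop, both formulations produce identical parameter gradients, since autograd accumulates the per-codebook gradients into `logits` and then backprops through the model once. A minimal sketch, with a hypothetical linear layer standing in for `model` and `torch.nn.functional.cross_entropy` standing in for `ce_loss`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, n_codebooks, n_classes, d_in = 4, 10, 32, 8
inputs = torch.randn(batch, d_in)
targets = torch.randint(0, n_classes, (batch, n_codebooks))

# Stand-in "model": one linear map producing all codebook logits at once.
w = torch.randn(d_in, n_codebooks * n_classes, requires_grad=True)

# Loop version: logits computed once, sliced per codebook.
logits = (inputs @ w).reshape(batch, n_codebooks, n_classes)
loss = sum(
    F.cross_entropy(logits[:, i, :], targets[:, i], reduction="sum")
    for i in range(n_codebooks)
)
loss.backward()
grad_loop = w.grad.clone()
w.grad = None

# Single-call version over the flattened logits.
logits = (inputs @ w).reshape(batch, n_codebooks, n_classes)
loss = F.cross_entropy(
    logits.reshape(-1, n_classes), targets.reshape(-1), reduction="sum"
)
loss.backward()
grad_single = w.grad

assert torch.allclose(grad_loop, grad_single, atol=1e-5)
```

The extra cost of the loop here is the N small per-slice loss subgraphs, not N copies of the model's graph: the model forward (and its saved activations) exists only once in both versions.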