“Why do a single forward pass and a double forward pass on the same input give different results when computing the same losses?”

Suppose we have two loss terms for updating the model.

Normally, we forward the input once and compute both loss terms from the same output. The gradients of the two loss terms are then summed into each parameter's `.grad`, as in the code below.

```
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

model = models.resnet50().cuda()
model.fc = nn.Identity()  # drop the classifier so the output is the 2048-d feature
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(128, 3, 224, 224).cuda()
y = torch.randn(128, 2048).cuda()

# single forward, as usual: both losses share one output
for _ in range(10):
    z1 = model(x)
    loss1 = torch.mean(z1)
    loss2 = F.mse_loss(z1, y)
    loss = loss1 + loss2
    print(f"loss1: {loss1.item():.4f}, loss2: {loss2.item():.4f}, loss: {loss.item():.4f}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Now suppose that, for some reason (even though it is inefficient and undesirable), we forward the same input twice and compute each loss term from a separate output, as in the code below.

```
# double forward: each loss gets its own forward pass over the same input
for _ in range(10):
    z1 = model(x)
    z2 = model(x)
    loss1 = torch.mean(z1)
    loss2 = F.mse_loss(z2, y)
    loss = loss1 + loss2
    print(f"loss1: {loss1.item():.4f}, loss2: {loss2.item():.4f}, loss: {loss.item():.4f}")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

I expected both approaches to accumulate the same gradients. However, they behave differently, even though the random seed is fixed and CUDA deterministic mode is on.
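To narrow this down, one option is to compare the gradients of a single step directly instead of watching the printed losses drift over 10 steps. The following is only a minimal sketch under assumptions that are not part of my actual setup: a small hypothetical stand-in network (`base`) instead of ResNet-50, CPU instead of CUDA, and float64 so that rounding error is as small as possible. The helper names `grads_single` and `grads_double` are made up for illustration.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical small stand-in for the ResNet-50 (assumption, not the real model).
# It keeps a BatchNorm layer, since ResNet-50 contains BatchNorm as well.
base = nn.Sequential(
    nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 4)
).double()
x = torch.randn(32, 8, dtype=torch.float64)
y = torch.randn(32, 4, dtype=torch.float64)


def grads_single(model):
    """One forward pass; both loss terms use the same output."""
    z1 = model(x)
    loss = torch.mean(z1) + F.mse_loss(z1, y)
    model.zero_grad()
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]


def grads_double(model):
    """Two forward passes over the same input; one loss term per output."""
    z1 = model(x)
    z2 = model(x)
    loss = torch.mean(z1) + F.mse_loss(z2, y)
    model.zero_grad()
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]


# Each variant starts from an identical copy of the same initial weights.
g_single = grads_single(copy.deepcopy(base))
g_double = grads_double(copy.deepcopy(base))

# Largest element-wise gap between the two sets of parameter gradients.
max_diff = max((a - b).abs().max().item() for a, b in zip(g_single, g_double))
print(f"max abs gradient difference: {max_diff:.3e}")
```

Whatever gap this reports isolates the single-step effect from anything that builds up across the 10 optimizer steps. Running the same comparison in float32 on the GPU would show how much of that gap depends on precision.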

Why do the two approaches give different results?