Single forwarding vs double forwarding the same input

“Why do a single forward pass and two forward passes of the same input, computing the same losses, make a difference?”

Suppose we have two loss terms for updating the model.
We normally forward the input a single time and compute the two loss terms from the same output. The gradients for each loss term are then added together into the model parameters, as in the code below (a small sketch after it spells out the accumulation).

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet50().cuda()
model.fc = nn.Identity()  # drop the classifier so the model outputs 2048-dim features
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(128, 3, 224, 224).cuda()
y = torch.randn(128, 2048).cuda()

# single forward, as usual
for _ in range(10):
    z1 = model(x)
    loss1 = torch.mean(z1)
    loss2 = F.mse_loss(z1, y)
    loss = loss1 + loss2
    print(f"loss1: {loss1.item():.4f}, loss2: {loss2.item():.4f}, loss: {loss.item():.4f}")
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
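
For reference, this is what I mean by the gradients for each loss term being added into the parameters: calling backward on the summed loss should match calling backward on each loss term separately and letting the gradients accumulate in .grad. A minimal sketch of that check, reusing model, x, y and optimizer from above (not part of the training loop):

# backward on the summed loss
z = model(x)
loss1 = torch.mean(z)
loss2 = F.mse_loss(z, y)
optimizer.zero_grad()
(loss1 + loss2).backward(retain_graph=True)
grads_sum = [p.grad.clone() for p in model.parameters()]

# backward per loss term: the second backward adds its gradients on top of the first
optimizer.zero_grad()
loss1.backward(retain_graph=True)
loss2.backward()
grads_acc = [p.grad.clone() for p in model.parameters()]

# expected to agree up to floating point accumulation order
print(max((a - b).abs().max().item() for a, b in zip(grads_sum, grads_acc)))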

Now suppose that, for some reason (although it is inefficient and not desirable), we forward the same input twice and compute each loss term from its own output, as in the code below.

# double forward
for _ in range(10):
    z1 = model(x)
    z2 = model(x)
    loss1 = torch.mean(z1)
    loss2 = F.mse_loss(z2, y)
    loss = loss1 + loss2
    print(f"loss1: {loss1.item():.4f}, loss2: {loss2.item():.4f}, loss: {loss.item():.4f}")
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

I thought both ways would accumulate the same gradients; however, they behave differently even though the random seed is fixed and CUDA determinism is enabled (roughly the setup sketched below).
Why do they give different results?
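
For completeness, this is roughly the seeding / determinism setup I mean (the exact flags are an assumption on my part and may vary with the PyTorch version):

import torch

torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# optionally, on recent versions:
# torch.use_deterministic_algorithms(True)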

(z1 - z2).abs().max() returns tensor(0., device='cuda:0', grad_fn=<MaxBackward1>) for me, so you would need to explain your use case and issue in more detail, and say what exactly you are comparing.
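
For example, rather than comparing the outputs, you could compare the gradients that each version accumulates. A rough sketch (assuming both runs start from identically seeded and identically initialized models; grads_single and grads_double are just names I made up):

# inside the single-forward loop, right after loss.backward():
grads_single = {n: p.grad.detach().clone() for n, p in model.named_parameters()}

# inside the double-forward loop, right after loss.backward() (same seed and init):
grads_double = {n: p.grad.detach().clone() for n, p in model.named_parameters()}

# then compare parameter by parameter
for name in grads_single:
    diff = (grads_single[name] - grads_double[name]).abs().max().item()
    if diff > 0:
        print(name, diff)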