Multiple loss function optimization

Hi Team, I have a loss function where I calculate the loss over multiple domains and sum the results to form the final loss, and for each epoch I perform gradient descent on it. Could you please let me know if the implementation is correct?

def train_AE():
    for epoch in range(epochs):
        train_loss = 0
        num_samples = 0
        losses = []

        for domain in range(domains):
            dataset = Dataset(dom=domain)
            dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
            num_samples += len(dataset)

            for input, out, targets in dataloader:
                input = input.float().to(device)
                dom_out = out.float().to(device)  # ground truth data
                targets = targets.long().to(device)

                optimizer.zero_grad()
                logits, recon_batch = multitaskAE(input)  # model forward pass
                loss_mse = customLoss()
                loss = loss_mse(recon_batch, dom_out, logits, targets)
                losses.append(loss)  # keeps each loss (and its graph) for the epoch-level sum

        # sum all per-batch losses across domains and backpropagate once per epoch
        final_loss = torch.sum(torch.stack(losses))
        final_loss.backward()
        train_loss += final_loss.item()
        optimizer.step()

        if epoch % 50 == 0:
            print('====> Epoch: {} Average loss: {:.4f}'.format(
                epoch, train_loss / num_samples))
            train_losses.append(train_loss / num_samples)  # train_losses is defined outside this snippet

@ptrblck, could you please take a look at the block of code above? I am trying to implement a loss over multiple domains, where the final loss is the sum of the per-domain losses. Is this the correct way to implement it, or is there a better way to do it? Thank you!

Your code looks correct, but it accumulates the computation graphs of the entire dataset, which will increase the memory usage. The final_loss.backward() call then frees all stored intermediate tensors and releases that memory again. If you run out of memory, you could instead call backward() on each loss inside the DataLoader loop and scale the gradients. While this would reduce the memory usage, the computation would be more expensive, since backward() would be called multiple times.
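A rough sketch of that alternative could look like this (the function name is just illustrative; everything else reuses the names from your snippet). Each batch loss is backpropagated immediately so its graph is freed, the gradients accumulate across batches, and they are scaled once per epoch to mimic averaging the losses; adjust the scaling to whatever reduction you actually want:

def train_AE_accum():
    for epoch in range(epochs):
        optimizer.zero_grad()
        train_loss = 0.0
        num_batches = 0

        for domain in range(domains):
            dataset = Dataset(dom=domain)
            dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

            for input, out, targets in dataloader:
                input = input.float().to(device)
                dom_out = out.float().to(device)
                targets = targets.long().to(device)

                logits, recon_batch = multitaskAE(input)
                loss = customLoss()(recon_batch, dom_out, logits, targets)
                loss.backward()  # frees this batch's computation graph right away
                train_loss += loss.item()
                num_batches += 1

        # scale the accumulated gradients, e.g. to match averaging the losses
        for p in multitaskAE.parameters():
            if p.grad is not None:
                p.grad /= num_batches

        optimizer.step()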


Thank you so much for clarifying this! I found that torch.mean brings the loss values into a different range, but the model performance does not degrade. So torch.mean can be used as well, right?

Yes, taking the mean of the final loss would also work, as it would just scale the loss and thus the gradients.
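As a quick toy check (purely illustrative, not taken from your training code), the gradients produced by the mean are exactly the gradients of the sum divided by the number of losses:

import torch

w = torch.randn(3, requires_grad=True)
losses = [(w * i).sum() for i in range(1, 4)]

# backward through the summed losses
torch.stack(losses).sum().backward(retain_graph=True)
grad_sum = w.grad.clone()

# backward through the mean of the same losses
w.grad = None
torch.stack(losses).mean().backward()
grad_mean = w.grad.clone()

print(torch.allclose(grad_mean, grad_sum / len(losses)))  # prints True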
