Multiple loss backward

a = F.cross_entropy(output, label)
b = F.nll_loss(output, label)
a.backward(retain_graph=True)
b.backward()

The memory used by CUDA keeps increasing until the GPU runs out of memory.

However

a = F.cross_entropy(output, label)
a += F.nll_loss(output, label)
a.backward()

With this version, the problem described above does not appear.

The peak memory usage would be higher in the first example, since the gradients for each parameter are computed a second time (once per backward call) and accumulated into the already populated .grad attributes.
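
The accumulation itself can be checked directly. Below is a minimal sketch (not from the original post; the tiny tensors and the two scaled losses are just illustrative assumptions) showing that two backward calls add the second set of gradients into the existing .grad, which matches a single backward on the summed loss:

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

out = (w * x).sum()
a = out * 2.0
b = out * 3.0

# two backward calls: the second one accumulates into the existing w.grad
a.backward(retain_graph=True)
b.backward()
grad_two_calls = w.grad.clone()

# a single backward on the summed loss yields the same accumulated gradient
w.grad = None
out = (w * x).sum()
(out * 2.0 + out * 3.0).backward()
print(torch.allclose(grad_two_calls, w.grad))  # expected: True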
Besides that, the allocated memory should be equal, as seen here:

import torch
import torch.nn.functional as F
import torchvision.models as models

device = 'cuda'
model = models.resnet50().to(device)
x = torch.randn(1, 3, 224, 224, device=device)
label = torch.randint(0, 1000, (1,), device=device)
output = model(x)

# toggle between the two approaches to compare their memory usage
if False:
    # two separate backward calls
    a = F.cross_entropy(output, label)
    b = F.nll_loss(output, label)
    a.backward(retain_graph=True)
    b.backward()
else:
    # single backward call on the summed loss
    a = F.cross_entropy(output, label)
    a += F.nll_loss(output, label)
    a.backward()

print(torch.cuda.memory_summary())
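
If you want a single number instead of the full summary, one possible variation (my own sketch, not part of the original snippet; the run helper and the placement of torch.cuda.reset_peak_memory_stats are assumptions) is to reset the peak statistics right before the loss computation and compare the peak and currently allocated memory for both approaches:

import torch
import torch.nn.functional as F
import torchvision.models as models

def run(two_backwards):
    device = 'cuda'
    model = models.resnet50().to(device)
    x = torch.randn(1, 3, 224, 224, device=device)
    label = torch.randint(0, 1000, (1,), device=device)
    output = model(x)

    # only measure the peak from this point on
    torch.cuda.reset_peak_memory_stats()
    if two_backwards:
        a = F.cross_entropy(output, label)
        b = F.nll_loss(output, label)
        a.backward(retain_graph=True)
        b.backward()
    else:
        a = F.cross_entropy(output, label)
        a += F.nll_loss(output, label)
        a.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated(), torch.cuda.memory_allocated()

for two_backwards in (True, False):
    peak, allocated = run(two_backwards)
    print(f'two_backwards={two_backwards}: peak={peak / 1024**2:.2f}MB, '
          f'allocated={allocated / 1024**2:.2f}MB')

The peak value should come out larger for the two-backward variant, while the allocated value should be roughly the same for both, which matches the explanation above.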