There doesn't seem to be much existing literature on using one loss vs. multiple losses for a single input.

One of my architecture's two outputs, "Output 2", is a stack of N images.

Right now, I use a single loss function to supervise Output 2 (the stack of N images), and I have tested several options. One example is calculating a sub-loss for each of the N images and then applying reduction='mean', which returns a single loss2 per batch.

So currently, I do the following:

out = network(x)
# per-sample losses, computed in a loop over the 32 samples in the batch
loss = torch.stack([per_sample_loss(o, t) for o, t in zip(out, target)])  # loss.shape == (32, 2)
loss1 = loss[:, 0].mean()  # for output1
loss2 = loss[:, 1].mean()  # for output2
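As a runnable sketch of the same idea, the per-sample losses can usually be computed without an explicit loop if the criterion supports reduction='none' (the shapes and the two-output split here are illustrative, not my real model):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(reduction="none")  # keep one loss per sample

# illustrative shapes: batch of 32, 5-class prediction per output head
out1 = torch.randn(32, 5, requires_grad=True)
out2 = torch.randn(32, 5, requires_grad=True)
t1 = torch.randint(0, 5, (32,))
t2 = torch.randint(0, 5, (32,))

loss1 = criterion(out1, t1).mean()  # mean over the batch for output1
loss2 = criterion(out2, t2).mean()  # mean over the batch for output2
(loss1 + loss2).backward()
```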

But I have been wondering whether the model would backpropagate differently if I explicitly defined loss2 as N separate losses, and whether that could increase accuracy:

out = network(x)
# per-sample losses, computed in a loop over the 32 samples in the batch
loss = torch.stack([per_sample_loss(o, t) for o, t in zip(out, target)])  # loss.shape == (32, 11)
loss1 = loss[:, 0].mean()   # for output1
loss2 = loss[:, 1].mean()   # for output2 img 1
loss3 = loss[:, 2].mean()   # for output2 img 2
...
loss11 = loss[:, 10].mean() # for output2 img 10

I'm not sure how to interpret the "increase accuracy" part, so could you explain it a bit more?
Generally, calling backward on a reduced loss vs. indexing the loss and calling backward per sample yields the same gradients once you account for gradient accumulation (i.e. scale each indexed loss by 1/N), as seen here:

import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18()
model.eval()
criterion = nn.CrossEntropyLoss(reduction='none')
x = torch.randn(3, 3, 224, 224)
target = torch.randint(0, 1000, (3,))
# single loss
out = model(x)
loss = criterion(out, target)
loss.mean().backward()
print(model.conv1.weight.grad.abs().sum())
# > tensor(369.1813)
# loss indexing
model.zero_grad()
out = model(x)
loss = criterion(out, target)
for idx in range(loss.size(0)):
    loss[idx].backward(retain_graph=True)
print(model.conv1.weight.grad.abs().sum() / x.size(0))
# > tensor(369.1814)
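The same equivalence can be verified exactly on a small model without torchvision, by scaling each per-sample loss by 1/N before calling backward (a minimal sketch; the linear model and shapes are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss(reduction="none")
x = torch.randn(8, 4)
target = torch.randint(0, 3, (8,))

# 1) single reduced loss
out = model(x)
criterion(out, target).mean().backward()
grad_mean = model.weight.grad.clone()

# 2) per-sample backward calls, each scaled by 1/N, accumulated into .grad
model.zero_grad()
out = model(x)
loss = criterion(out, target)
for idx in range(loss.size(0)):
    (loss[idx] / loss.size(0)).backward(retain_graph=True)
grad_loop = model.weight.grad.clone()

print(torch.allclose(grad_mean, grad_loop, atol=1e-6))  # True
```

Gradients are linear in the loss, so summing N scaled per-sample backward passes matches a single backward pass on the mean.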

I mentioned "increasing accuracy" because output1 is a single image trained with a variant of cross entropy, and it works perfectly: high Jaccard index, etc.

Hence, my intuition for output2 (the stack of N images) was to merge the stack into a single image and fit a similar cross-entropy variant (like the one used for output1). But the Jaccard index is low, with lots of false positives.

So this makes me wonder whether treating output2 with N individual cross-entropy losses, one per image in the stack, would improve the Jaccard index.
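For concreteness, here is how I would sketch the per-image variant, assuming output2 has shape (B, N, C, H, W) with per-pixel class logits for each of the N images (the shapes are made up for illustration and may not match my real data):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
B, N, C, H, W = 2, 4, 3, 8, 8  # batch, stack size, classes, spatial dims
out2 = torch.randn(B, N, C, H, W, requires_grad=True)
target2 = torch.randint(0, C, (B, N, H, W))

# one cross entropy per image in the stack, then combine into a single scalar
per_image = [criterion(out2[:, n], target2[:, n]) for n in range(N)]
loss2 = torch.stack(per_image).mean()
loss2.backward()
```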

I hope this clarifies my thoughts a bit more! Thank you!