Multi-task learning of a branched model

I have a model whose forward pass is as follows (pseudo-code):

x = stem(x)
o1 = branch1(x)
o2 = branch2(x)
return o1, o2
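
Spelled out as a module, that layout might look like the following sketch (the layer sizes and head types are placeholders inferred from the losses used below, not stated in the post):

import torch.nn as nn

class BranchedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical layer sizes; only the stem/branch structure comes from the post.
        self.stem = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
        self.branch1 = nn.Linear(64, 1)    # regression head (trained with MSE)
        self.branch2 = nn.Linear(64, 10)   # classification head (trained with cross-entropy)

    def forward(self, x):
        x = self.stem(x)
        o1 = self.branch1(x)
        o2 = self.branch2(x)
        return o1, o2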

I am using this to jointly train on two datasets as follows (pseudo-code):

for (x1, y1), (x2, y2) in zip(data1, data2):
    optimizer.zero_grad()

    o1, _ = model(x1)
    l1 = mseloss(o1, y1)

    _, o2 = model(x2)
    l2 = crossentropy(o2, y2)

    total_loss = w1*l1 + w2*l2
    total_loss.backward()
    optimizer.step()

I have a common optimizer with different learning rates for the stem and the two branches.
This methodology should work according to 11648.
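
For reference, the per-module learning rates would be set up via optimizer parameter groups; a minimal sketch, assuming the BranchedModel above (the learning-rate values are made up, not from the post):

import torch

model = BranchedModel()
# Hypothetical learning rates; the actual values are not stated in the post.
optimizer = torch.optim.Adam([
    {"params": model.stem.parameters(),    "lr": 1e-3},
    {"params": model.branch1.parameters(), "lr": 1e-4},
    {"params": model.branch2.parameters(), "lr": 1e-4},
])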
But I tried changing the order of evaluation as follows (pseudo-code):

for (x1, y1), (x2, y2) in zip(data1, data2):
    optimizer.zero_grad()

    _, o2 = model(x2)
    l2 = crossentropy(o2, y2)

    o1, _ = model(x1)
    l1 = mseloss(o1, y1)

    total_loss = w1*l1 + w2*l2
    total_loss.backward()
    optimizer.step()

This led to different results from the previous version. Any idea why that might be happening?

P.S. All random seeds are fixed to maintain reproducibility across runs.

If your stem contains batchnorm layers, note that their running estimates are updated during each forward pass, so swapping the order of the two forward passes changes how these estimates evolve. The difference should be visible during evaluation, although the stats should eventually converge if the input samples are drawn from the same distribution and the batches are sufficiently large.
Also, dropout layers might be applied differently if, e.g., the batch sizes differ.
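
The batchnorm effect can be reproduced in isolation; here is a minimal sketch (the feature size, batch shapes, and mean offsets are made up) that forwards two batches through a BatchNorm1d layer in both orders and compares the resulting running means:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two hypothetical batches with different statistics, standing in for data1 and data2.
x1 = torch.randn(32, 8) + 1.0
x2 = torch.randn(32, 8) - 1.0

bn_a = nn.BatchNorm1d(8)
bn_b = nn.BatchNorm1d(8)

# Order 1: x1 then x2 (both layers are in training mode by default).
bn_a(x1)
bn_a(x2)

# Order 2: x2 then x1.
bn_b(x2)
bn_b(x1)

# The exponential moving average of the batch statistics is order-dependent,
# so the running means differ between the two orders.
print(torch.allclose(bn_a.running_mean, bn_b.running_mean))  # expected: False

The same holds for running_var; only after many batches from the same distribution would the two estimates come close again.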

Could you explain in a bit more detail what issues you are seeing, where the difference is visible, and how large it is?