I have a model whose forward pass is as follows (pseudocode):

```
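def forward(self, x):
    x = self.stem(x)        # shared layers used by both branches
    o1 = self.branch1(x)    # output for the first dataset
    o2 = self.branch2(x)    # output for the second dataset
    return o1, o2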
```

I am using this to train jointly on two datasets as follows (pseudocode):

```
optimizer.zero_grad()
for (x1, y1), (x2, y2) in zip(data1, data2):
    o1, _ = model(x1)                # forward pass on dataset 1 first
    l1 = mseloss(o1, y1)
    _, o2 = model(x2)                # then forward pass on dataset 2
    l2 = crossentropy(o2, y2)
    total_loss = w1*l1 + w2*l2       # weighted sum of both losses
    total_loss.backward()
    optimizer.step()
```

I have a common optimizer with different learning rates for the stem and the two branches.
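A setup like this is typically expressed with per-parameter-group options. The following is a minimal sketch only: the choice of `SGD`, the layer sizes, and the learning-rate values are placeholders for illustration, not the actual configuration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real stem and branches.
stem = nn.Linear(4, 8)
branch1 = nn.Linear(8, 1)
branch2 = nn.Linear(8, 3)

# One optimizer, three parameter groups, each with its own learning rate.
# The values below are placeholders.
optimizer = torch.optim.SGD(
    [
        {"params": stem.parameters(), "lr": 1e-3},
        {"params": branch1.parameters(), "lr": 1e-2},
        {"params": branch2.parameters(), "lr": 5e-3},
    ],
    lr=1e-3,       # default for any group that does not set "lr"
    momentum=0.9,  # shared by all groups
)

# Inspect the learning rate actually attached to each group.
group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```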

According to 11648, this approach should work.

However, I then tried changing the order of evaluation as follows (pseudocode):

```
optimizer.zero_grad()
for (x1, y1), (x2, y2) in zip(data1, data2):
    _, o2 = model(x2)                # forward pass on dataset 2 first
    l2 = crossentropy(o2, y2)
    o1, _ = model(x1)                # then forward pass on dataset 1
    l1 = mseloss(o1, y1)
    total_loss = w1*l1 + w2*l2       # same weighted sum as before
    total_loss.backward()
    optimizer.step()
```

This led to different results than the previous version. Any idea why that might be happening?

P.S. All random seeds are fixed to maintain reproducibility across runs.
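One way to narrow this down is to check whether stochastic layers are involved. The sketch below is not the actual model: `TwoHeadModel`, `forward_both`, the layer sizes, and the `Dropout` layer in the stem are all assumptions made for illustration. It checks whether, with an identical seed, swapping the two forward passes changes the outputs in training mode (where each batch would draw a different dropout mask) and in eval mode (where dropout is disabled).

```python
import torch
import torch.nn as nn

class TwoHeadModel(nn.Module):
    """Hypothetical stem + two branches; the Dropout layer is an assumption."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5))
        self.branch1 = nn.Linear(8, 1)
        self.branch2 = nn.Linear(8, 3)

    def forward(self, x):
        x = self.stem(x)
        return self.branch1(x), self.branch2(x)

def forward_both(order, train, seed=0):
    """Run both forward passes in the given order from a fixed seed."""
    torch.manual_seed(seed)          # same weights and inputs on every call
    model = TwoHeadModel()
    model.train(train)               # train=False disables dropout
    x1 = torch.randn(5, 4)
    x2 = torch.randn(5, 4)
    if order == "x1_first":
        o1, _ = model(x1)
        _, o2 = model(x2)
    else:
        _, o2 = model(x2)
        o1, _ = model(x1)
    return o1.detach(), o2.detach()

# Compare the branch-1 output across the two orderings.
train_a, _ = forward_both("x1_first", train=True)
train_b, _ = forward_both("x2_first", train=True)
eval_a, _ = forward_both("x1_first", train=False)
eval_b, _ = forward_both("x2_first", train=False)

print("train mode, same o1 in both orders:", torch.equal(train_a, train_b))
print("eval mode,  same o1 in both orders:", torch.equal(eval_a, eval_b))
```

If the real model shows the same pattern (order-dependent in training mode, order-independent in eval mode), random-number consumption by layers such as dropout would be a plausible explanation. Other sources, such as batch-norm running statistics or non-deterministic GPU kernels, are not covered by this sketch.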