While developing a model intended to be trained to optimize multiple targets at once, I realized I could construct the output in a couple of different ways. I’m curious whether they are equivalent in terms of gradient calculation, or whether one may be better than the other in terms of convergence…
Assume, for simplicity, that I have 2 targets.
Option 1:
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.input_layer = torch.nn.Linear(1024, 16)
        self.out_layer_1 = torch.nn.Linear(16, 1)
        self.out_layer_2 = torch.nn.Linear(16, 1)

    def forward(self, feature_data):
        x = self.input_layer(feature_data)
        out_1 = self.out_layer_1(x)
        out_2 = self.out_layer_2(x)
        return (out_1, out_2)
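For context, here is a minimal sketch of how I train Option 1 (the loss functions, optimizer, and random data are just placeholders, assuming the per-target losses are simply summed into one scalar):

```python
import torch

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.input_layer = torch.nn.Linear(1024, 16)
        self.out_layer_1 = torch.nn.Linear(16, 1)
        self.out_layer_2 = torch.nn.Linear(16, 1)

    def forward(self, feature_data):
        x = self.input_layer(feature_data)
        return self.out_layer_1(x), self.out_layer_2(x)

model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

features = torch.randn(8, 1024)
target_1 = torch.randn(8, 1)
target_2 = torch.randn(8, 1)

out_1, out_2 = model(features)
# Summing the per-target losses gives one scalar, so a single
# backward() call accumulates gradients from both heads into
# the shared input_layer.
loss = (torch.nn.functional.mse_loss(out_1, target_1)
        + torch.nn.functional.mse_loss(out_2, target_2))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```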
Option 2:
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.input_layer = torch.nn.Linear(1024, 16)
        self.out_layer = torch.nn.Linear(16, 2)

    def forward(self, feature_data):
        x = self.input_layer(feature_data)
        out = self.out_layer(x)
        return out
Obviously, with Option 2 I would need to slice the output to get my two results. But I’m wondering whether this makes a difference with regard to the optimizer and the backpropagation of the loss.
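One way to check the equivalence directly is to initialize Option 2’s combined head from Option 1’s two separate heads (each row of a `Linear(16, 2)` weight matrix is one head), run a backward pass on both with the per-target losses summed, and compare gradients. A minimal sketch (MSE losses and random data are assumptions, not anything specific to my model):

```python
import torch

torch.manual_seed(0)

# Option 1: two separate one-unit heads.
head_1 = torch.nn.Linear(16, 1)
head_2 = torch.nn.Linear(16, 1)

# Option 2: one combined head, initialized with the same
# parameters so the comparison is apples-to-apples.
combined = torch.nn.Linear(16, 2)
with torch.no_grad():
    combined.weight.copy_(torch.cat([head_1.weight, head_2.weight], dim=0))
    combined.bias.copy_(torch.cat([head_1.bias, head_2.bias], dim=0))

x = torch.randn(4, 16)
target = torch.randn(4, 2)

# Option 1: separate outputs, losses summed before backward().
loss_1 = (torch.nn.functional.mse_loss(head_1(x), target[:, 0:1])
          + torch.nn.functional.mse_loss(head_2(x), target[:, 1:2]))
loss_1.backward()

# Option 2: slice the combined output the same way.
out = combined(x)
loss_2 = (torch.nn.functional.mse_loss(out[:, 0:1], target[:, 0:1])
          + torch.nn.functional.mse_loss(out[:, 1:2], target[:, 1:2]))
loss_2.backward()

# Row 0 of the combined gradient matches head_1's gradient,
# row 1 matches head_2's.
assert torch.allclose(combined.weight.grad[0:1], head_1.weight.grad)
assert torch.allclose(combined.weight.grad[1:2], head_2.weight.grad)
```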