Criterion with multiple inputs

I’m not sure if this is the right place to post this question, but I’m trying to do something like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(NeuralNet, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size//2),
            nn.ReLU(),
            nn.Linear(hidden_size//2, 1),
        )
    
    def forward(self, x1, x2):
        out1 = self.layers(x1)
        out2 = self.layers(x2)
        out1 = F.sigmoid(out1)
        out2 = F.sigmoid(out2)
        return out1, out2

model = NeuralNet(input_size, hidden_size)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In the training loop:

output1, output2 = model(input1, input2)
if output1 > output2:
    label = torch.ones([1, 1])
else:
    label = torch.zeros([1, 1])
loss = criterion(output1, output2, label)

optimizer.zero_grad()
loss.backward()
optimizer.step()

Doing something like below does not seem to work:

criterion = nn.BCELoss()

output = output1 > output2
output.requires_grad = True
loss = criterion(output, label)

optimizer.zero_grad()
loss.backward()
optimizer.step()
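
For what it’s worth, my best guess at why it fails: the > comparison returns a boolean tensor with no gradient history, so the loss has no path back to the model’s parameters. A quick check with made-up scores (not tied to any model):

import torch

out1 = torch.tensor([0.7], requires_grad=True)
out2 = torch.tensor([0.4], requires_grad=True)

comp = out1 > out2
print(comp.dtype)    # torch.bool -- the comparison is not differentiable
print(comp.grad_fn)  # None -- the result is detached from the graph
# comp.requires_grad = True  # this even raises an error, since boolean tensors can't require gradients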

Is there a custom loss function that does something like this? Or is BCELoss not a good choice of loss function in this case? I’m trying to train the model to do something like ranking given two inputs.

I don’t know if this will work as you expect. The network is treating input1 and input2 as two independent inputs. While operating on one, it has no knowledge of the other.

Could you explain what the exact goal of your network is? That’ll help us get an idea of what you’re trying to achieve here.

Thank you for the reply. I am trying to have a model that can predict the relative ranking of the data samples in a dataset. For example, given a list of customers who are subscribed to a channel, predict the relative order of canceling the subscription.

Is there a specific architecture here that you’re trying to mimic?

No, I am just trying to build a plain DNN model.

Okay, I’m not sure if this exact approach will work.

The way I’m understanding the network as you laid out, is that you want the network to map your input to a real value in the range [0, 1.0]. It doesn’t make sense to me that you decide the labels based off the model output.

Currently, if input1 was mapped to a larger value, then we are encouraging the network to map both of them to larger values; otherwise, if input2 was larger, we’re encouraging it to map both of them to smaller values. Either way, the label just mirrors what the network already predicted, so it never gets any signal about the true ordering.

I’m not greatly familiar with this specific task; however, if I were to go about doing this, I would have an encoder network that takes in an input and returns an embedding x. Then I would have another network that takes in two embeddings and outputs a value that, after a sigmoid, gets mapped to [0, 1.0]. If this value is closer to 1.0, then the first embedding should be ranked higher; otherwise, the second embedding should be ranked higher.

Would a quick sketch of your description look something like this?

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Encoder, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size//2),
            nn.ReLU(),
            nn.Linear(hidden_size//2, output_size),
        )
    
    def forward(self, x):
        out = self.layers(x)
        return out


class Classifier(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Classifier, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size//2),
            nn.ReLU(),
            nn.Linear(hidden_size//2, 1),
        )
    
    def forward(self, x):
        out = self.layers(x)
        out = F.sigmoid(out)  # map the raw score to [0, 1]
        return out

Training loop:

criterion = nn.BCELoss()
encoder_net = Encoder(input_size, hidden_size, output_size)
classifier_net = Classifier(input_size, hidden_size)

for e in range(num_epochs):
    for idx, (data, target) in enumerate(data_loader):
        data1, data2 = data
        output1 = encoder_net(data1)
        output2 = encoder_net(data2)
        if output1 > output2:
            predicted_label = torch.ones([1,1])
        else:
            predicted_label = torch.zeros([1,1])
        
        loss = criterion(predicted_label, target)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

I am still unsure if the criterion in the above “pseudocode” is structured correctly. That is, is it correct to have criterion(predicted_label, target) where criterion = nn.BCELoss()? Would the gradient back-propagate through this loss? If so, how is this different from the one-model architecture I had in my original post? Also, in this case, would I have two optimizers, one per network, and call optimizer1.zero_grad(), optimizer2.zero_grad(), optimizer1.step(), optimizer2.step()?

Thanks a lot!!

This is getting closer, but that conditional is still throwing me off. I’ll use the network described in your message.

criterion = nn.BCELoss()
encoder_net = Encoder(input_size, hidden_size, output_size)
classifier_net = Classifier(2 * output_size, hidden_size)  # I'm allocating room for 2 tensors of the same size!

for data, target in data_loader:
    data1, data2 = data

    # first we encode each datapoint into some latent representation
    output1 = encoder_net(data1)  # N x output_size
    output2 = encoder_net(data2)  # N x output_size

    # We'll then concatenate these together so that the classifier network can see/operate on them both at the same time
    # Otherwise the two datapoints are always treated independently by the network.
    output = torch.cat((output1, output2), dim=1)  # N x (output_size * 2)
    pred = classifier_net(output)  # N x 1, in the range of [0, 1].

    # I'm assuming that 'target' is an N x 1 float tensor containing a 1 if data1 should be ranked higher, otherwise a 0.
    loss = criterion(pred, target)
    
    # Optimization step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Your “target” is what’s getting me confused here. We need to know if data1 or data2 should be ranked higher; in my example, I’m assuming that target contains this information. Let me know if this is closer to what you’re looking for.
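
As a rough sketch of what I mean, the dataset itself could supply that label. Everything here (the features, rank, and pairs fields) is hypothetical and just for illustration, e.g. rank could be the known number of days until a customer cancelled:

import torch
from torch.utils.data import Dataset

class PairDataset(Dataset):
    def __init__(self, features, rank, pairs):
        self.features = features  # N x input_size float tensor (hypothetical)
        self.rank = rank          # length-N ground-truth ordering, e.g. days until cancellation (hypothetical)
        self.pairs = pairs        # list of (i, j) index pairs to compare (hypothetical)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        i, j = self.pairs[idx]
        # target is 1.0 if sample i should be ranked higher than sample j, otherwise 0.0
        target = torch.tensor([1.0]) if self.rank[i] > self.rank[j] else torch.tensor([0.0])
        return (self.features[i], self.features[j]), target

With a DataLoader over this, data unpacks into (data1, data2) and target has the N x 1 shape that BCELoss expects.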


I can definitely see why the conditionals are confusing, the way I used them along with target. That was a mistake on my part. Instead of getting target from the dataloader, I was trying to figure it out using the conditionals: if data1 should be ranked higher, target = 1, else 0. Is this a valid approach?

Thank you for your help. It makes sense. I just want to ask you a final question. When having multiple networks such as in this case, is it recommended that I have the same number of optimizers and do something like

optimizer1 = torch.optim.Adam(encoder_net.parameters(), lr=learning_rate)
optimizer2 = torch.optim.Adam(classifier_net.parameters(), lr=learning_rate)

optimizer1.zero_grad()
optimizer2.zero_grad()
optimizer1.step()
optimizer2.step()

or combine them into one:

optimizer = torch.optim.Adam(
    list(encoder_net.parameters()) + list(classifier_net.parameters()),
    lr=learning_rate,
)

optimizer.zero_grad()
optimizer.step()

If you could determine that with some comparison on data1 and data2, then I would agree. But if the answer were that easy to determine, we wouldn’t need deep learning for this.

One optimizer or several doesn’t really matter in this case. The rationale for multiple optimizers is mainly to allow different hyperparameters for each network (namely the learning rate).
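
For what it’s worth, even different learning rates don’t strictly require two optimizers; a single optimizer also accepts parameter groups (the learning rates here are just placeholders):

optimizer = torch.optim.Adam([
    {"params": encoder_net.parameters(), "lr": 1e-3},
    {"params": classifier_net.parameters(), "lr": 1e-4},
])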

In fact, in my example I depicted the problem as utilizing two separate networks, but there’s no reason why you couldn’t just shove it all under one network instead.
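
A rough sketch of that single-network version, reusing the layer sizes from before (the class name is just for illustration):

import torch
import torch.nn as nn

class PairwiseRanker(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(PairwiseRanker, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2 * output_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, x1, x2):
        # encode both inputs with the same encoder, then score the concatenated pair
        emb = torch.cat((self.encoder(x1), self.encoder(x2)), dim=1)
        return torch.sigmoid(self.classifier(emb))

Then model(data1, data2) replaces the two separate calls, and a single optimizer over model.parameters() covers both pieces.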

Thank you. I need a DNN because I need to apply the trained model to some similar, unlabeled data. Thank you so much.