BCE Loss Stalling for Multilabel, Multiclass Image Classification

I’m trying to predict a multilabel, multiclass output from a series of image features (panoramic images passed through ResNet-152, giving a tensor of BATCH_SIZE x NUM_IMAGES x 36 (NUM_VIEWS) x 2048 (FEATURE_SIZE)). The output of the model is BATCH_SIZE x NUM_CLASSES (~3000). Note that there is a heavy class imbalance. One series of images may contain multiple objects, so I used BCE loss (with logits).

CV is not my specialty. Below is the model I’ve been using (which may be part of the problem). The assumption is that any single panorama can contribute to the output vector, which is why a Conv1d was used (I wasn’t sure how to process only one set of panoramic images at a time), and the outputs of each convolution (each of size NUM_CLASSES) are max-pooled together.

import torch
import torch.nn as nn

class LandMarkPredictionModule(nn.Module):
    def __init__(self, num_classes, num_views=36, kernel_size=4):
        super(LandMarkPredictionModule, self).__init__()
        self.num_views = num_views
        self.kernel_size = kernel_size
        self.num_classes = num_classes
        # treat the 36 views as channels and convolve over the 2048-dim feature axis
        self.conv = nn.Conv1d(in_channels=self.num_views, out_channels=1, kernel_size=self.kernel_size)
        # 2045 = 2048 - kernel_size + 1, the length left after the convolution
        self.fc = nn.Linear(in_features=2045, out_features=self.num_classes)
        self.act_func = nn.Sigmoid()

    def forward(self, images):
        # images: batch_size x length (points on the path) x num_views x feature_size
        batch_size, length, num_views, feat_size = images.shape
        # fold the path dimension into the batch so each point is processed independently
        reshaped_images = images.reshape(batch_size * length, num_views, feat_size)
        convolved_images = self.conv(reshaped_images).squeeze(1)
        output = self.act_func(self.fc(convolved_images))
        # un-fold and max-pool the per-point predictions over the path
        output = output.reshape(batch_size, length, self.num_classes)
        return output.max(dim=1)[0]

The loss stalls out at ~0.69, even when I try to overfit to just one data point. Below is some example code and the data used.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

pred_module = LandMarkPredictionModule(len(landmark_classes))
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(pred_module.parameters(), lr=1e-3)
# USE BCE LOSS
# TODO: manage class imbalances
NUM_EPOCHS = 100
# stack two samples into a tiny "dataset" to try to overfit
tst_images = torch.from_numpy(np.stack((image_class_pairs[0][0], image_class_pairs[3][0]), axis=0))
tst_labels = torch.from_numpy(np.stack((image_class_pairs[0][1], image_class_pairs[3][1]), axis=0))

for epoch in range(NUM_EPOCHS):
    optimizer.zero_grad()

    logits = pred_module(tst_images)
    loss = criterion(logits, tst_labels)
    loss.backward()
    print(loss.item())
    optimizer.step()

The sizes of the inputs and outputs above are: tst_images is 2 x 6 x 36 x 2048 and tst_labels is 2 x NUM_CLASSES.

I’m in the process of experimenting with the model, but beyond that, what else could cause the loss to stay stuck (the data, a bug in the code)?

Hi Felix!

Because you are using BCEWithLogitsLoss (seemingly reasonable
for your use case), you should get rid of the Sigmoid after your final
Linear layer, fc. (Sigmoid converts logits to probabilities, likely not
what you want.)
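For concreteness, here is a minimal sketch of what dropping the Sigmoid might look like, reusing the forward() from your post:

    def forward(self, images):
        batch_size, length, num_views, feat_size = images.shape
        reshaped_images = images.reshape(batch_size * length, num_views, feat_size)
        convolved_images = self.conv(reshaped_images).squeeze(1)
        output = self.fc(convolved_images)  # raw logits; BCEWithLogitsLoss applies the sigmoid itself
        output = output.reshape(batch_size, length, self.num_classes)
        return output.max(dim=1)[0]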

Other than that, I don’t have much intuition about your use case, or
what “NUM_IMAGES” (apparently 6) might mean conceptually.

Having said that, the shenanigans where you reshape your data
into a “pseudo-batch” of size batch_size * length, pass it through
your network, and then reshape it back to a batch of batch_size
strike me as a little odd.

Best.

K. Frank


Because you are using BCEWithLogitsLoss (seemingly reasonable
for your use case), you should get rid of the Sigmoid after your final
Linear layer, fc. (Sigmoid converts logits to probabilities, likely not
what you want.)

Thank you, K. Frank!! I facepalmed hard after reading this; I hadn’t considered the effects of that activation layer…

Changing that activation function to LeakyReLU helped; the loss now goes down to 0.05 after 100 epochs over 2 samples.

Other than that, I don’t have much intuition about your use case, or
what “NUM_IMAGES” (apparently 6) might mean conceptually.

The network’s input is a tensor of image features representing panoramic views along a path (36 views separated by 30 degrees, with 12 views looking up, 12 looking forward, and 12 looking down). NUM_IMAGES is the length of the path (it could have been named better); in the example case there are 6 points. The goal of the network is to find which objects are present at a given time step.

The problem is that I don’t have ground truth for the objects present at each point in the path, but I do know which objects appear somewhere along the path. The idea is that a model trained to predict the objects present in the whole path, treating time steps as independent and aggregating them with max pooling, could then be used to classify single points afterwards.
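Roughly, the intended training / eval split looks like this (a sketch with made-up names such as per_step_model and path_labels):

    # training: only path-level labels are available
    per_step_logits = per_step_model(images)       # batch x length x num_classes (hypothetical per-point model)
    path_logits = per_step_logits.max(dim=1)[0]    # max over the path: "does the object appear anywhere?"
    loss = criterion(path_logits, path_labels)     # BCEWithLogitsLoss against path-level labels

    # eval: read the per-point predictions directly
    per_point_probs = torch.sigmoid(per_step_logits)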

Having said that, the shenanigans where you reshape your data
into a “pseudo-batch” of size batch_size * length, pass it through
your network, and then reshape it back to a batch of batch_size
strike me as a little odd.

To process each point in the path separately, I decided to use a 1d convolution, but that required reshaping the output. Rather than flatten it and lose some of the benefits of the current structure, I tried what you described above.

Originally the plan was to use a 2d convolution, but that seemed to require iterating through the data to avoid mixing multiple points of the path in a single convolution.

Since the images do capture a panoramic view, I wanted to capture “interactions” between neighboring images. A fully connected layer could do the trick, but it would be big ~(36 x 2048 x HIDDEN_SIZE), hence the use of convolutions.
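For a rough sense of scale (illustrative numbers only; the hidden size is made up):

    hidden_size = 512                      # hypothetical
    fc_params = 36 * 2048 * hidden_size    # ~37.7M weights for a fully connected layer over all views
    conv1d_params = 1 * 36 * 4 + 1         # Conv1d(36 -> 1, kernel_size=4): 145 parameters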

Hi Felix!

In the most common usage, you would have no activation between
your last Linear layer and your BCEWithLogitsLoss loss criterion,
neither a Sigmoid nor a LeakyReLU.

Also, on further inspection, I see that you do not have a nonlinear
activation between your Conv1d and your Linear layer. Again,
in the most common usage, you would. (ReLU, Sigmoid, and
LeakyReLU could all be reasonable choices.)
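For example (one possible placement; ReLU is an arbitrary choice here):

    convolved_images = torch.relu(self.conv(reshaped_images)).squeeze(1)  # nonlinearity after the conv
    output = self.fc(convolved_images)                                    # then straight to raw logits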

Just thinking out loud here …

It seems to me that there is something inherently “sequential” about
the 12 views in each of the three horizontal “slices” that make up your
panoramic “images” – hence a convolution could make sense – but
that the 36 views together don’t make up a clean “sequence.” Perhaps
the “up,” “forward,” and “down” views should be “channels,” and not
convolved over. Or perhaps a vertical up-forward-down “slice” counts
as a sequence, so maybe a Conv2d would make sense.

In a similar vein, if the 6 points in your paths are “close enough”
together, perhaps they should be treated as a sequence, and
convolved over.

So perhaps you should have a three-channel (up-forward-down)
Conv2d that convolves over the six points in your paths and the
twelve views in your horizontal panorama slices. Or perhaps even
a full Conv3d over all three.
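Sketching one possible reading of this, under the assumption that the 36 views decompose cleanly into 3 vertical slices x 12 horizontal views, that the 2048 ResNet features get folded into the channels, and with arbitrary layer sizes:

    import torch
    import torch.nn as nn

    batch_size, length, feat_size = 2, 6, 2048
    images = torch.randn(batch_size, length, 36, feat_size)

    # split the 36 views into 3 vertical slices (up / forward / down) x 12 horizontal views,
    # fold the slices and the ResNet features into channels, and convolve over
    # the (path points, horizontal views) grid
    x = images.reshape(batch_size, length, 3, 12, feat_size)
    x = x.permute(0, 2, 4, 1, 3).reshape(batch_size, 3 * feat_size, length, 12)
    conv = nn.Conv2d(in_channels=3 * feat_size, out_channels=256, kernel_size=3, padding=1)
    out = conv(x)   # batch_size x 256 x length x 12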

Best.

K. Frank

In the most common usage, you would have no activation between
your last Linear layer and your BCEWithLogitsLoss loss criterion,
neither a Sigmoid nor a LeakyReLU.

Makes sense. Considering the shape of the sigmoid, adding a ReLU after the last linear layer would make my lowest possible predicted probability close to 0.5.
For now, I’ve also tried three FC layers, which outperform the previous code (5-fold cross-validation).

        # inside __init__; hidden_sizes, num_classes, and self.dropout_rate come from the constructor
        self.input_layer = nn.Sequential(
            nn.Dropout(self.dropout_rate),
            nn.Linear(2048, hidden_sizes[0]),                  # per-view projection of the ResNet features
            nn.LeakyReLU(),
            nn.Flatten(start_dim=-2),                          # merge the 36 views with their hidden features
            nn.Linear(hidden_sizes[0] * 36, hidden_sizes[1]),
            nn.LeakyReLU(),
            nn.Linear(hidden_sizes[1], num_classes),           # raw logits for BCEWithLogitsLoss
        )
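With input features of shape (batch, 36, 2048) (or (batch, length, 36, 2048)), the first Linear projects each view’s features separately, the Flatten collapses the 36 views with their hidden features into one vector, and the last two layers map that down to the class logits.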

In a similar vein, if the 6 points in your paths are “close enough”
together, perhaps they should be treated as a sequence, and
convolved over.

The points are physically rather far apart and may contain dissimilar information. Think navigating in Street View on Google Maps, but with obstructions between the navigable points. Moreover, wouldn’t that produce a shorter sequence at the output? The goal is to get a probability distribution at each time step at eval time (it’s for a downstream task).

So perhaps you should have a three-channel (up-forward-down)
Conv2d that convolves over the six points in your paths and the
twelve views in your horizontal panorama slices. Or perhaps even
a full Conv3d over all three.

Good point! I don’t think the views are arranged quite like that, though. From what I remember, the first 12 views are bottom, the next 12 are middle, and the last 12 are top; I’ll check. With a stride of 12 it should work. It’s getting late, but I’ll try that tomorrow and see how it performs.