Multi-label classification as array output in pytorch

Hi, I’m trying to make a classifier using a CNN. Each data point consists of a 70x70 image and a label represented as a list of length 5, since each image contains up to 5 digits.

Here’s an example of a label for an image which contains the digits 1, 5 and 9.

[1, 5, 9, -1, -1]

Below are snippets of my code so far, for reference…

import torch
import torch.nn as nn
import torch.optim as optim

class complexNet(nn.Module):
    def __init__(self):
        super(complexNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, ...)
        ...
        self.fc5 = nn.Linear(120, 11)  # 11 possible classes: digits 0-9, plus -1 for 'no digit'.

    def forward(self, x):
        x = x.cuda()
        x = self.pool(torch.sigmoid(self.conv1(x)))
        ...   
        x = x.view(-1, c*w*h)
        ...
        x = self.fc5(x)
        return x

model = complexNet()
optimizer = optim.Adam(...)
loss = nn.MultiLabelSoftMarginLoss()

for i, batch in enumerate(train_loader, 0):
    x, y = batch
    logit = model(x)
    _, predicted = torch.max(logit, dim=1)
    print(predicted)
    J = loss(logit, y)
    print(J.item())

The print statement outputs something like this, where each tensor is of length 10 since I’m using a batch size of 10:

...
tensor([6, 6, 6, 6, 6, 6, 6, 6, 6, 6], device='cuda:0')
tensor([ 6,  6,  6,  6,  6,  6,  6,  6, 10,  6], device='cuda:0')
tensor([6, 4, 6, 5, 6, 6, 6, 5, 6, 6], device='cuda:0')
tensor([6, 6, 6, 6, 6, 6, 5, 6, 6, 6], device='cuda:0')
...

Ideally, a batch of predicted values would look like the following:

tensor([
 [1,2,3,4,5],
 [1,0,0,-1,-1],
 [5,2,3,0,-1],
 [9,2,3,3,-1],
 [0,-1,-1,-1,-1]
...
...
], device='cuda:0')

In addition, and unsurprisingly, the loss function yields the following error:

D:\SoftwarePackages\Anaconda\lib\site-packages\torch\nn\functional.py:1594: UserWarning: Using a target size (torch.Size([10, 5])) that is different to the input size (torch.Size([10, 11])) is deprecated. Please ensure they have the same size.
  "Please ensure they have the same size.".format(target.size(), input.size()))
...
ValueError: Target and input must have the same number of elements. target nelement (50) != input nelement (110)

For simplicity, we can break the problem down into 2 main concerns for the time being:

  1. I would like each element of the predicted tensor to instead be an array of length 5, where each number is a digit from the image.
  2. Should this be done in the forward function, or within the training loop?

I’ve mostly tried to solve this problem based on advice from 2 other topics, but have not had any success.

Any help at all is hugely appreciated. Thank you!

What does the -1 mean? What’s the significance of the ordinal of each target int (which I assume is a class designation)? What’s preventing you from using a more “traditional” approach, e.g. a standard multi-label classifier? You’d definitely have more than 5 labels given your problem description, but it’d be much more straightforward to implement and would still let you rank your labels according to the strength of the match. Are you familiar with expressing labels using multi-hot encoding? If not, I can give you a rundown. If so, I probably misunderstood your question and don’t know enough to help… :wink:

Hi Sam, thank you for responding to my post,

I guess this was pretty important and I should have clarified it in my post: the digits appear in the array in the same left-to-right order as they appear in the image.
For example, if the left-most digit in the image is 1, 5 appears somewhere to its right, and 9 is the right-most digit, then the label is [1, 5, 9, -1, -1]. We include the two -1’s as placeholder values, since the list must be of length 5; that is how the labels are represented in the original dataset. I figured keeping all lists at length 5 would simplify training, but maybe I’m wrong? Hopefully this clears up the problem a little bit.

The reason I chose a CNN is that we are predicting several mutually non-exclusive class labels from a 70x70x1 image input, so I figured convolutions were the best approach to analyzing visual imagery. But of course I’m open to any and all suggestions/advice! I’m not very familiar with expressing labels using multi-hot encoding; could you provide a quick head-start on the topic and how it could be applied in the context of this problem?

I’m still struggling to understand your use case, so I’m not sure this will help, but to multi-hot encode a class list dataset with integer labels of:
Sample 0: [3, 7, 4]
Sample 1: [2, 4]
Sample 2: [9, 1, 6, 2]
Sample 3: [8, 9]
you’d transform it into something like:

Label:     0  1  2  3  4  5  6  7  8  9  ...  N
           |  |  |  |  |  |  |  |  |  |       |
           v  v  v  v  v  v  v  v  v  v       v
Sample 0: [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, ..., 0]
Sample 1: [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, ..., 0]
Sample 2: [0, 1, 1, 0, 0, 0, 1, 0, 0, 1, ..., 0]
Sample 3: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ..., 0]

where N is the total number of classes. Once you’ve transformed the labels, it’s pretty straightforward to put together a simple CNN trained with nn.BCEWithLogitsLoss (there’s a rough sketch at the end of this reply). This is a common strategy for basic CNNs and is the basis of most “multi-label CNN” tutorials, but you’ll notice it does NOT preserve order or provide bounding boxes for detected entities/labeled objects. If you’re looking for that more advanced functionality, I’d recommend looking at:

but that would take you out of the torch framework altogether. Hopefully, someone more knowledgeable can comment and point you in the right direction for using torch. If not, here’s a link to a post that discusses multi-label classification problems in general. You might find some inspiration there…
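
In case it helps you get started, here’s a rough sketch of what that transform and loss setup might look like in torch. The digits_to_multihot helper, the choice of 10 classes (treating -1 purely as padding rather than as an 11th class), and the dummy logits tensor standing in for your model’s output are my own illustrative assumptions, not something taken from your code:

import torch
import torch.nn as nn

def digits_to_multihot(labels, num_classes=10):
    # Turn padded digit lists (e.g. [1, 5, 9, -1, -1]) into multi-hot vectors.
    targets = torch.zeros(len(labels), num_classes)
    for i, digits in enumerate(labels):
        for d in digits:
            if d >= 0:  # skip the -1 placeholders
                targets[i, d] = 1.0
    return targets

labels = [[1, 5, 9, -1, -1],
          [0, -1, -1, -1, -1]]
targets = digits_to_multihot(labels)
# tensor([[0., 1., 0., 0., 0., 1., 0., 0., 0., 1.],
#         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(2, 10)  # stand-in for model(x); the final layer would output 10 logits
loss = criterion(logits, targets)

# At inference time, threshold the sigmoid of the logits to decide which
# digits are present (note: this approach does not recover their order).
predicted = torch.sigmoid(logits) > 0.5

The key point is that each target becomes a fixed-length 0/1 vector with one entry per class, so the target size matches the network’s output size and you no longer get the size-mismatch error you’re currently seeing with MultiLabelSoftMarginLoss.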