Apply softmax over multiple segments of output

Hi all,

I am faced with the following situation. I am using one model to solve multiple classification tasks, where each classification task itself is multi-class, and the number of possible classes varies across classification tasks. To give an example:

The model outputs a vector with 22 elements, where I would like to apply a softmax over:

  1. The first 5 elements
  2. The following 5 elements
  3. The following 8 elements
  4. The last 4 elements

This is because the model is simultaneously solving 4 classification tasks, where the first 2 tasks have 5 candidate classes each, the third task has 8 candidate classes, and the final task has 4 candidate classes.
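For concreteness, the per-segment softmax I have in mind would look something like the sketch below (the batch size of 3 and the random logits are made up for illustration):

```python
import torch

# stand-in for a batch of model outputs of shape [batch, 22]
logits = torch.randn(3, 22)

# split the 22 outputs into the four task segments of sizes 5, 5, 8, 4
segments = torch.split(logits, [5, 5, 8, 4], dim=1)

# apply softmax independently within each segment
probs = [torch.softmax(seg, dim=1) for seg in segments]
```

Each row of each segment then sums to 1, i.e. each segment is a valid per-task probability distribution.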

I would also like to define an appropriate cross-entropy loss that follows this same structure.

My questions are:

  1. How can I use torch.nn.Softmax to achieve this?
  2. How can I define the custom cross-entropy loss mentioned above?

Many thanks!

Hello Ege!

First, for numerical-stability reasons, you shouldn’t use Softmax.
As I outline below, you should use CrossEntropyLoss, which has,
in effect, Softmax built into it.

You don’t need to write a custom cross-entropy loss. Just use
pytorch’s built-in CrossEntropyLoss four times over, once for
each of your classification tasks.

Your model outputs a batch of prediction vectors of shape [nBatch, 22].
Your targets could be packaged in a number of ways. The most
straightforward is probably to have four sets of targets, one for
each classification task. Let’s call the four tasks A, B, C, and D.
Your targets should be batches of integer class labels.

So, for example, targetA should have shape [nBatch] and consist
of class labels that run from 0 to 4, because task A has five classes.
targetB should be the same. targetC should also have shape
[nBatch], but consist of class labels that run from 0 to 7 because
task C has eight classes.

Then:

loss_fn = torch.nn.CrossEntropyLoss()  # only need to do this once

lossA = loss_fn(prediction[:,   0:5], targetA)
lossB = loss_fn(prediction[:,  5:10], targetB)
lossC = loss_fn(prediction[:, 10:18], targetC)
lossD = loss_fn(prediction[:, 18:22], targetD)

loss = lossA + lossB + lossC + lossD

That is, you use indexing to snip out of your vector of 22 predicted
class scores the subset of predictions relevant to each task. So in this
example, elements 10 through 17 (as indexed by 10:18) are the eight
predicted class scores relevant to task C.
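Putting the pieces together, a self-contained sketch of this recipe might look like the following (random predictions and targets stand in for a real model and data):

```python
import torch

torch.manual_seed(0)
n_batch = 8

# stand-in for the model output of shape [nBatch, 22]
prediction = torch.randn(n_batch, 22, requires_grad=True)

# one set of integer class labels per task
targetA = torch.randint(0, 5, (n_batch,))
targetB = torch.randint(0, 5, (n_batch,))
targetC = torch.randint(0, 8, (n_batch,))
targetD = torch.randint(0, 4, (n_batch,))

loss_fn = torch.nn.CrossEntropyLoss()

# snip the relevant slice of the prediction out for each task
lossA = loss_fn(prediction[:, 0:5], targetA)
lossB = loss_fn(prediction[:, 5:10], targetB)
lossC = loss_fn(prediction[:, 10:18], targetC)
lossD = loss_fn(prediction[:, 18:22], targetD)

loss = lossA + lossB + lossC + lossD
loss.backward()
```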

Best.

K. Frank


Hello K. Frank,

Thank you for your swift reply. I applied the changes you recommended, and I’m now faced with the following issue:

In the second iteration of the training loop, the following runtime error is produced:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

When I specify retain_graph=True when calling backward the first time, and set it to False in the subsequent iterations, I then receive the following error message:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [200, 336]]...

I have determined that this tensor corresponds to the last layer of my fully connected network. Below, you can find the relevant code snippet from the training loop:

# update the gradients to zero
optimizer.zero_grad()

# forward pass
logits = model(x)

# compute loss
for i in range(18):
    loss_array[i] = criterion(logits[:, i*11:(i+1)*11], torch.argmax(labels[:, i*11:(i+1)*11], dim=1))

for i in range(9):
    loss_array[18+i] = criterion(logits[:, 198+i*8:198+(i+1)*8], torch.argmax(labels[:, 198+i*8:198+(i+1)*8], dim=1))

for i in range(6):
    loss_array[27+i] = criterion(logits[:, 270+i*11:270+(i+1)*11], torch.argmax(labels[:, 270+i*11:270+(i+1)*11], dim=1))

loss = torch.sum(loss_array)

# backward pass
loss.backward(retain_graph=first)
first = False
train_loss += loss.item()

# update the weights
optimizer.step()

The code is a bit more complicated than the example we discussed earlier. I had to use an array to hold all the losses as I have 33 classification tasks instead of 4.

I suspect that I am not using optimizer.step() in the correct way, as that is the only operation that updates the network layers.

Would you have an insight into the problem with my code?

Many thanks!
Ege

Hello Ege!

I’m not entirely sure what is going on here. You don’t say what
loss_array is, but since you call torch.sum(loss_array),
I will assume that loss_array is some kind of pytorch tensor.

If so, indexing into loss_array multiple times could be your
problem.

(You also don’t say what criterion is. Let me assume that it is
criterion = torch.nn.CrossEntropyLoss(), and therefore that
criterion(...) returns a zero-dimensional pytorch tensor, that is,
a single number packaged as a tensor.)

Try something like:

        # compute loss

        loss = 0.0   # python scalar

        for i in range(18):
            # loss will become a pytorch tensor
            loss = loss + criterion(logits[:, i*11:(i+1)*11], torch.argmax(labels[:, i*11:(i+1)*11], dim=1))

        for i in range(9):
            loss = loss + criterion(logits[:, 198+i*8:198+(i+1)*8], torch.argmax(labels[:, 198+i*8:198+(i+1)*8], dim=1))

        for i in range(6):
            loss = loss + criterion(logits[:, 270+i*11:270+(i+1)*11], torch.argmax(labels[:, 270+i*11:270+(i+1)*11], dim=1))

        loss.backward()

If my theory is right, this use of retain_graph = True is incorrect,
and, rather than fixing the real issue, is just hiding it. So just call
loss.backward(), as outlined above, without specifying retain_graph.
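To illustrate, here is a toy, self-contained version of this accumulation pattern; the model, sizes, and data below are made up (33 tasks with 3 classes each, for brevity), and the second training iteration runs fine without retain_graph:

```python
import torch

torch.manual_seed(0)
n_tasks, n_classes = 33, 3   # hypothetical: 33 tasks, 3 classes each
model = torch.nn.Linear(16, n_tasks * n_classes)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 16)
targets = torch.randint(0, n_classes, (4, n_tasks))

for epoch in range(2):   # the second pass is where the RuntimeError appeared
    optimizer.zero_grad()
    logits = model(x)
    loss = 0.0   # python scalar; becomes a tensor in the loop
    for i in range(n_tasks):
        loss = loss + criterion(logits[:, i*n_classes:(i+1)*n_classes], targets[:, i])
    loss.backward()      # no retain_graph needed
    optimizer.step()
```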

Good luck.

K. Frank

Hi Frank,

Thank you for this answer. Would it make sense to weight these loss components, so that lossA would be weighted by 5/22, and so on, even if all tasks (A…D) have the same priority?
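For concreteness, the weighting I have in mind (proportional to each task's share of the 22 output elements; the loss values below are made up) would be:

```python
import torch

# hypothetical per-task loss values, as returned by CrossEntropyLoss
lossA = torch.tensor(1.6)
lossB = torch.tensor(1.6)
lossC = torch.tensor(2.1)
lossD = torch.tensor(1.4)

# weight each task's loss by its number of classes out of 22
loss = (5/22) * lossA + (5/22) * lossB + (8/22) * lossC + (4/22) * lossD
```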

Best, Peter