MultiLabel Problem

logarith · December 4, 2020, 9:13pm

Hi,

I’m trying to solve a multi-label problem.
I have a tensor of around 400*2000 values.
The 2000 values are zeros and ones, but on average only 10 of the 2000 values in each vector are one; the others are zero.
A one should carry more importance than a zero.
I standardize the values with a mean square algorithm.
So this is my first question: is this a good idea in that case?

I also have output tensors with the size of 60 classes, which are not mutually exclusive. There are always 10 classes set to one and the others zero.

This is my network

network = torch.nn.Sequential(
    torch.nn.Linear(len(self.getVector()), 250),
    torch.nn.ReLU(),
    torch.nn.Linear(250, 150),
    torch.nn.ReLU(),
    torch.nn.Linear(150, 60),
)

loss_function = torch.nn.MultiLabelSoftMarginLoss()

optimizer = torch.optim.Adam(network.parameters(), lr=0.0007)
network.train()
for i in range(500):
    predicted_value = network(test_input_tensor)
    loss = loss_function(predicted_value, test_output_tensor)
    print(i, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

network.eval()
output = network(prognostic_input_tensor)

As I don’t have much experience in machine learning, I would like to know
whether this is a good approach for a multi-label problem with the
features mentioned above.
It seems to me that it predicts a lot of negative values, which I don’t understand.

KFrank · December 5, 2020, 1:16am

Hi Log!

logarith:

I also have output tensors with the size of 60 classes, which are not mutually exclusive. There are always 10 classes set to one and the others zero.
network = torch.nn.Sequential(
    ...
    torch.nn.Linear(150, 60),
)

loss_function = torch.nn.MultiLabelSoftMarginLoss()
It seems to me that it predicts a lot of negative values, which I don’t understand.

Because the output of your model is the output of your last Linear
layer, you are predicting raw-score logits. A logit value that is less
than zero corresponds to a predicted probability less than one half.
Typically a probability of less than one half for the “1” state when
interpreted as a hard “yes-no” prediction would be taken to be a
“0”-state prediction (and greater than one half would be the “1” state).
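
As a minimal sketch of the logit-to-probability relationship described
above (the logit values are made up for illustration, they are not
outputs of the actual model): sigmoid turns a logit into a probability,
and thresholding the probability at one half is the same test as
thresholding the logit at zero.

```python
import torch

# Hypothetical raw outputs (logits) of the last Linear layer for one sample.
logits = torch.tensor([-2.0, -0.1, 0.0, 0.3, 4.0])

# sigmoid maps a logit to a probability; a logit of 0 maps to exactly 0.5.
probs = torch.sigmoid(logits)

# Hard yes/no predictions: "probability > 0.5" is the same test as "logit > 0".
hard_from_probs = (probs > 0.5).long()
hard_from_logits = (logits > 0).long()

print(probs)
print(hard_from_logits)
```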

Many more of your “output-tensor” target values are 0’s than are
1’s, so if you weight each individual target value equally in the loss
function, your model can train to do a good job on the loss function
by preferentially predicting 0’s (that is, predicting negative logits),
regardless of the input data.
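
A rough illustration of that effect, assuming a single made-up target
with 10 ones out of 60 and an unweighted loss: a constant negative logit
for every class, which ignores the input entirely, already scores better
than an undecided logit of zero.

```python
import math
import torch

# Hypothetical target: 10 of the 60 classes are in the "1" state.
targets = torch.zeros(1, 60)
targets[0, :10] = 1.0

loss_function = torch.nn.BCEWithLogitsLoss()  # every target value weighted equally

# Undecided prediction: logit 0 (probability 0.5) for every class.
loss_at_zero = loss_function(torch.zeros(1, 60), targets).item()

# Input-independent prediction: the same negative logit for every class.
# log(10 / 50) is the logit whose probability is 10 / 60.
constant_logit = math.log(10 / 50)
loss_constant = loss_function(torch.full((1, 60), constant_logit), targets).item()

print(loss_at_zero, loss_constant)
```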

The common approach to addressing this is to weight your
less-frequent 1 target values more heavily in your loss function.

Note that BCEWithLogitsLoss is essentially the same as
MultiLabelSoftMarginLoss but has a pos_weight argument
that you can pass to its constructor.
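
A quick check of that equivalence, as a sketch on random made-up data
(the batch size of 4 and the seed are arbitrary): with default settings
both loss functions should return the same value.

```python
import torch

torch.manual_seed(0)

# Hypothetical batch: 4 samples, 60 classes, roughly 10 of 60 targets set to 1.
logits = torch.randn(4, 60)
targets = (torch.rand(4, 60) < 10 / 60).float()

soft_margin = torch.nn.MultiLabelSoftMarginLoss()(logits, targets)
bce = torch.nn.BCEWithLogitsLoss()(logits, targets)

# Both use the default "mean" reduction, so the values agree up to round-off.
same = torch.allclose(soft_margin, bce, atol=1e-6)
print(soft_margin.item(), bce.item(), same)
```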

You say that you have 60 classes, and that any given sample target
has 10 classes in the 1 state and 50 in the 0 state. If all of your
classes are about equally likely to be in the 1 state, you could use
the same pos_weight for all of them. A reasonable value would be
pos_weight = n_negative / n_positive. So:

loss_function = torch.nn.BCEWithLogitsLoss (pos_weight = torch.tensor ([5.0]))

If the likelihoods of your different classes having target value 1 are
not all broadly similar, then you would pass in a tensor of length 60
for your pos_weight, that is, a different pos_weight value for each
class.
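
One possible way to build such a per-class pos_weight, sketched on a
tiny made-up target tensor (6 samples and 4 classes standing in for the
real 60 classes). The clamp is an added assumption to avoid dividing by
zero for a class that never appears as a 1.

```python
import torch

# Hypothetical training targets: rows are samples, columns are classes.
targets = torch.tensor([
    [1., 0., 0., 1.],
    [1., 0., 0., 0.],
    [1., 1., 0., 0.],
    [0., 0., 1., 0.],
    [0., 1., 0., 0.],
    [0., 0., 0., 1.],
])

# Count 1's and 0's separately for each class (column).
n_positive = targets.sum(dim=0)
n_negative = targets.shape[0] - n_positive

# pos_weight = n_negative / n_positive, computed per class.
pos_weight = n_negative / n_positive.clamp(min=1)

# A pos_weight of length n_classes broadcasts over the batch dimension.
loss_function = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = loss_function(torch.zeros(6, 4), targets)
print(pos_weight, loss.item())
```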

Best.

K. Frank

logarith · December 5, 2020, 1:20pm

Thanks for the answer.
But for the input tensors, do I have to normalize them before passing them to the network,
or can I input tensors consisting of ones and zeros?

KFrank · December 5, 2020, 8:04pm

Hi Log!

Passing in the “raw” tensors should be fine. Being ones and zeros,
they are already close to being normalized. Changing them to, say,
-1 and 1 so (if they were fifty-fifty) they would have a mean of 0 and
a standard deviation of 1 wouldn’t affect things much. (Try it both
ways – I doubt you’ll see any difference.)
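
For reference, the remapping mentioned above could look like this (a
sketch on a made-up fifty-fifty vector, not the real input data):

```python
import torch

# Hypothetical input vector with equally many zeros and ones.
x = torch.tensor([0., 1., 0., 1., 1., 0., 0., 1.])

# Map 0 -> -1 and 1 -> +1.
x_pm = 2.0 * x - 1.0

# For a fifty-fifty vector this gives mean 0 and (population) standard deviation 1.
mean = x_pm.mean().item()
std = x_pm.std(unbiased=False).item()
print(mean, std)
```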

(In contrast, think about a 16-bit grayscale image as input to a
network. The pixel values run from zero to about 65,000, so they
can be rather large. Normalizing the pixel values so that they are
of order one makes life easier for the network.)
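
A minimal sketch of that kind of rescaling, assuming a random integer
tensor as a stand-in for a real 16-bit image and a simple division by
the largest possible pixel value (other normalization schemes are
equally valid):

```python
import torch

torch.manual_seed(0)

# Hypothetical 16-bit grayscale image: integer pixel values in [0, 65535].
raw = torch.randint(0, 65536, (1, 8, 8))

# Rescale to floats in [0, 1] so the values are of order one.
scaled = raw.float() / 65535.0
print(scaled.min().item(), scaled.max().item())
```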

Best.

K. Frank