Text multiclass classification


(Holiv) #1

Hello Folks,

I am doing text classification with PyTorch and torchtext. My problem is that I am getting a much lower accuracy than I expected. Did I make a mistake in computing the accuracy or in the evaluation function?

  1. My dataset has 5 labels (1, 2, 3, 4, 5), which I converted to one-hot vectors with index_to_one_hot like this:

def index_to_one_hot(label):
    sample_nums = label.size()[0]
    one_hot = torch.tensor([0., 1., 2., 3., 4.])
    one_hot = one_hot.view([1, 5]).expand([sample_nums, 5])
    label = label.view([sample_nums, 1]).expand([sample_nums, 5])
    one_hot = (label.float() == one_hot).float()
    return one_hot

  2. My function that computes the accuracy looks like this:

def compute_accuracy(preds, y):
    p_top1 = preds.topk(1, dim=1)[1]
    y_top1 = y.topk(1, dim=1)[1]
    correct = (p_top1 == y_top1).float().sum()
    label_nums = preds.size()[0]
    return correct, label_nums

  3. My training function looks like this:
    # The labels are called Overall
    # My x is ReviewText
    def train(model, iterator, optimizer, criterion):
        epoch_loss = 0
        epoch_cor = 0
        epoch_label = 0
        model.train()
        for batch in iterator:
            optimizer.zero_grad()
            predictions = model(batch.ReviewText).squeeze(1)
            loss = criterion(predictions, batch.Overall)
            correct, label_nums = compute_accuracy(predictions, index_to_one_hot(batch.Overall))
            loss.backward()
            clip_gradient(model, 1e-1)
            optimizer.step()
            epoch_loss += loss.item()
            epoch_cor += correct
            epoch_label += label_nums
        return epoch_loss / len(iterator), epoch_cor / epoch_label

  4. My evaluation function looks like this:
    def evaluate(model, iterator, criterion):
        epoch_loss = 0
        epoch_cor = 0
        epoch_label = 0
        model.eval()
        with torch.no_grad():
            for batch in iterator:
                predictions = model(batch.ReviewText).squeeze(1)
                loss = criterion(predictions, batch.Overall)
                correct, label_nums = compute_accuracy(predictions, index_to_one_hot(batch.Overall))
                epoch_loss += loss.item()
                epoch_cor += correct
                epoch_label += label_nums
        return epoch_loss / len(iterator), epoch_cor / epoch_label

  5. I used CrossEntropyLoss() as the loss function.

Thanks for any assistance.


#2

Your code to compute the accuracy seems to work for my dummy tensors.
Usually you wouldn’t need one-hot encoded targets, but anyway this code seems to be fine.
I guess the reason for the low accuracy might be somewhere else.
Did you play around with some hyperparameters (learning rate etc.)?
Is the model learning at the beginning at all?


(Holiv) #3

Thanks, @ptrblck. As I said, my target labels are (1, 2, 3, 4, 5) stars and my last layer is the linear layer self.label = nn.Linear(out_channels, output_size).

However, when I don’t use one-hot encoding for the targets, the following error is thrown:

y_top1=y.topk(1,dim=1)[1]

RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Thank you.


#4

The target labels should be in the range [0, nb_classes-1], so you might want to subtract one from them.
If you don’t use one-hot encoded labels you won’t need to get the topk/argmax, since the labels already point to the class index.
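To make both points concrete, here is a minimal sketch, assuming the raw targets are star ratings in 1..5 and the model emits 5 logits per sample. The helper name accuracy_from_indices and the hand-written logits are made up for illustration and are not from the original code.

```python
import torch


def accuracy_from_indices(logits, stars):
    """Return (correct, total) for raw star ratings in 1..5 (hypothetical helper)."""
    targets = stars - 1               # shift 1..5 to class indices 0..4
    predicted = logits.argmax(dim=1)  # index of the largest logit in each row
    correct = (predicted == targets).sum().item()
    return correct, targets.size(0)


# Hand-written logits: rows favour class index 0, 4 and 1 respectively.
logits = torch.tensor([[2.0, 0.1, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.1, 0.1, 2.0],
                       [0.1, 2.0, 0.1, 0.1, 0.1]])
stars = torch.tensor([1, 5, 3])       # raw labels as in the dataset

correct, total = accuracy_from_indices(logits, stars)

# CrossEntropyLoss takes the same shifted class indices directly, no one-hot needed.
loss = torch.nn.CrossEntropyLoss()(logits, stars - 1)
```

The targets stay a 1-D tensor here, which is why calling topk with dim=1 on them raises the "Dimension out of range" error, and why no topk/argmax is needed on the target side.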


(Holiv) #5

Thank you. If possible, could you send me the dummy tensors you tested with? That would help a lot. :pray:


#6

Sure! I just used these random inputs and checked the result manually:

label = torch.randint(0, 5, (10,))
index_to_one_hot(label)
preds = torch.randn(10, 5)
compute_accuracy(preds, index_to_one_hot(label))
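
Note that the dummy labels above are drawn from 0..4. As a variation on that check, and only as a sketch, the snippet below feeds raw ratings in 1..5 to the same index_to_one_hot logic (copied from the first post). The names raw_stars, row_sums_raw and row_sums_shifted are made up for illustration.

```python
import torch


def index_to_one_hot(label):
    # Same logic as the function in the first post.
    sample_nums = label.size(0)
    classes = torch.tensor([0., 1., 2., 3., 4.]).view(1, 5).expand(sample_nums, 5)
    label = label.view(sample_nums, 1).expand(sample_nums, 5)
    return (label.float() == classes).float()


raw_stars = torch.tensor([1, 3, 5])

# A valid one-hot row sums to 1; a row of all zeros sums to 0.
row_sums_raw = index_to_one_hot(raw_stars).sum(dim=1)
row_sums_shifted = index_to_one_hot(raw_stars - 1).sum(dim=1)
```

With the raw ratings, a rating of 5 matches none of the columns 0..4, so its row is all zeros and topk on it no longer identifies the true class. After subtracting one, every row has exactly one 1.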

(Anton Melnikov) #7

If you are training your model to do multilabel classification, where each input may belong to multiple targets, CrossEntropyLoss may not be a good idea. At the core of it, CrossEntropyLoss computes the softmax of the output, which means the probability of all output classes will be part of the same probability distribution. That is good for training a multiclass classifier, which has to assign only one, “best” target class to each input, based on the highest probability.

In most multilabel tasks, you need to compute independent probabilities of each output class. For that, your output nonlinearity should be a sigmoid, and I’ve found that a good loss function is MultiLabelSoftMarginLoss (it applies the sigmoid internally, so it expects raw logits).
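
The difference between the two output nonlinearities can be seen on a toy score vector. This is a plain-Python sketch with made-up scores, not the PyTorch implementation:

```python
import math


def softmax(scores):
    """Normalise scores into one distribution over all classes."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Squash each score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


scores = [2.0, 1.0, -1.0]                      # made-up raw outputs for three classes
softmax_probs = softmax(scores)                # classes compete: values sum to 1
sigmoid_probs = sigmoid(scores)                # classes are independent: no fixed sum
```

With softmax, raising one class's probability lowers the others, which fits the single-best-class (multiclass) case. With sigmoid, several classes can be close to 1 at once, which is what a multilabel task needs.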


(Holiv) #8

Thank you for your reply. Sorry for the confusion about the title; it is multiclass classification. The labels for my five classes are 1, 2, 3, 4, 5.