Text multiclass classification

holiv · January 2, 2019, 1:08am

Hello Folks,

I am doing text classification using PyTorch and Torchtext. My problem is that I am getting a much lower accuracy than I expected. Did I make a mistake in computing the accuracy or in the evaluation function?

My dataset has 5 labels (1, 2, 3, 4, 5). I convert them to one-hot vectors with index_to_one_hot like this:

def index_to_one_hot(label):
    sample_nums = label.size()[0]
    one_hot = torch.tensor([0., 1., 2., 3., 4.])
    one_hot = one_hot.view([1, 5]).expand([sample_nums, 5])
    label = label.view([sample_nums, 1]).expand([sample_nums, 5])
    one_hot = (label.float() == one_hot).float()
    return one_hot

My function that computes the accuracy is like this:

def compute_accuracy(preds, y):
    p_top1 = preds.topk(1, dim=1)[1]
    y_top1 = y.topk(1, dim=1)[1]
    correct = (p_top1 == y_top1).float().sum()
    label_nums = preds.size()[0]
    return correct, label_nums

My training function is like this:
# labels are called (Overall)
# My x is (ReviewText)
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_cor = 0
    epoch_label = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.ReviewText).squeeze(1)
        loss = criterion(predictions, batch.Overall)
        correct, label_nums = compute_accuracy(predictions, index_to_one_hot(batch.Overall))
        loss.backward()
        clip_gradient(model, 1e-1)
        optimizer.step()
        epoch_loss += loss.item()
        epoch_cor += correct
        epoch_label += label_nums
    return epoch_loss / len(iterator), epoch_cor / epoch_label

My evaluation function is like this:

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_cor = 0
    epoch_label = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.ReviewText).squeeze(1)
            loss = criterion(predictions, batch.Overall)
            correct, label_nums = compute_accuracy(predictions, index_to_one_hot(batch.Overall))
            epoch_loss += loss.item()
            epoch_cor += correct
            epoch_label += label_nums
    return epoch_loss / len(iterator), epoch_cor / epoch_label

I used CrossEntropyLoss() as the loss function.

Thanks for any assistance.

ptrblck · January 4, 2019, 10:09pm

Your code to compute the accuracy seems to work for my dummy tensors.
Usually you wouldn’t need one-hot encoded targets, but anyway this code seems to be fine.
I guess the reason for the low accuracy might be somewhere else.
Did you play around with some hyperparameters (learning rate etc.)?
Is the model learning at the beginning at all?

holiv · January 5, 2019, 1:52am

Thanks, @ptrblck. As I said, my target labels are (1, 2, 3, 4, 5) stars, and my last layer is the linear layer self.label = nn.Linear(out_channels, output_size).

However, when I don't one-hot encode the targets, the following error is thrown:

y_top1=y.topk(1,dim=1)[1]

RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Thank you.

ptrblck · January 5, 2019, 12:10pm

The target labels should be in the range [0, nb_classes-1], so you might want to subtract one from them.
If you don’t use one-hot encoded labels you won’t need to get the topk/argmax, since the labels already point to the class index.
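
A minimal sketch of what that could look like, assuming the labels arrive as a 1-D integer tensor with values 1 to 5 and the model returns raw logits of shape [batch_size, 5]. The helper name compute_accuracy_from_indices and the hand-built tensors are hypothetical and not part of the original code:

```python
import torch
import torch.nn as nn


def compute_accuracy_from_indices(logits, targets):
    """Count correct predictions given class-index targets (hypothetical helper)."""
    # argmax over the class dimension gives the predicted class index per sample
    predicted = logits.argmax(dim=1)
    correct = (predicted == targets).sum().item()
    return correct, targets.size(0)


# Assumed raw labels in 1..5; shift them to 0..4 as nn.CrossEntropyLoss expects
raw_labels = torch.tensor([1, 5, 3, 2])
targets = raw_labels - 1

# Hand-built logits: the largest entry in each row marks the predicted class
logits = torch.tensor([
    [4.0, 0.0, 0.0, 0.0, 0.0],  # predicts class 0, target 0 (correct)
    [0.0, 0.0, 0.0, 0.0, 4.0],  # predicts class 4, target 4 (correct)
    [0.0, 4.0, 0.0, 0.0, 0.0],  # predicts class 1, target 2 (wrong)
    [0.0, 4.0, 0.0, 0.0, 0.0],  # predicts class 1, target 1 (correct)
])

loss = nn.CrossEntropyLoss()(logits, targets)
correct, total = compute_accuracy_from_indices(logits, targets)
print(correct, total)
```

The RuntimeError quoted earlier comes from calling topk with dim=1 on a 1-D target tensor, which only has dimension 0. With class-index targets that call is simply not needed.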

holiv · January 5, 2019, 1:11pm

Thank you. If possible, could you send me the dummy tensors you tested with? That would help me a lot.

ptrblck · January 5, 2019, 7:17pm

Sure! I just used these random inputs and checked the result manually:

label = torch.randint(0, 5, (10,))
index_to_one_hot(label)
preds = torch.randn(10, 5)
compute_accuracy(preds, index_to_one_hot(label))
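
One way the manual check could be automated, as a sketch only: the two functions are copied from the first post, and the result is compared against a plain argmax on the raw class indices. The seed and the variable name expected are arbitrary choices for illustration:

```python
import torch


def index_to_one_hot(label):
    # Copied from the first post: compares each label against the values 0..4
    sample_nums = label.size()[0]
    one_hot = torch.tensor([0., 1., 2., 3., 4.])
    one_hot = one_hot.view([1, 5]).expand([sample_nums, 5])
    label = label.view([sample_nums, 1]).expand([sample_nums, 5])
    return (label.float() == one_hot).float()


def compute_accuracy(preds, y):
    # Copied from the first post: compares top-1 indices of predictions and targets
    p_top1 = preds.topk(1, dim=1)[1]
    y_top1 = y.topk(1, dim=1)[1]
    correct = (p_top1 == y_top1).float().sum()
    return correct, preds.size()[0]


torch.manual_seed(0)  # arbitrary seed, only for repeatability
label = torch.randint(0, 5, (10,))  # class indices in 0..4
preds = torch.randn(10, 5)

correct, label_nums = compute_accuracy(preds, index_to_one_hot(label))

# Independent reference: argmax of the predictions against the raw class indices
expected = (preds.argmax(dim=1) == label).sum().item()
print(correct.item(), expected, label_nums)
```

Note that this check uses labels in 0..4. Since index_to_one_hot compares against the values 0 to 4, a label of 5 would match nothing and produce an all-zero row, so it is worth checking which values actually reach the function in the real pipeline.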

notnami · January 10, 2019, 4:31pm

If you are training your model to do multilabel classification, where each input may belong to multiple targets, CrossEntropyLoss may not be a good idea. At the core of it, CrossEntropyLoss computes the softmax of the output, which means the probability of all output classes will be part of the same probability distribution. That is good for training a multiclass classifier, which has to assign only one, “best” target class to each input, based on the highest probability.

In most multilabel tasks, you need to compute independent probabilities of each output class. For that, your output nonlinearity should be a sigmoid, and I’ve found that a good loss function is MultiLabelSoftMarginLoss.
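
To make the difference concrete, here is a small sketch with made-up logits and targets (none of these tensors come from the original poster's data). Both nn.CrossEntropyLoss and nn.MultiLabelSoftMarginLoss take raw logits, so the softmax or sigmoid is only applied explicitly when you want to inspect probabilities:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability
logits = torch.randn(4, 5)  # raw model outputs: 4 samples, 5 classes

# Multiclass: one class index per sample; the softmax happens inside the loss
class_targets = torch.tensor([0, 4, 2, 1])
multiclass_loss = nn.CrossEntropyLoss()(logits, class_targets)
softmax_probs = torch.softmax(logits, dim=1)  # each row sums to 1

# Multilabel: a 0/1 vector per sample; several classes may be active at once
multi_hot_targets = torch.tensor([
    [1., 0., 1., 0., 0.],
    [0., 0., 0., 0., 1.],
    [0., 1., 1., 1., 0.],
    [1., 1., 0., 0., 0.],
])
multilabel_loss = nn.MultiLabelSoftMarginLoss()(logits, multi_hot_targets)
sigmoid_probs = torch.sigmoid(logits)  # independent per-class probabilities

print(multiclass_loss.item(), multilabel_loss.item())
```

For the five-star rating task described in this thread, each review has exactly one label, so the first variant is the one that applies.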

holiv · January 11, 2019, 12:43am

Thank you for your reply. I mixed up the terms in the title; it is multiclass classification. The labels for my five classes are 1, 2, 3, 4, 5.