Incorrect size for labels and predictions for loss fn (embedding layer)

I am new to PyTorch. I am trying to implement a simple model with an embedding layer.

I am using the SMS spam dataset.

I have already converted and padded the text data into indices and built a bacth_generator function which yields batches of targets and inputs as tensors.
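For context, the generator does roughly the following (a simplified sketch, not my exact code): it yields a tensor of labels and a tensor of padded index sequences for each batch.

import torch

def bacth_generator(data, batch_size):
    # data: list of (label, token_indices) pairs, already padded to equal length
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        labels = torch.tensor([label for label, _ in batch], dtype=torch.float)
        texts = torch.tensor([tokens for _, tokens in batch], dtype=torch.long)
        yield labels, texts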

When I try to run the training loop, it says:

ValueError: Using a target size (torch.Size([1])) that is different to the input size (torch.Size([1, 83, 7405])) is deprecated. Please ensure they have the same size.

It seems the size of the predictions from the model is not the same as the size of the target batches, so the loss cannot be calculated.

What should I do to make them the same size?

Model:

import torch.nn as nn
import torch.nn.functional as F

class NLP(nn.Module):
    def __init__(self, embedding_size=50, vocab_size=vocabSize):
        super(NLP, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear1 = nn.Linear(embedding_size, 100)

    def forward(self, inputs):
        lookup_embeds = self.embeddings(inputs)
        out = self.linear1(lookup_embeds)
        out = F.log_softmax(out)
        return out

Training loop:

import torch.optim as optim

losses = []
loss = nn.BCELoss()
model = NLP(vocab_size=vocabSize, embedding_size=50)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for l, t in bacth_generator(train, 32):
        model.zero_grad()
        prediction = model(t)
        output = loss(prediction, l)

ValueError: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([32, 83, 100])) is deprecated. Please ensure they have the same size.

nn.BCELoss expects the model output and target to have the same shape.
Based on your code snippet it seems that the model output has two additional dimensions, so could you explain what the model is supposed to predict and how this would fit the target shape?
Also, use torch.sigmoid, as nn.BCELoss expects probabilities coming from the model, or (better) remove F.log_softmax and use nn.BCEWithLogitsLoss for more numerical stability.
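For example, something like this (a minimal sketch, assuming the model outputs a single value per sample):

import torch
import torch.nn as nn

batch_size = 32
logits = torch.randn(batch_size, 1)                     # raw model outputs
targets = torch.randint(0, 2, (batch_size, 1)).float()  # same shape as the output

# Option 1: apply sigmoid and use nn.BCELoss, which expects probabilities
probs = torch.sigmoid(logits)
loss1 = nn.BCELoss()(probs, targets)

# Option 2 (preferred): pass the raw logits to nn.BCEWithLogitsLoss
# for better numerical stability
loss2 = nn.BCEWithLogitsLoss()(logits, targets)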

Thank you for your reply. This is supposed to be a binary classification model. I am trying to learn how to implement an embedding layer. As for how to fit the target shape, this is exactly what I want to know! Why does the prediction become 3-dimensional?

Besides,

  1. Are there any concepts that I should keep in mind as a beginner when implementing a PyTorch model?

  2. Are there any tutorials or learning materials that you recommend?

I guess because you are feeding a 2D input to the model as seen here:

import torch
import torch.nn as nn

embeddings = nn.Embedding(100, 10)
linear1 = nn.Linear(10, 100)

x = torch.randint(0, 100, (2, 3))  # 2D input of indices: [batch_size=2, seq_len=3]
out = embeddings(x)
print(out.shape) # torch.Size([2, 3, 10])
out = linear1(out)
print(out.shape) # torch.Size([2, 3, 100])

You can print the shapes of all tensors in the forward method to check it manually.
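E.g. in your forward method:

def forward(self, inputs):
    lookup_embeds = self.embeddings(inputs)
    print(lookup_embeds.shape)  # [batch_size, seq_len, embedding_size]
    out = self.linear1(lookup_embeds)
    print(out.shape)            # [batch_size, seq_len, 100]
    out = F.log_softmax(out)
    print(out.shape)            # [batch_size, seq_len, 100]
    return out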

Depending on your knowledge about ML/DL you might want to check out a course (e.g. FastAI) or just take a look at the tutorials. It’s a bit hard to recommend anything else without knowing what you are struggling with.

I have noticed that other models use operations like torch.max() or torch.sum() right after the nn.Linear() layer, which makes the 3D output from the previous layer 2D again, so I am able to use it with the labels to calculate the loss.

So now my question has become: why would someone pick torch.max() or torch.sum() or other operations? Is that part of the hyperparameter tuning procedure?

I think it depends on your use case and what the inputs represent.
A 2D input using indices (as is used in my example) could represent e.g. a temporal signal, where each time step is assigned to an index. This approach could be used e.g. if you are mapping words to indices and are trying to pass “sequences of words” (i.e. sentences) to the model.
I’m not familiar with your use case and thus also don’t know what the expected predictions are, e.g. a per-word or sentence classification etc.
In case you are trying to reproduce a specific approach explained in a paper, I would assume the authors might have explained their approach there.
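If you want a single prediction per sentence (which would match your [batch_size] target), one common option is to reduce the sequence dimension before the final classification layer, e.g. by averaging. A rough sketch (the class name, pooling choice, and sizes are just illustrative, not taken from the models you have seen):

import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_size=50):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear1 = nn.Linear(embedding_size, 1)

    def forward(self, inputs):         # inputs: [batch_size, seq_len]
        out = self.embeddings(inputs)  # [batch_size, seq_len, embedding_size]
        out = out.mean(dim=1)          # pool over seq_len -> [batch_size, embedding_size]
        return self.linear1(out)       # [batch_size, 1] logits

model = SentenceClassifier(vocab_size=10000)  # example vocab size
x = torch.randint(0, 10000, (32, 83))         # batch of 32 padded sequences of length 83
logits = model(x)
print(logits.shape)  # torch.Size([32, 1])

The target would then also need the shape [32, 1] (e.g. via l.float().unsqueeze(1)) before passing both to nn.BCEWithLogitsLoss.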