I’m trying to implement a multi-label classification task, and currently my model consists of an Embedding layer, a GRU, and two Linear layers.
I have padded the data, and its shape is (seq_len x batch), where seq_len is the length of the longest sequence in that batch. Targets are multi-hot encoded, as I’m using BCEWithLogitsLoss.
I have a weird issue: with batch size > 1 I get much lower accuracy (0.3) than with batch size = 1 (0.8). I suspected it might be a padding issue, but I was also able to reproduce it with same-length sequences. I’m trying my luck here in case anyone has encountered something similar and can share what the problem was.
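Here is roughly what my setup looks like; the vocabulary size, hidden sizes, and number of classes below are placeholders, not my actual hyperparameters:

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    # All dimensions are placeholder values for illustration.
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim)  # batch_first=False by default
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (seq_len, batch) of token indices
        embedded = self.embedding(x)             # (seq_len, batch, embed_dim)
        _, hidden = self.gru(embedded)           # hidden: (1, batch, hidden_dim)
        out = torch.relu(self.fc1(hidden[-1]))   # (batch, hidden_dim)
        return self.fc2(out)                     # raw logits for BCEWithLogitsLoss

criterion = nn.BCEWithLogitsLoss()  # targets: multi-hot float tensors, shape (batch, num_classes)
```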
I assume your input has the size [batch_size, seq_len]?
If so, then self.gru would get an input of [batch_size, seq_len, embedded_dim], while it expects an input of [seq_len, batch_size, input_size] in the default setup.
If my assumptions are correct, you could either permute the input or pass batch_first=True when creating the nn.GRU, which would then expect an input of [batch_size, seq_len, features].
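To make that concrete, here is a small sketch of both options (the dimensions are made up):

```python
import torch
import torch.nn as nn

batch_size, seq_len, embed_dim, hidden_dim = 4, 10, 64, 128
embedded = torch.randn(batch_size, seq_len, embed_dim)  # [batch_size, seq_len, embed_dim]

# Option 1: permute to the default [seq_len, batch_size, input_size] layout.
gru = nn.GRU(embed_dim, hidden_dim)
out, h = gru(embedded.permute(1, 0, 2))
print(out.shape)  # torch.Size([10, 4, 128]) -> [seq_len, batch_size, hidden_dim]

# Option 2: keep the batch-first layout and tell the GRU about it.
gru_bf = nn.GRU(embed_dim, hidden_dim, batch_first=True)
out_bf, h_bf = gru_bf(embedded)
print(out_bf.shape)  # torch.Size([4, 10, 128]) -> [batch_size, seq_len, hidden_dim]
```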