Hey friends,
I have been trying for a while now to implement a particular model with PyTorch, but I’m encountering some technical and conceptual issues that I would appreciate any help with.
On an abstract level, I have a sequence of vectors x, each one of which needs to be assigned a label, y.
I am running a two-layer bidirectional LSTM on x, which yields a new vector sequence, c.
If I apply a simple linear/log-softmax layer on top of c, I obtain reasonably good accuracy on y.
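To make the setup concrete, here is roughly what the working baseline looks like (the dimensions are made up; the real ones don't matter for the question):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, stand-ins for the real setup.
INPUT_DIM, HIDDEN_DIM, NUM_LABELS = 32, 64, 10

class BiLSTMTagger(nn.Module):
    """Two-layer bidirectional LSTM with a linear/log-softmax head per timestep."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(INPUT_DIM, HIDDEN_DIM, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Bidirectional => forward/backward states are concatenated, hence 2 * HIDDEN_DIM.
        self.out = nn.Linear(2 * HIDDEN_DIM, NUM_LABELS)

    def forward(self, x):
        c, _ = self.lstm(x)  # c: (batch, len(x), 2 * HIDDEN_DIM)
        return torch.log_softmax(self.out(c), dim=-1)  # (batch, len(x), NUM_LABELS)

model = BiLSTMTagger()
x = torch.randn(4, 7, INPUT_DIM)  # batch of 4 sequences of length 7
log_probs = model(x)
```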
The labels themselves, however, can be represented as short sequences, say s, made out of a small number of atomic symbols. It seemed intuitive to replace the linear/log-softmax layer with a simple decoder operating on each of the c vectors, so as to generate my labels in sequential format.
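The replacement decoder is something along these lines: a small recurrent cell seeded with the context vector c_t for each input position, unrolled for a few steps to emit the atomic symbols (again, all names and sizes here are hypothetical, and teacher forcing is omitted for brevity):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: CTX_DIM is the context width, SYM the atomic-symbol
# alphabet size, MAX_S the longest label sequence.
CTX_DIM, SYM, MAX_S = 128, 8, 5

class PerPositionDecoder(nn.Module):
    """Unrolls a GRU cell from each context vector c_t to emit a short
    symbol sequence for that position."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SYM, CTX_DIM)
        self.cell = nn.GRUCell(CTX_DIM, CTX_DIM)
        self.out = nn.Linear(CTX_DIM, SYM)

    def forward(self, c):
        # c: (batch * len(x), CTX_DIM) -- one context vector per input position,
        # used as the decoder's initial hidden state.
        h = c
        sym = torch.zeros(c.size(0), dtype=torch.long)  # start symbol = 0
        logits = []
        for _ in range(MAX_S):
            h = self.cell(self.embed(sym), h)
            step = self.out(h)            # (batch * len(x), SYM)
            logits.append(step)
            sym = step.argmax(dim=-1)     # greedy feedback of the last prediction
        return torch.stack(logits, dim=1)  # (batch * len(x), MAX_S, SYM)

dec = PerPositionDecoder()
c = torch.randn(4 * 7, CTX_DIM)  # 4 samples x 7 input positions, flattened
logits = dec(c)
```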
However, the new network seems incapable of learning, which I find quite baffling. Is there perhaps some conceptual point I am missing, or is it more likely that I have an implementation error?
Some extra info:

When applying cross-entropy to the labels in sequential format, the effective number of predictions explodes: rather than len(x) * batch_size, I am now optimizing over len(x) * batch_size * max(len(s)) terms (i.e. I have to predict the correct atomic symbol at each timestep of a label sequence, for each item in my original sequence, for each sample in the batch), which might inhibit learning. Reducing the batch size slows training down quite drastically but also does not seem to improve results.
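For reference, I compute the loss by flattening everything into one long list of predictions, padding the shorter label sequences with CrossEntropyLoss's default ignore_index so padded positions contribute nothing (shapes below are illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch=4, len(x)=7, max(len(s))=5, SYM=8 atomic symbols.
B, LX, LS, SYM = 4, 7, 5, 8
PAD = -100  # default ignore_index of nn.CrossEntropyLoss

logits = torch.randn(B, LX, LS, SYM)
targets = torch.randint(0, SYM, (B, LX, LS))
# Positions beyond each true label length are padding; marking them with
# ignore_index removes them from both the loss and the gradient.
targets[:, :, 3:] = PAD  # pretend every label sequence has length 3 here

loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
# Flatten to (B * LX * LS, SYM) vs (B * LX * LS,) -- one cross-entropy term
# per atomic-symbol prediction.
loss = loss_fn(logits.reshape(-1, SYM), targets.reshape(-1))
```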

If I opt for a non-reduced loss (i.e. CrossEntropyLoss(reduction='none')) and perform the summation manually over the batch and input-sequence axes, training takes far too long (due to the necessity of retaining the graph), and results are even poorer. I think I am a bit lost on how loss reduction works; shouldn't an unreduced (i.e. high-dimensional) loss provide more informative error signals for backpropagation to work on?
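As far as I can tell, reducing the unreduced loss to a scalar once and calling backward() a single time should be numerically identical to the built-in reduction, with no retain_graph needed; a minimal check of that (with toy shapes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 8, requires_grad=True)
targets = torch.randint(0, 8, (6,))

# Unreduced loss: one scalar per prediction...
per_item = nn.CrossEntropyLoss(reduction='none')(logits, targets)
# ...but backward() needs a single scalar, so reduce once and call
# backward() once -- no retain_graph, no per-element backward loop.
manual = per_item.mean()
builtin = nn.CrossEntropyLoss(reduction='mean')(logits, targets)

assert torch.allclose(manual, builtin)
manual.backward()  # one backward pass through the whole graph
```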

Pretraining the network using the linear/logsoftmax top layer and then replacing it does not improve results.

Training loss is largely unstable and often starts increasing as early as the first or second epoch, even when I use a smaller learning rate, at which point the results are nowhere near good.

Is the extra recurrent layer indirectly increasing the network's depth and capacity to the point where learning becomes infeasible?
EDIT: Core code can be found here (link to SO question)