I built a POS tagger and it’s having trouble learning (which is expected given the small dataset and large number of tags). To help, I’d like to remove the padding dimension from the output layer so it stops predicting padding as one of the tags. Right now pad_idx=0, and I’m wondering if there’s an easy (and reusable) way to always make the pad_idx the last element in the vocabulary, so that I can just set output_dim=len(TRG.vocab)-1. Or is there a more elegant approach to this?
I ended up not including a padding token in the vocabulary at all, and instead padded each batch with the out-of-vocabulary index len(TRG.vocab)+1. Whenever I needed the pad index (e.g. for ignore_index in the criterion), I just used len(TRG.vocab)+1 again. Note that I built my own Dataset and DataLoader, so this may not work if you’re not padding your batches yourself.
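For anyone reading later, here’s a minimal sketch of the idea, assuming a hypothetical tag vocabulary of size `V` (the names `V`, `PAD_IDX`, and `pad_batch` are mine for illustration, not from torchtext). The key point is that the padding index lies outside the vocabulary, so the output layer only has `V` units and can never predict padding, while `ignore_index` in `CrossEntropyLoss` masks the padded positions out of the loss:

```python
import torch
import torch.nn as nn

V = 5  # assumed tag vocab size for illustration; stands in for len(TRG.vocab)

# Pad with an index outside the vocabulary, as in the answer above.
PAD_IDX = V + 1

def pad_batch(seqs, pad_idx=PAD_IDX):
    """Right-pad a list of 1-D LongTensors to the longest length in the batch."""
    max_len = max(s.size(0) for s in seqs)
    out = torch.full((len(seqs), max_len), pad_idx, dtype=torch.long)
    for i, s in enumerate(seqs):
        out[i, : s.size(0)] = s
    return out

# Targets equal to ignore_index are skipped entirely, so it's fine that
# PAD_IDX is outside the range of valid classes [0, V).
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Two tag sequences of different lengths.
targets = pad_batch([torch.tensor([0, 2, 4]), torch.tensor([1, 3])])

# The output layer only produces V logits per position, so padding
# is never a possible prediction.
logits = torch.randn(2, targets.size(1), V)  # (batch, seq_len, V)
loss = criterion(logits.view(-1, V), targets.view(-1))
```

Since you’re writing your own collate function anyway, `torch.nn.utils.rnn.pad_sequence` with `padding_value=PAD_IDX` would do the same job as the hand-rolled `pad_batch` here.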