Proper way to compute loss and backprop for seq tagging with padded batches

I’m working on a sequence tagging model; my inputs are sentence/label pairs:

Pytorch   is   so   much   better   than   tensorflow   .     <- sentence
1         0    0    0      0        0      1            0     <- labels

I’m using a classic bi-LSTM with a softmax over the tags to get one prediction per timestep. At training time I use minibatches of shape batch_size * max_sent_length * input_emb_size, where max_sent_length is the length of the longest sentence in the batch; shorter sentences are zero-padded. I use nn.utils.rnn.pack_padded_sequence so the LSTM only computes over real timesteps. Once I’ve forwarded a batch through the net, I need to compute the loss and backpropagate, and I’m not sure how to do this properly:
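For reference, here is a minimal sketch of my setup (all dimensions are toy values for illustration, not my real hyperparameters):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy dimensions, just for illustration
batch_size, max_sent_length, input_emb_size, hidden, n_tags = 3, 5, 8, 2, 2

emb = torch.randn(batch_size, max_sent_length, input_emb_size)
lengths = torch.tensor([5, 3, 2])  # true sentence lengths, sorted descending

lstm = nn.LSTM(input_emb_size, hidden, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * hidden, n_tags)  # one score vector per timestep

packed = pack_padded_sequence(emb, lengths, batch_first=True)
out, _ = lstm(packed)                                 # only real timesteps computed
out, _ = pad_packed_sequence(out, batch_first=True)   # back to (B, T, 2*hidden)
logits = proj(out)                                    # (B, T, n_tags)
```

After pad_packed_sequence the padded timesteps are back (filled with zeros), which is exactly where my loss question comes from.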

  • Since the sentences are zero-padded, the only solution I see is to iterate over each entry in the batch so I can ignore the zero-padded timesteps. Is there a better way?
  • If I call the criterion multiple times to compute the loss, do I have to call backward each time? Or is there a smart way to accumulate the losses and call backward only once?
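For concreteness, this is the per-entry loop I have in mind (losses summed into one tensor, a single backward at the end); shapes and the lengths tensor are assumptions matching my batching above:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(reduction="sum")

# Toy data standing in for the network output
batch, max_len, n_tags = 3, 5, 2
logits = torch.randn(batch, max_len, n_tags, requires_grad=True)
labels = torch.randint(0, n_tags, (batch, max_len))
lengths = torch.tensor([5, 3, 2])  # true (unpadded) sentence lengths

total_loss = torch.zeros(())
n_tokens = 0
for i, L in enumerate(lengths):
    # slice off the zero-padded tail of each sequence
    total_loss = total_loss + criterion(logits[i, :L], labels[i, :L])
    n_tokens += int(L)

loss = total_loss / n_tokens  # mean over real (non-pad) tokens
loss.backward()               # one backward over the accumulated sum
```

I also noticed that CrossEntropyLoss has an ignore_index argument — would filling the padded label positions with that index and making a single criterion call over the whole batch be equivalent to this loop?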