Does PyTorch 0.4 take advantage of the GPU when batch training LSTMs with masked sequences?
While putting together a toy example of a multi-layer LSTM with teacher forcing (i.e., the whole input can be fed to the network at once), I noticed that I only see performance gains on the GPU versus the CPU when I increase the overall size of the network, not when I increase the batch size. That is, PyTorch does not seem to parallelize over the batch dimension.
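For reference, here is a minimal sketch of the kind of timing comparison I am running. The sizes and batch values are illustrative, not my actual network; the point is that the forward-pass time for the larger batch grows roughly in proportion to the batch size rather than staying flat:

```python
import time
import torch
import torch.nn as nn

# Illustrative sizes -- assumed for this sketch, not the real network.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seq_len, input_size, hidden_size, num_layers = 50, 32, 64, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True).to(device)

for batch_size in (8, 256):
    x = torch.randn(batch_size, seq_len, input_size, device=device)
    # Warm-up pass, then time one forward pass. If the GPU parallelized
    # over the batch, the larger batch should not take ~32x longer.
    lstm(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    out, (h, c) = lstm(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    print(f"batch={batch_size}: {time.perf_counter() - start:.4f}s")
```

(The `torch.cuda.synchronize()` calls are there because CUDA kernels launch asynchronously, so wall-clock timing without them is misleading.)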
I am using pack_padded_sequence for eventual use with masked sequences, but right now the sequences are uniform length to eliminate that variability. All of the following are moved to device when testing with the GPU:
- Hidden layers
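To make the setup concrete, here is a sketch of how I am packing the (currently uniform-length) sequences and moving everything to the device. All sizes are placeholders, and the initial hidden/cell states are created on the device explicitly:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Placeholder sizes -- assumed for illustration only.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size, seq_len, input_size, hidden_size, num_layers = 4, 10, 8, 16, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True).to(device)

# Uniform lengths for now, so packing changes only the representation,
# not the amount of computation.
x = torch.randn(batch_size, seq_len, input_size, device=device)
lengths = torch.full((batch_size,), seq_len, dtype=torch.long)
packed = pack_padded_sequence(x, lengths, batch_first=True)

# Initial hidden and cell states must live on the same device as the input.
h0 = torch.zeros(num_layers, batch_size, hidden_size, device=device)
c0 = torch.zeros(num_layers, batch_size, hidden_size, device=device)

packed_out, (hn, cn) = lstm(packed, (h0, c0))
```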
The loss function is relatively simple: it uses pairwise distance and acts directly on the packed sequences, without needing them to be re-padded.
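In sketch form, the loss works on the `.data` field of the `PackedSequence`, which stacks all valid timesteps into one 2-D tensor (again, sizes here are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical predictions and targets, shape (batch, seq, features).
pred = torch.randn(5, 3, 4)
target = torch.randn(5, 3, 4)
lengths = torch.full((5,), 3, dtype=torch.long)

packed_pred = pack_padded_sequence(pred, lengths, batch_first=True)
packed_target = pack_padded_sequence(target, lengths, batch_first=True)

# .data holds every valid timestep stacked as a 2-D tensor, so the loss
# can be computed directly on it -- no re-padding required.
dist = nn.PairwiseDistance(p=2)
loss = dist(packed_pred.data, packed_target.data).mean()
```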
Is this expected behavior, or am I missing some way to coax batch parallelism out of an RNN?
(Note: this is a single-GPU question. While trying to answer it myself, I found many descriptions of DataParallel, but that appears to be a multi-GPU solution.)