Speed of Multi-layer LSTMs using PackedSequence

Hi all,

I have a multi-layer LSTM, and I expected training to be faster with a packed sequence than with tensors padded to the longest sequence length. However, a comparison shows that the padded input is slightly faster than the packed one. In my understanding, the packed input should run fewer loop iterations inside each layer, shouldn't it?
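(For reference, here is a minimal sketch, on a toy batch separate from the benchmark below, of what I mean: pack_sequence stores only the real timesteps, and its batch_sizes field tells the RNN how many sequences are still active at each step.)

import torch
from torch.nn.utils.rnn import pack_sequence

# three sequences of lengths 3, 2, and 1, already sorted by decreasing length
seqs = [torch.ones(3, 1), 2 * torch.ones(2, 1), 3 * torch.ones(1, 1)]
packed = pack_sequence(seqs)

print(packed.data.shape)   # torch.Size([6, 1]): only 3 + 2 + 1 = 6 real steps are stored
print(packed.batch_sizes)  # tensor([3, 2, 1]): active sequences at each timestep

# a padded batch would run 3 sequences * 3 timesteps = 9 cell updates,
# whereas the packed batch runs only 6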

Thanks!

Do you have the code for the experiments? I would be quite interested, since I am using packed sequences too.

Here is the test code:

import time
import numpy as np
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_sequence, pad_sequence

device = torch.device("cuda")

batch_size = 32
input_size = 100
hidden_size = 512
seq_len_range = (50, 200)
epoch = 10

rnn = nn.LSTM(input_size, hidden_size, num_layers=4, bias=True)
rnn.to(device)

# generate random batches of variable-length sequences, sorted by
# decreasing length as pack_sequence requires
inputs = []
for _ in range(epoch):
    seqs = [torch.rand((np.random.randint(*seq_len_range), input_size)) for _ in range(batch_size)]
    seqs = sorted(seqs, key=lambda x: x.size(0), reverse=True)
    inputs.append(seqs)

# padded input: every batch is padded to the length of its longest sequence
start = time.time()
for seqs in inputs:
    x = pad_sequence(seqs).to(device)
    y = rnn(x)
end = time.time()
print(f"elapsed time for padded input: {end - start} secs")

# packed input: padding timesteps are skipped entirely
start = time.time()
for seqs in inputs:
    x = pack_sequence(seqs).to(device)
    y = rnn(x)
end = time.time()
print(f"elapsed time for packed input: {end - start} secs")

The result was:

elapsed time for padded input: 0.6328160762786865 secs
elapsed time for packed input: 0.6410703659057617 secs

Interestingly, in CPU mode, the packed input is faster than the padded one:

elapsed time for padded input: 27.869688272476196 secs
elapsed time for packed input: 20.38231635093689 secs
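One caveat on these timings: CUDA kernels are launched asynchronously, so time.time() can return before the queued LSTM kernels have actually finished, and the loop also measures the packing/padding and host-to-device copies rather than the RNN alone. A minimal sketch of a stricter GPU measurement, assuming the same rnn, inputs, and device as above, packs outside the timed region and synchronizes before reading the clock:

import time
import torch
from torch.nn.utils.rnn import pack_sequence

# assumes rnn, inputs, and device are defined as in the script above
packed_inputs = [pack_sequence(seqs).to(device) for seqs in inputs]

torch.cuda.synchronize()  # drain pending GPU work before starting the clock
start = time.time()
for x in packed_inputs:
    y = rnn(x)
torch.cuda.synchronize()  # wait for the queued LSTM kernels to finish
end = time.time()
print(f"elapsed time for packed input (synchronized): {end - start} secs")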