Batching with padded sequences and pack_padded_sequence

I’m new to PyTorch and trying to implement a language model with an LSTM. I’ve checked out two very good posts explaining padding and pack_padded_sequence:
https://suzyahyah.github.io/pytorch/2019/07/01/DataLoader-Pad-Pack-Sequence.html

However, I don’t see how I can pass the per-sample input lengths when batching, since I need them for pack_padded_sequence.

Also, I would appreciate an explanation of the collate_fn used in batching.

I’m not clear on what you’re trying to do in Q1, can you explain more?

collate_fn is the function used to combine a bunch of samples taken from a Dataset into something that can be fed into a module. For instance, if your Dataset returns a single tensor per index, then an acceptable collate_fn could be torch.cat or something similar that combines these tensors into a batch (usually a bigger tensor).
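For illustration, a minimal sketch of such a collate_fn for equal-sized samples (the name simple_collate is a placeholder; torch.stack is used here to create a leading batch dimension):

```python
import torch

# A minimal sketch: if each sample is a tensor of identical shape,
# the batch (a list of such tensors) can simply be stacked into
# one (batch_size, ...) tensor.
def simple_collate(batch):
    return torch.stack(batch, dim=0)
```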

After padding, I will need to use something like the following (from the 2nd link):

        X = torch.nn.utils.rnn.pack_padded_sequence(x, X_lengths, batch_first=True)

        # now run through LSTM
        X, self.hidden = self.lstm(X, self.hidden)

        # undo the packing operation
        X, _ = torch.nn.utils.rnn.pad_packed_sequence(X, batch_first=True)

The X_lengths parameter is needed for the RNN to ignore the padded parts, and it’s passed to the forward function. I have no problem calculating it for the entire dataset, but how do I pass it if I choose to batch the data?
I thought of using the collate function and returning the lengths for each batch, but then how are they passed to the forward function of the module?

You’re correct, it is usually done through the collate_fn. Just replace the default collate_fn of the DataLoader with your own and you’re good to go. The DataLoader first gets the samples from the Dataset, passes them to the collate_fn, and returns the result on each iteration. So whatever your collate_fn returns is what you get from the DataLoader.
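For example, a sketch of a collate_fn that pads a batch of variable-length sequences and also returns their true lengths (pad_collate and the dummy data are placeholders, not from the posts above):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    """Pad a list of variable-length 1-D tensors and keep their true lengths."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True)  # (batch, max_len_in_batch)
    return padded, lengths

# Dummy variable-length "token id" sequences standing in for a real Dataset.
data = [torch.randint(1, 100, (n,)) for n in (5, 3, 7, 2)]
loader = DataLoader(data, batch_size=2, collate_fn=pad_collate)

for x, x_lengths in loader:
    # x: (2, max_len_in_batch), x_lengths: (2,)
    # Both can now be passed to the model: model(x, x_lengths)
    print(x.shape, x_lengths)
```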

Check this example
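For instance, a minimal sketch of the model side, assuming a collate_fn like pad_collate above (the class name, layer sizes, and padding_idx are illustrative): the forward function simply takes the lengths as a second argument and hands them to pack_padded_sequence.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, x_lengths):
        x = self.embedding(x)  # (batch, max_len, embed_dim)
        # lengths must live on the CPU; enforce_sorted=False means the
        # batch does not have to be pre-sorted by length
        packed = pack_padded_sequence(
            x, x_lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_out, _ = self.lstm(packed)
        # undo the packing to get back a padded (batch, max_len, hidden) tensor
        out, _ = pad_packed_sequence(packed_out, batch_first=True)
        return self.fc(out)  # per-step logits

# Usage with a batch from the DataLoader above:
# logits = LSTMLanguageModel(vocab_size=100)(x, x_lengths)
```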