I’ve been working on implementing a seq2seq model and tried to use
torch.utils.data.DataLoader to batch data following the Data Loading and Processing Tutorial. It seems DataLoader cannot handle variable-length data. Are there other ways to batch data of different lengths?
You could create a transformation that trims or pads each sample to a specific length and then use the pack_padded_sequence function.
You need to customize your own dataloader.
What you need is basically to pad your variable-length inputs and torch.stack() them together into a single tensor. This tensor will then be used as the input to your model.
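A minimal sketch of the pad-then-stack idea, assuming each sample is a 1-D tensor and that 0 is an acceptable padding value. The helper name `pad_and_stack` is made up for illustration.

```python
import torch
import torch.nn.functional as F


def pad_and_stack(seqs, pad_value=0):
    """Right-pad 1-D tensors to the longest length, then stack to (batch, max_len)."""
    max_len = max(s.size(0) for s in seqs)
    # F.pad takes (left, right) padding amounts for the last dimension
    padded = [F.pad(s, (0, max_len - s.size(0)), value=pad_value) for s in seqs]
    return torch.stack(padded, dim=0)


seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]
batch = pad_and_stack(seqs)
print(batch.shape)  # torch.Size([3, 3])
```

torch.stack() only works once every tensor has the same shape, which is why the padding has to happen first.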
I think it’s worth mentioning that using pack_padded_sequence isn’t absolutely necessary.
pack_padded_sequence is designed to work with the LSTM/GRU/RNN implementations from cuDNN, which are optimized to run very fast.
But if you have your own proposed method that prevents you from using the standard LSTM/GRU/RNN, as mentioned here:
The easiest way to make a custom RNN compatible with variable-length sequences is to do what this repo does (masking) https://github.com/jihunchoi/recurrent-batch-normalization-pytorch
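A rough sketch of what masking can look like in a hand-written recurrent loop. This is not taken from the linked repo: `run_masked_rnn` is a hypothetical helper, and nn.RNNCell stands in for whatever custom cell you actually use. The idea is to keep the previous hidden state for sequences that have already ended, so padded steps have no effect.

```python
import torch
import torch.nn as nn


def run_masked_rnn(cell, inputs, lengths):
    """inputs: padded (T, B, input_size); lengths: (B,) true lengths.
    Returns each sequence's hidden state at its own last real step, shape (B, hidden_size)."""
    T, B, _ = inputs.shape
    h = inputs.new_zeros(B, cell.hidden_size)
    for t in range(T):
        h_new = cell(inputs[t], h)
        # (B, 1) mask: 1.0 where step t is real data, 0.0 where it is padding
        mask = (lengths > t).unsqueeze(1).to(h.dtype)
        # update only the sequences that are still running
        h = mask * h_new + (1 - mask) * h
    return h


torch.manual_seed(0)
cell = nn.RNNCell(input_size=4, hidden_size=8)
inputs = torch.randn(5, 2, 4)     # second sequence is padded after step 3
lengths = torch.tensor([5, 3])
final_h = run_masked_rnn(cell, inputs, lengths)
```

With this approach the batch does not need to be sorted by length, at the cost of still computing the cell on padded steps.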
Thx sir. Do you mean getting a batch of data and padding them manually? That’s exactly what I’m doing. I’m just wondering if there’s a ‘pytorch’ proper way to do this.
I meant to create your own Dataset class and then apply a transform that pads to a given length. An example of a custom dataset class below. The idea would be to add a transform that pads the tensors, so that every call of __getitem__() returns a padded tensor and thus the batch is all padded tensors. You could also have __getitem__() return a third value, the original length of the tensor, so you can do masking.
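One possible shape for such a Dataset, as a sketch only. The class name `PaddedSequenceDataset` and its arguments are assumptions for illustration, and the padding is done inline in __getitem__() rather than through a separate transform object.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PaddedSequenceDataset(Dataset):
    """Hypothetical dataset that trims/pads every sequence to max_len."""

    def __init__(self, sequences, targets, max_len, pad_value=0):
        self.sequences = sequences
        self.targets = targets
        self.max_len = max_len
        self.pad_value = pad_value

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = torch.as_tensor(self.sequences[idx])[: self.max_len]  # trim if too long
        length = seq.size(0)
        padded = seq.new_full((self.max_len,), self.pad_value)
        padded[:length] = seq
        # third value is the (trimmed) original length, usable for masking
        return padded, self.targets[idx], length


ds = PaddedSequenceDataset([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]], [0, 1, 0], max_len=4)
loader = DataLoader(ds, batch_size=3)
x, y, lengths = next(iter(loader))
print(x.shape)  # torch.Size([3, 4])
```

Because every item already has the same shape, the default collate function can stack the batch without any customization.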
I was wondering if there is a more efficient way of padding sequences. The easiest option is to pad all sequences to the maximum possible length; currently I have implemented my own Dataset object and use a Transform that pads all sequences to the same length. But is there a way to do that per batch rather than globally for the whole dataset (pad the batch when the DataLoader samples it)? Sounds like I need to create my own DataLoader?
I found a possible solution at: http://pytorch.org/docs/master/_modules/torch/utils/data/sampler.html.
Specifically, you can implement your own version of BatchSampler to pad according to the longest sequence in the batch. I will post my implementation when done.
I think you want to use the collate_fn function in the DataLoader class.
I did one there with packed sequences. I don’t know if this is the fastest way, but it would accomplish what you want to do. Also you could use any of the pre-built samplers that you wanted.
collate_fn was a good direction. I wrote some simple code that maybe someone here can reuse. I wanted to make something that pads a generic dim, and I don’t use an RNN of any type, so PackedSequence was a bit of overkill for me. It’s simple, but it works for me.
    import torch


    def pad_tensor(vec, pad, dim):
        """
        args:
            vec - tensor to pad
            pad - the size to pad to
            dim - dimension to pad

        return:
            a new tensor padded to 'pad' in dimension 'dim'
        """
        pad_size = list(vec.shape)
        pad_size[dim] = pad - vec.size(dim)
        # new_zeros keeps the dtype and device of vec
        return torch.cat([vec, vec.new_zeros(pad_size)], dim=dim)


    class PadCollate:
        """
        a variant of collate_fn that pads according to the longest sequence in
        a batch of sequences
        """

        def __init__(self, dim=0):
            """
            args:
                dim - the dimension to be padded (dimension of time in sequences)
            """
            self.dim = dim

        def pad_collate(self, batch):
            """
            args:
                batch - list of (tensor, label)

            return:
                xs - a tensor of all examples in 'batch' after padding
                ys - a LongTensor of all labels in batch
            """
            # find longest sequence
            max_len = max(x.shape[self.dim] for x, _ in batch)
            # pad according to max_len
            batch = [(pad_tensor(x, pad=max_len, dim=self.dim), y) for x, y in batch]
            # stack all
            xs = torch.stack([x for x, _ in batch], dim=0)
            ys = torch.LongTensor([y for _, y in batch])
            return xs, ys

        def __call__(self, batch):
            return self.pad_collate(batch)
To be used with the data loader:

    train_loader = DataLoader(ds, ..., collate_fn=PadCollate(dim=0))
Felix, I think your code only pads correctly if dim=0. This is because the pad vector in the pad_tensor function has *vec.size()[1:] hardcoded into it. I think you need to create a vector that is pad - vec.size(dim) in the dim dimension and not always in the zeroth dimension. However, I could be wrong. I adapted the code to work with python3 and added the ability to pad with different values, so I may have screwed something up in the process.
David, you are correct. I updated the pad function to work with any dimension.
If you are going to pack your padded sequences later, you can also immediately sort the batches from longest sequence to shortest:
    import numpy as np
    import torch


    def sort_batch(batch, targets, lengths):
        """
        Sort a minibatch by the length of the sequences with the longest sequences first
        return the sorted batch targets and sequence lengths.
        This way the output can be used by pack_padded_sequence(...)
        """
        seq_lengths, perm_idx = lengths.sort(0, descending=True)
        seq_tensor = batch[perm_idx]
        target_tensor = targets[perm_idx]
        return seq_tensor, target_tensor, seq_lengths


    def pad_and_sort_batch(DataLoaderBatch):
        """
        DataLoaderBatch should be a list of (sequence, target, length) tuples...
        Returns a padded tensor of sequences sorted from longest to shortest,
        """
        batch_size = len(DataLoaderBatch)
        batch_split = list(zip(*DataLoaderBatch))

        # unzip the batch into its three components
        seqs, targs, lengths = batch_split[0], batch_split[1], batch_split[2]
        max_length = max(lengths)

        # zero-pad every sequence up to the longest one in this batch
        padded_seqs = np.zeros((batch_size, max_length))
        for i, l in enumerate(lengths):
            padded_seqs[i, 0:l] = seqs[i][0:l]

        return sort_batch(torch.tensor(padded_seqs), torch.tensor(targs).view(-1, 1), torch.tensor(lengths))
This assumes that your Dataset spits out something like:

    def __getitem__(self, idx):
        return self.sequences[idx], torch.tensor(self.targets[idx]), self.sequence_lengths[idx]
And then you pass the pad_and_sort_batch collator to the DataLoader as:

    train_gen = Data.DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=pad_and_sort_batch)
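For reference, a small sketch of how a padded batch that is already sorted longest-first can be fed to an LSTM through pack_padded_sequence. The token ids, layer sizes, and the embedding layer are made-up assumptions for illustration and are not part of the collator above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Padded token ids (batch, max_len), sorted longest-first; 0 is the padding id
seqs = torch.tensor([[5, 3, 8, 2],
                     [7, 1, 4, 0],
                     [9, 6, 0, 0]])
lengths = torch.tensor([4, 3, 2])  # must be a CPU tensor (or a list)

embed = nn.Embedding(10, 6, padding_idx=0)
lstm = nn.LSTM(input_size=6, hidden_size=5, batch_first=True)

packed = pack_padded_sequence(embed(seqs), lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor; padded positions come back as zeros
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)  # torch.Size([3, 4, 5])
```

For a unidirectional LSTM, h_n holds each sequence's hidden state at its last real step, so the padding does not leak into the final state.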
For future readers: if, like me, you were looking for the above (which is great) but also want to batch your sequences by length to minimize the padding needed, I wrote a Batch Sampler for this: Tensorflow-esque bucket by sequence length
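The linked sampler is not reproduced here. As a rough idea of the technique, the following sketch sorts indices by length and cuts them into batches, so each batch holds sequences of similar length. The class name `LengthBucketBatchSampler` and its arguments are hypothetical, and it assumes the sequence lengths are known up front.

```python
import random
from torch.utils.data import Sampler


class LengthBucketBatchSampler(Sampler):
    """Hypothetical batch sampler: yields lists of indices with similar sequence lengths."""

    def __init__(self, lengths, batch_size, shuffle=True):
        self.lengths = list(lengths)
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        # sort indices by sequence length, then slice into consecutive batches
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [order[i:i + self.batch_size]
                   for i in range(0, len(order), self.batch_size)]
        if self.shuffle:
            random.shuffle(batches)  # shuffle batch order, not batch contents
        return iter(batches)

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size


sampler = LengthBucketBatchSampler([5, 1, 3, 2, 4, 6], batch_size=2, shuffle=False)
print(list(sampler))  # [[1, 3], [2, 4], [0, 5]]
```

It would be passed to the loader as DataLoader(dataset, batch_sampler=sampler, collate_fn=...), in which case batch_size and shuffle must not also be given to the DataLoader. Note that this simple version always groups the same indices together, so it trades some randomness for less padding.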
To answer the original question, you can pass a (simple and short) custom collate function to the data loader that uses pack_sequence. With enforce_sorted=False, pack_sequence does not require the sequences to be padded or sorted by length, so it is simpler to use.
Here is the code that does this (based on this answer to a similar question: How to create a dataloader with variable-size input):

    from torch.nn.utils.rnn import pack_sequence
    from torch.utils.data import DataLoader


    def my_collate(batch):
        # batch contains a list of tuples of structure (sequence, target)
        data = [item[0] for item in batch]
        data = pack_sequence(data, enforce_sorted=False)
        targets = [item[1] for item in batch]
        return [data, targets]

    # ...
    # later in your code, when you define your DataLoader, use the custom collate function
    loader = DataLoader(dataset,
                        batch_size,
                        shuffle,
                        collate_fn=my_collate,  # use custom collate function here
                        pin_memory=True)
This is how I solved it:
    import torch

    # `device` is assumed to be defined elsewhere, e.g. device = torch.device("cuda")


    def collate_fn_padd(batch):
        '''
        Pads a batch of variable length

        note: it converts things ToTensor manually here since the ToTensor transform
        assume it takes in images rather than arbitrary tensors.
        '''
        ## get sequence lengths
        lengths = torch.tensor([t.shape[0] for t in batch]).to(device)
        ## pad
        batch = [torch.Tensor(t).to(device) for t in batch]
        batch = torch.nn.utils.rnn.pad_sequence(batch)
        ## compute mask
        mask = (batch != 0).to(device)
        return batch, lengths, mask
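One caveat with a mask computed as batch != 0 is that genuine zeros in the data are also treated as padding. If that matters for your data, the mask can be built from the lengths instead. A minimal sketch, where `lengths_to_mask` is a made-up helper name and the (max_len, batch) layout is chosen to match the default output of pad_sequence:

```python
import torch


def lengths_to_mask(lengths, max_len=None):
    """Return a bool mask of shape (max_len, batch): True at real steps, False at padding."""
    max_len = int(lengths.max()) if max_len is None else max_len
    steps = torch.arange(max_len, device=lengths.device).unsqueeze(1)  # (max_len, 1)
    return steps < lengths.unsqueeze(0)  # broadcasts to (max_len, batch)


lengths = torch.tensor([3, 1, 2])
mask = lengths_to_mask(lengths)
print(mask.shape)  # torch.Size([3, 3])
```

For inputs with extra feature dimensions, the mask can be unsqueezed on the last axis before it is applied.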
- How to create batches of a list of varying dimension tensors?
- How to create a dataloader with variable-size input
- Using variable sized input - Is padding required?
- DataLoader for various length of data
- How to do padding based on lengths?
Stack Overflow version:
Here’s a collator I use; it works for tensors of any dimension:

    from typing import List

    import torch


    class ZeroPadCollator:

        @staticmethod
        def collate_tensors(batch: List[torch.Tensor]) -> torch.Tensor:
            dims = batch[0].dim()
            # largest size along every dimension across the batch
            max_size = [max([b.size(i) for b in batch]) for i in range(dims)]
            size = (len(batch),) + tuple(max_size)
            canvas = batch[0].new_zeros(size=size)
            for i, b in enumerate(batch):
                # narrow the zero canvas to the shape of b, then write b into it
                sub_tensor = canvas[i]
                for d in range(dims):
                    sub_tensor = sub_tensor.narrow(d, 0, b.size(d))
                sub_tensor.add_(b)
            return canvas

        def collate(self, batch) -> List[torch.Tensor]:
            # each sample is a tuple of tensors; collate each position separately
            dims = len(batch[0])
            return [self.collate_tensors([b[i] for b in batch]) for i in range(dims)]
Then I simply use:
    zero_pad = ZeroPadCollator()
    loader = DataLoader(train, args.batch_size, collate_fn=zero_pad.collate)
For others who might have the same issue with RNNs and variable-length sequences, here is my solution if your dataset's __getitem__ method returns a pair (seq, target):

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader


    def collate_fn_pad(list_pairs_seq_target):
        seqs = [seq for seq, target in list_pairs_seq_target]
        targets = [target for seq, target in list_pairs_seq_target]
        # pads at the end of each sequence; output shape is (max_len, batch, ...)
        seqs_padded_batched = pad_sequence(seqs)
        targets_batched = torch.stack(targets)
        assert seqs_padded_batched.shape[1] == len(targets_batched)
        return seqs_padded_batched, targets_batched


    dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn_pad)

    for seq, labels in dataloader:
        y_pred = rnn(seq)
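A quick check of what pad_sequence returns, since the axis order is easy to get wrong. The tensors here are dummy data with made-up sizes (two sequences of lengths 3 and 1, with 4 features each).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.ones(3, 4), torch.ones(1, 4)]

time_major = pad_sequence(seqs)                     # (max_len, batch, features)
batch_major = pad_sequence(seqs, batch_first=True)  # (batch, max_len, features)

print(time_major.shape)   # torch.Size([3, 2, 4])
print(batch_major.shape)  # torch.Size([2, 3, 4])
```

The default layout is time-major, which matches nn.RNN/nn.LSTM with batch_first=False. If the model was built with batch_first=True, pass batch_first=True to pad_sequence as well.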