Batching and outputting multiple objects with IterableDataset

lucf · March 31, 2020, 10:25am

Hi All,

I have a very specific application where I use grammar rules to generate random sequences of tokens (up to a maximum length), and train a VAE with these tokens, as in https://arxiv.org/pdf/1703.01925.pdf. I generate sequences on the fly in an IterableDataset, which converts the token sequences to collections of one-hot vectors (one per sequence timestep). Note that the VAE is defined as a Pytorch-Lightning model. Here’s the outline of my custom dataset class:

from torch.utils.data import IterableDataset


class SentenceGenerator(IterableDataset):
    def __init__(self, grammar, min_sample_len, max_sample_len, batch_size=256, seed=0):
        super().__init__()
        ...

    def generate_sentence(self):
        # generates a string of tokens ('sentence') using the rules encoded in self.grammar
        ...
        return ''.join(sent)

    def generate_one_hots(self):  
        # converts 'sentences' to sequences of one-hot vectors
        self.sents = [self.generate_sentence() for _ in range(self.batch_sz)]
        out = make_one_hot(self.grammar, self.tokenizer, self.prod_map, self.sents, max_len=self.max_len,
                           n_chars=self.n_chars)
        return out.transpose(2, 1)  # (batch_size, vocab_size, max_length)

    def __iter__(self):
        return iter(self.generate_one_hots())

Batching: For some reason, generating a single matrix of one-hot vectors and letting the DataLoader batch them didn’t work; the batches were always of size one, and I was having other downstream issues with Pytorch-Lightning. I therefore resorted to handling batching directly within SentenceGenerator, as you can see above. In the DataLoader, I then have to specify the same batch size as in the Dataset for batches to be generated. It is a bit hacky and causes further headaches downstream: understanding what an epoch is, when to step the LR scheduler, when to log a result, etc. Is there a more elegant way of achieving the same result?

Multiple outputs from the dataset: To help the VAE train, I would like to pass it the sentence lengths explicitly. These cannot be derived directly from the matrix of one-hot vectors, because each token can take a variable number of timesteps to produce. I therefore need to output the lengths directly from the IterableDataset, like so:

    def generate_one_hots(self):
        ...  # same as above except for `return` statement
        return out.transpose(2, 1), self.sent_lengths  # now returning 2 tensors

    def __iter__(self):
        return iter(self.generate_one_hots())

… and then unpack each batch in vae.forward() as x, n = batch, where n contains the lengths of the ‘sentences’ represented by the tensor of one-hot vectors x. However, in this implementation only the first object returned by generate_one_hots() is included in batch, and I get a “too few values to unpack” error. Would you have any suggestions? Ideally one that solves both issues at once!
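
To make the question concrete, here is a minimal sketch of the behaviour I am after, not a tested solution for the real model. One thing that is certain is that calling iter() on a tuple of two tensors yields the two tensors one after the other, not (one-hot, length) pairs, which probably explains why the lengths never arrive alongside x. The sketch instead yields one (one_hot, length) pair per sentence and leaves all batching to the DataLoader. ToySentenceGenerator, n_samples and the random tokens are hypothetical stand-ins for the real grammar code and make_one_hot.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class ToySentenceGenerator(IterableDataset):
    """Hypothetical stand-in for SentenceGenerator; yields one sample at a time."""

    def __init__(self, n_chars=5, max_len=8, n_samples=10, seed=0):
        super().__init__()
        self.n_chars = n_chars      # vocabulary size
        self.max_len = max_len      # maximum sequence length
        self.n_samples = n_samples  # assumed: samples per pass, i.e. per epoch
        self.seed = seed

    def __iter__(self):
        gen = torch.Generator().manual_seed(self.seed)
        for _ in range(self.n_samples):
            # Placeholder for generate_sentence() + make_one_hot():
            # random tokens instead of grammar productions.
            length = int(torch.randint(1, self.max_len + 1, (1,), generator=gen))
            tokens = torch.randint(0, self.n_chars, (length,), generator=gen)
            one_hot = torch.zeros(self.n_chars, self.max_len)
            one_hot[tokens, torch.arange(length)] = 1.0
            # One (one_hot, length) pair per sample; no batch dimension here.
            yield one_hot, length


# The DataLoader does the batching; the default collate function stacks the
# one-hot tensors and turns the integer lengths into a 1-D tensor.
loader = DataLoader(ToySentenceGenerator(), batch_size=4)
batches = [(x.shape, n.shape) for x, n in loader]
```

In this sketch each batch unpacks as x, n = batch, with x of shape (batch_size, n_chars, max_len) and n of shape (batch_size,), and n_samples makes the length of one pass over the data explicit. With num_workers > 0 each worker runs its own copy of __iter__, so the seed would need to differ per worker to avoid duplicated samples.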

Thanks a million in advance.