I have a very specific application where I use grammar rules to generate random sequences of tokens (up to a maximum length), and train a VAE with these tokens, as in https://arxiv.org/pdf/1703.01925.pdf. I generate sequences on the fly in an IterableDataset, which converts the token sequences to collections of one-hot vectors (one per sequence timestep). Note that the VAE is defined as a Pytorch-Lightning model. Here’s the outline of my custom dataset class:
```python
class SentenceGenerator(IterableDataset):
    def __init__(self, grammar, min_sample_len, max_sample_len, batch_size=256, seed=0):
        super().__init__()
        ...

    def generate_sentence(self):
        # generates a string of tokens ('sentence') using the rules encoded in self.grammar
        ...
        return ''.join(sent)

    def generate_one_hots(self):
        # converts 'sentences' to sequences of one-hot vectors
        self.sents = [self.generate_sentence() for _ in range(self.batch_sz)]
        out = make_one_hot(self.grammar, self.tokenizer, self.prod_map, self.sents,
                           max_len=self.max_len, n_chars=self.n_chars)
        return out.transpose(2, 1)  # (batch_size, vocab_size, max_length)

    def __iter__(self):
        return iter(self.generate_one_hots())
```
Batching: For some reason, generating a single matrix of one-hot vectors and letting the DataLoader batch them didn't work: the batches were always of size one, and I ran into other downstream issues with Pytorch-Lightning. I therefore resorted to handling batching directly within `SentenceGenerator`, as you can see above. In the DataLoader I then have to specify the same batch size as in the dataset for batches to come out correctly. It is a bit hacky and causes some headaches downstream in terms of understanding what an epoch is, when to step the LR scheduler, when to log a result, etc. Is there a more elegant way of achieving the same result?
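To make the setup concrete, here is a minimal toy version of the wiring I described (the class name, shapes, and batch size here are made-up stand-ins for my real grammar-based generator): iterating over the pre-batched tensor yields its rows one at a time, and the DataLoader just re-collates them, which only works if its `batch_size` matches.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

BATCH = 4  # toy stand-in for my real batch_size=256

class ToyGenerator(IterableDataset):
    """Toy stand-in for SentenceGenerator: builds one pre-batched
    one-hot tensor, then yields its rows one at a time."""
    def __iter__(self):
        out = torch.zeros(BATCH, 5, 7)  # fake (batch, vocab, length) one-hots
        return iter(out)  # iterating a tensor yields slices along dim 0

# The DataLoader's batch_size has to match the one baked into the dataset,
# otherwise batch/epoch boundaries stop lining up.
loader = DataLoader(ToyGenerator(), batch_size=BATCH)
x = next(iter(loader))
print(x.shape)  # torch.Size([4, 5, 7])
```

Note that with this scheme one "epoch" is exactly one batch, which is part of what makes the scheduler/logging bookkeeping confusing.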
Multiple outputs from the dataset: In order to help the VAE train, I would like to explicitly pass it the sentence lengths. These cannot be derived directly from the matrix of one-hot vectors, because each token can take a variable number of timesteps to produce. Therefore, I need to output the lengths directly from the IterableDataset, like so:
```python
    def generate_one_hots(self):
        ...  # same as above except for the `return` statement
        return out.transpose(2, 1), self.sent_lengths  # now returning 2 tensors

    def __iter__(self):
        return iter(self.generate_one_hots())
```
… and then unpack each batch as `x, n = batch`, where `n` contains the lengths of the 'sentences' represented by the tensor of one-hot vectors `x`. However, in this implementation only the first object returned by `generate_one_hots()` ends up in `batch`, and I get a "too few values to unpack" error. Would you have any suggestions? Ideally one that solves both issues at once!
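For what it's worth, I suspect the iterator itself is the culprit. A minimal repro (with toy tensors standing in for `out` and `self.sent_lengths`) shows that iterating over the returned tuple steps through its two elements, so the one-hots and the lengths are treated as two separate "samples" rather than paired outputs:

```python
import torch

out = torch.zeros(3, 5, 7)              # toy stand-in for the one-hot batch
sent_lengths = torch.tensor([4, 6, 7])  # toy stand-in for self.sent_lengths

# iter() over a tuple yields each element in turn, so a DataLoader sees
# the one-hot tensor and the lengths tensor as consecutive items, not
# as (one_hots, lengths) pairs.
items = list(iter((out, sent_lengths)))
print(len(items))      # 2
print(items[0].shape)  # torch.Size([3, 5, 7])
print(items[1].shape)  # torch.Size([3])
```

If that reading is right, it would explain why only the first tensor makes it into `batch`, but I'm not sure what the idiomatic fix is.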
Thanks a million in advance.