Complicated transform makes PyTorch DataLoader extremely slow

rkd1137 · July 18, 2021, 1:26am

I’m performing a slightly complicated form of compression on the MNIST dataset that involves sequentially taking QR decompositions of the input vector (i.e., start with an input vector of length L, reshape it into a matrix of shape (2, L/2), take a QR decomposition of that, then repeat on the R matrix). Because it is sequential, it is hard to parallelize, and it takes a nontrivial amount of time to perform (about 1 second on my machine). This would be fine, but PyTorch does not appear to do any kind of caching: successive epochs are just as slow as the first one. I’m fairly new to PyTorch, so I wanted to ask whether this is the expected behavior. If so, should I perform the transformation first, save my own local copy of the transformed dataset, and make a new DataLoader from that?

ptrblck · July 19, 2021, 6:08am

The Dataset doesn’t use any caching mechanism, as you would often lazily load each sample and transform it on-the-fly. Caching could thus disable the random transformations and could additionally blow up your memory (in case you want to store each sample).
Assuming your transformation is not random, your suggestion sounds valid: you could process all samples once, store them, and load the stored samples in Dataset.__getitem__.
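
As a rough illustration of the "process once, store, then load" idea, here is a minimal sketch. The class name PrecomputedDataset, the cache_path argument, and the dictionary keys are made up for this example, and it assumes the transform is deterministic, returns tensors of equal shape, and that all transformed samples fit in memory.

```python
import os

import torch
from torch.utils.data import Dataset


class PrecomputedDataset(Dataset):
    """Hypothetical wrapper: applies a deterministic transform once and caches it."""

    def __init__(self, base, transform, cache_path):
        if os.path.exists(cache_path):
            # Later runs: read the already transformed tensors from disk.
            cache = torch.load(cache_path)
            self.data, self.targets = cache["data"], cache["targets"]
        else:
            # First run: pay the transform cost once per sample.
            xs, ys = [], []
            for i in range(len(base)):
                x, y = base[i]
                xs.append(transform(x))
                ys.append(torch.as_tensor(y))
            self.data = torch.stack(xs)
            self.targets = torch.stack(ys)
            torch.save({"data": self.data, "targets": self.targets}, cache_path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Only indexing happens here, so every epoch is cheap.
        return self.data[idx], self.targets[idx]
```

The wrapped dataset can be passed to a DataLoader as usual; the expensive transform then runs only while the cache file is being created.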

rkd1137 · July 19, 2021, 7:07pm

Thanks. Is there a preferred method to do the processing? Right now, I was thinking of doing something like:

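import torch
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import Compose

dataset = MNIST('/tmp/mnist/',
                download=True,
                transform=Compose([# List of transforms
                                   ]), train=True)

# Load the whole training set as one batch so each sample is transformed once
dl = DataLoader(dataset, batch_size=60000)
x, y = next(iter(dl))
torch.save(x, "train_data.pt")
torch.save(y, "train_targets.pt")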

Is there a more efficient/preferred method?

ptrblck · July 19, 2021, 7:43pm

If your entire dataset contains 60000 samples and they also fit into your memory (MNIST would, but I don’t know whether you’ve used it as a placeholder), then your approach looks correct and you could later torch.load the data.
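
Loading the saved tensors back could look roughly like the sketch below. The helper name make_loader and its default arguments are assumptions for illustration; it only relies on the two files having been written with torch.save and holding the same number of samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(data_path, targets_path, batch_size=64, shuffle=True):
    """Hypothetical helper: build a DataLoader over tensors saved with torch.save."""
    x = torch.load(data_path)
    y = torch.load(targets_path)
    # TensorDataset indexes along the first dimension, so both tensors
    # must contain the same number of samples.
    return DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=shuffle)
```

With the file names used earlier in the thread, this would be called as make_loader("train_data.pt", "train_targets.pt").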
However, in case your dataset is larger (and doesn’t fit into your RAM), you might want to store each sample or multiple samples in each file (the latter case might make the data loading a bit more complicated).
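
For the larger-than-RAM case, a one-file-per-sample layout might be sketched as follows. The names save_per_sample and PerSampleFileDataset, the file naming scheme, and the dictionary keys are all hypothetical, and the transform is again assumed to be deterministic and to return a tensor.

```python
import os

import torch
from torch.utils.data import Dataset


def save_per_sample(base, transform, out_dir):
    """Hypothetical helper: transform each sample once and write it to its own file."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(len(base)):
        x, y = base[i]
        # clone() avoids serializing a larger shared storage when a tensor is a view.
        sample = {"x": transform(x).clone(), "y": torch.as_tensor(y).clone()}
        # Zero-padded names keep sorted order equal to index order (up to 10**6 files).
        torch.save(sample, os.path.join(out_dir, f"sample_{i:06d}.pt"))
    return len(base)


class PerSampleFileDataset(Dataset):
    """Hypothetical dataset that lazily reads one precomputed sample per file."""

    def __init__(self, root):
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.endswith(".pt")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Only the requested sample is read, so memory use stays small.
        sample = torch.load(self.paths[idx])
        return sample["x"], sample["y"]
```

Reading many small files has its own overhead, so DataLoader workers (num_workers > 0) can help hide the disk latency. Packing several samples into each file reduces the file count but requires mapping an index to a file and an offset, which is the extra complexity mentioned above.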