PyTorch equivalent to tensorpack's DataFlow/LMDB Serializer

glefundes · April 16, 2020, 12:22pm

Hi!

When using the regular Dataset/DataLoader interfaces, IO speed is a clear bottleneck for me (I have no access to an SSD), and training leaves the GPU starved for data. I have partly overcome this by using tensorpack’s LMDB serializer to store the data and then reading it back as bytes, sequentially, from a single LMDB file.

The thing is, tensorpack cannot serialize PyTorch tensors or PIL images, which leaves me having to do the preprocessing (converting from NumPy arrays to tensors and applying torchvision transforms) at training time. While this method is significantly faster overall, the preprocessing is still a huge bottleneck. Besides, for some reason the network does not converge as well as it does with PyTorch’s regular Dataset/DataLoader (even though the latter takes 6x longer per epoch).

My question is: is there a package equivalent to tensorpack that can serialize native PyTorch formats and read them from an LMDB database or similar? If not, is there a more efficient way to do what I’m doing here? I realize the simple answer would be to get an SSD for the machine, but that’s out of my control.
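
For illustration, the kind of workflow being asked about (preprocess once, serialize the result into a single file, read it back by index during training) could be sketched roughly as follows. This is only a sketch under stated assumptions, not tensorpack's API or an existing PyTorch package: Python's built-in `sqlite3` stands in for LMDB so the example runs without extra dependencies, and the names `write_samples` and `SerializedDataset` are hypothetical.

```python
import pickle
import sqlite3


def write_samples(db_path, samples):
    """Serialize already-preprocessed samples once, ahead of training.

    Hypothetical helper: sqlite3 is used here as a stand-in for LMDB.
    """
    con = sqlite3.connect(db_path)
    with con:  # commits the transaction on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS samples "
            "(idx INTEGER PRIMARY KEY, payload BLOB)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO samples VALUES (?, ?)",
            (
                (i, pickle.dumps(s, protocol=pickle.HIGHEST_PROTOCOL))
                for i, s in enumerate(samples)
            ),
        )
    con.close()


class SerializedDataset:
    """Map-style dataset (__len__ / __getitem__) over the single-file store."""

    def __init__(self, db_path):
        self.db_path = db_path
        self._con = None  # opened lazily, so each worker process opens its own handle
        con = sqlite3.connect(db_path)
        self._len = con.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
        con.close()

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._con is None:
            self._con = sqlite3.connect(self.db_path)
        row = self._con.execute(
            "SELECT payload FROM samples WHERE idx = ?", (idx,)
        ).fetchone()
        if row is None:
            raise IndexError(idx)
        # Deserialize the stored bytes back into the original Python object.
        return pickle.loads(row[0])
```

In this sketch the samples can be any picklable objects. PyTorch tensors are picklable, so already-converted tensors could in principle be stored this way, and an object with `__len__` and `__getitem__` can be passed to `torch.utils.data.DataLoader` as a map-style dataset. Random augmentations would still have to run at load time, since only deterministic preprocessing can be computed ahead of training.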