Best way to load sparse data in PyTorch

Hi all, I am working with bag-of-words features. My dataset is n x p, where n is the number of documents and p is the vocabulary size. p is quite large (about 16 million), but the matrix is sparse.

The feature matrix is a SciPy sparse matrix in CSR format, which I wrap in a custom PyTorch Dataset (`__getitem__` returns a SciPy sparse X and a NumPy y with values in {0, 1}). I then do my own batching without the PyTorch DataLoader: it converts X into a PyTorch sparse tensor and y into a plain FloatTensor and arranges them in the following format:
[ [X_b1, y_b1], [X_b2, y_b2], … ], which I then use as normal in training. I am getting decent results with this in terms of speed.
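To make the batching described above concrete, here is a minimal sketch of that kind of conversion. The helper names `csr_to_torch_sparse` and `make_batches` are hypothetical, the data is a tiny random stand-in, and the choice of COO layout and float32 values is an assumption rather than a detail from the actual code.

```python
import numpy as np
import scipy.sparse as sp
import torch


def csr_to_torch_sparse(X):
    """Convert a SciPy sparse matrix to a torch sparse COO tensor (assumed layout)."""
    coo = X.tocoo()
    # torch expects indices as a 2 x nnz int64 tensor: first row ids, then column ids.
    indices = torch.from_numpy(np.vstack((coo.row, coo.col)).astype(np.int64))
    values = torch.from_numpy(coo.data.astype(np.float32))
    return torch.sparse_coo_tensor(indices, values, size=coo.shape)


def make_batches(X, y, batch_size):
    """Slice X (CSR) and y (NumPy) into [[X_b1, y_b1], [X_b2, y_b2], ...]."""
    batches = []
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        X_b = csr_to_torch_sparse(X[start:stop])  # CSR row slicing is cheap
        y_b = torch.from_numpy(y[start:stop].astype(np.float32))
        batches.append([X_b, y_b])
    return batches


# Tiny stand-in data; the real p is about 16 million.
X = sp.random(10, 1000, density=0.01, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=10)
batches = make_batches(X, y, batch_size=4)
```

Slicing contiguous row ranges from the CSR matrix and converting once per batch keeps the Python overhead per batch, not per document, which is probably a large part of why this approach is reasonably fast.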

I did try using the standard PyTorch DataLoader, but for some reason it was a lot slower than the custom batching described above.
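For comparison, here is one way a DataLoader can be made to fetch a whole batch of rows per call instead of collating documents one at a time. This is a sketch under assumptions, not the setup that was actually tried: the class name `SparseBatchDataset` is hypothetical, and it relies on passing `batch_size=None` to disable automatic batching so that a `BatchSampler` hands a list of row indices straight to `__getitem__`.

```python
import numpy as np
import scipy.sparse as sp
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler


class SparseBatchDataset(Dataset):
    """Hypothetical dataset whose __getitem__ receives a list of row indices."""

    def __init__(self, X, y):
        self.X = X  # SciPy CSR matrix, n x p
        self.y = y  # NumPy array of 0/1 labels, length n

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        # idx is a list of row indices supplied by the BatchSampler.
        coo = self.X[idx].tocoo()
        indices = torch.from_numpy(np.vstack((coo.row, coo.col)).astype(np.int64))
        values = torch.from_numpy(coo.data.astype(np.float32))
        X_b = torch.sparse_coo_tensor(indices, values, size=coo.shape)
        y_b = torch.from_numpy(self.y[idx].astype(np.float32))
        return X_b, y_b


# Tiny stand-in data; the real p is about 16 million.
X = sp.random(10, 1000, density=0.01, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=10)

dataset = SparseBatchDataset(X, y)
sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
# batch_size=None turns off automatic batching, so each sampler item
# (a list of indices) is passed to dataset[...] as a single key.
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

loaded = [(X_b, y_b) for X_b, y_b in loader]
```

Swapping `SequentialSampler` for `RandomSampler` would shuffle rows per epoch. Whether this is actually faster than hand-rolled batching would need to be measured on the real data.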

I am looking for ways to speed up my code. I believe there should be faster ways to load and batch sparse inputs in PyTorch, so I would appreciate any code samples or recommendations.

