Batch prediction on a big dataset

Hello all, I am pretty new to PyTorch, so I hope this is not too dumb a question, but I am running into a problem with prediction on a dataset using my trained PyTorch model. I have a trained model for sentiment analysis, and training itself went well. But now I need to apply it to a dataset of about 1.1 million unlabeled texts. I have tried doing so a few times, but RAM usage ends up so high that either my Colab session or my own computer crashes.
Someone suggested running the prediction in batches, but I don’t understand how to do that. Could someone point me in the right direction for predicting new labels in this use case? Any help would be greatly appreciated!

I’m not sure if your use case has specific requirements, but the common approach would be to use a Dataset, wrap it in a DataLoader, and process each batch with the model, roughly as in the sketch below.
The data loading tutorial might be a good starting point.
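
Here is a minimal sketch of that loop. Names such as `texts`, `encode`, `model`, and `device` are placeholders for your own data and code, and the model is assumed to return raw class logits:

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Assumptions: `texts` is a list of raw strings, `encode` is your own
# preprocessing that turns one text into a fixed-size tensor of token ids,
# and `model` / `device` already exist from your training setup.
class InferenceDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Return one preprocessed sample; the DataLoader's default
        # collate function stacks these into a batch tensor.
        return encode(self.texts[idx])

loader = DataLoader(InferenceDataset(texts), batch_size=64)

model.eval()                      # disable dropout etc.
all_preds = []
with torch.no_grad():             # no gradients needed for inference
    for batch in loader:
        batch = batch.to(device)
        logits = model(batch)     # assumed shape: (batch_size, num_classes)
        # Move only the small prediction tensor back to the CPU so memory
        # usage stays bounded by the batch size, not the dataset size.
        all_preds.append(logits.argmax(dim=-1).cpu())

all_preds = torch.cat(all_preds)  # one predicted label per input text
```

The key points are calling `model.eval()` and wrapping the loop in `torch.no_grad()`, and only keeping the small prediction tensors around, so memory usage depends on the batch size rather than on all 1.1 million texts at once.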

Thank you for the suggestion! I am using the method suggested here for training the model: ML-and-Data-Analysis/RoBERTa for text classification.ipynb at master · aramakus/ML-and-Data-Analysis · GitHub
It saves the model as two files: model.pkl and metrics.pkl.
Basically, I am trying to predict new labels using this model. Is the DataLoader then the right approach?

If you would like to create predictions for a whole dataset, using a DataLoader sounds like a good approach. On the other hand, if you would like to get a prediction for a single sample, you can pass it directly to the model, so the “best” approach depends a bit on your use case.
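
For the single-sample case, something like this sketch would do (assuming `sample` is one already-preprocessed input tensor and `model` returns logits, as above):

```python
import torch

model.eval()
with torch.no_grad():
    logits = model(sample.unsqueeze(0))  # add a batch dimension of 1
    pred = logits.argmax(dim=-1)         # predicted class for this sample
```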