NLP with torchtext on multiple CUDA devices - best strategy for a large imbalanced dataset

Hi,

I am working on a multiclass classification problem with a large imbalanced dataset. Over the next few weeks I will have the opportunity to use 4 CUDA devices. The structure of the dataset is simple: TEXT, LABEL.
What is the best strategy for using torchtext in such a situation? Unfortunately, I could not find an example of training on multiple CUDA devices with such a dataset (a large imbalanced dataset of about 20 GB).

Thanks to everyone for help and links to working examples.

You should be able to use DistributedDataParallel to utilize all GPUs in your model training. I’m not aware of any torchtext-specific limitations for this.
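
For reference, here is a minimal sketch of the DDP setup, assuming one process per GPU launched via `torchrun`; the bag-of-embeddings model and the size constants are placeholders for your own architecture:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical sizes -- replace with your vocab/label counts.
VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 50_000, 64, 10

class TextClassifier(nn.Module):
    """Minimal bag-of-embeddings stand-in model: expects a padded
    (batch, seq_len) LongTensor of token ids; each row is one bag."""
    def __init__(self):
        super().__init__()
        self.embedding = nn.EmbeddingBag(VOCAB_SIZE, EMBED_DIM)
        self.fc = nn.Linear(EMBED_DIM, NUM_CLASSES)

    def forward(self, tokens):
        return self.fc(self.embedding(tokens))

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = TextClassifier().to(local_rank)
model = DDP(model, device_ids=[local_rank])
```

Launch with e.g. `torchrun --nproc_per_node=4 train.py` to use all 4 GPUs.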

@ptrblck Thank you for your reply.
I tried to find an example of multiclass classification on multiple CUDA devices, but without success. Do you know of any such examples?

Kind regards

You could start by creating a training script for the multiclass classification on a single device and then add distributed training on top of it, using e.g. this tutorial.
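
As a rough sketch of what that might look like (building on the DDP setup above; `train_dataset`, `class_counts`, and the batch format are assumptions about your pipeline, not torchtext APIs), the single-device loop mainly gains a `DistributedSampler` so each process sees a disjoint shard of the data, and the class imbalance can be handled with a frequency-weighted loss:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assumed to exist: `train_dataset` yielding (input, label) pairs,
# `model` already wrapped in DDP, `local_rank`, and `class_counts`,
# a 1-D tensor holding the number of training samples per class.
sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

# One common way to handle class imbalance: weight the loss
# by inverse class frequency.
weights = 1.0 / class_counts.float()
criterion = torch.nn.CrossEntropyLoss(weight=weights.to(local_rank))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for inputs, labels in loader:
        inputs, labels = inputs.to(local_rank), labels.to(local_rank)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
```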