Single model on a single GPU with multiprocessing

Hi everyone

I have samples of size TxNxD, and each sample may have a different T and N. Since T and N can vary a lot, I do not want to pad. As a result, I have to use a batch size of 1.
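To make the setup concrete, here is a minimal sketch of the situation. It is not my actual code: the dataset class `ToyVariableDataset`, the sample shapes, D = 8, and the `torch.nn.Linear` placeholder model are all hypothetical and only stand in for the real ones.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyVariableDataset(Dataset):
    """Hypothetical toy dataset: sample i is a T x N x D tensor with its own T and N."""

    def __init__(self, shapes, d=8):
        self.shapes = shapes  # list of (T, N) pairs, one per sample
        self.d = d            # feature size D, shared by all samples

    def __len__(self):
        return len(self.shapes)

    def __getitem__(self, idx):
        t, n = self.shapes[idx]
        return torch.randn(t, n, self.d)


# Four samples with very different T and N (made-up values).
dataset = ToyVariableDataset([(5, 3), (12, 7), (2, 20), (9, 4)])

# batch_size=1 avoids padding: each batch holds exactly one sample.
loader = DataLoader(dataset, batch_size=1, shuffle=False)

model = torch.nn.Linear(8, 1)  # placeholder for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

seen_shapes = []
for x in loader:  # x has shape (1, T, N, D)
    seen_shapes.append(tuple(x.shape))
    optimizer.zero_grad()
    loss = model(x).mean()  # dummy loss, only to make the loop complete
    loss.backward()
    optimizer.step()
```

Every step processes a single small tensor, which is why most of the GPU memory stays unused.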

Do you have any advice on how to accelerate training? I have a server with multiple 12GB GPUs and multiple CPUs, but my model only uses a little over 1GB of GPU memory :frowning:

I have read about the convenient torch.nn.DataParallel API, but it seems to require that all samples have the same size so that they can be stacked into batches. Right now, I am reading the DistributedDataParallel documentation.

So could you please tell me what I should do? Pointers to detailed documentation, or even some toy code, would be ideal. (I have limited knowledge of multiprocessing and distributed training.)

Thanks a lot