Training a model with CUDA streams

Hi,

For various reasons, my model has to be trained with a batch size of one, which is very inefficient on a GPU.

I’d like to improve the efficiency by using CUDA streams.

For instance, I’d like to train my model asynchronously on the GPU using 32 streams, to get the effect of a batch size of 32.
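
To make the question concrete, here is a minimal sketch of what I have in mind, not a verified solution. The names `train_step` and `num_streams` are hypothetical, and I am assuming that `torch.cuda.Stream`, `torch.cuda.stream(...)` and `wait_stream` can be combined this way: each single-sample forward pass is issued on its own stream, the per-sample losses are averaged, and one backward pass and one optimizer step follow. Whether the kernels actually overlap on the GPU is something I have not confirmed. On a machine without CUDA the function falls back to a plain loop.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, samples, targets, num_streams=4):
    """One update from several batch-size-1 samples (hypothetical helper).

    Each sample's forward pass is queued on a side CUDA stream when CUDA
    is available; otherwise the samples are processed sequentially.
    """
    loss_fn = nn.MSELoss()
    optimizer.zero_grad()
    losses = []

    if torch.cuda.is_available():
        main = torch.cuda.current_stream()
        streams = [torch.cuda.Stream() for _ in range(num_streams)]
        for i, (x, y) in enumerate(zip(samples, targets)):
            s = streams[i % num_streams]
            # Side stream must see parameters/inputs prepared on the main stream.
            s.wait_stream(main)
            with torch.cuda.stream(s):
                losses.append(loss_fn(model(x), y))
        # Main stream waits for all side streams before combining the losses.
        for s in streams:
            main.wait_stream(s)
    else:
        for x, y in zip(samples, targets):
            losses.append(loss_fn(model(x), y))

    # Averaging the per-sample losses gives the same gradient as one batch
    # containing all the samples (assuming no batch-dependent layers).
    loss = torch.stack(losses).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

I would call it with lists of single-sample tensors, e.g. `train_step(model, optimizer, xs, ys, num_streams=32)` where every element of `xs` has a leading batch dimension of 1 and lives on the same device as the model.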

I can do something like this on the CPU with the help of https://pytorch.org/docs/master/notes/multiprocessing.html.

Does anyone know how to do this on the GPU, or know of another way?

A simple code example would be really helpful.

Thank you.