Training a model in parallel for a special task

I have one GPU card, an A100 with 80 GB of memory.
For an NLP task, I am trying to train my model in parallel, but it does not work because my data is a bit unusual.

Here is an example of my data (a minimal code sketch follows the list):

sample 1: 5 batches, each batch of size 500
sample 2: 1 batch, each batch of size 500
sample 3: 1574 batches, each batch of size 500
...
sample 1000000: ? batches, each batch of size 500
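
To make the layout concrete, here is a minimal sketch of how I think of the data in code; the names, tensor shapes, and the `make_sample` helper are purely illustrative, not my real pipeline:

```python
import torch

# Illustrative only: each sample is a list of batches; every batch holds
# 500 items, but the number of batches differs from sample to sample.
def make_sample(num_batches, batch_size=500, feature_dim=16):
    return [torch.randn(batch_size, feature_dim) for _ in range(num_batches)]

samples = {
    "sample_1": make_sample(5),     # 5 batches of 500
    "sample_2": make_sample(1),     # 1 batch of 500
    "sample_3": make_sample(1574),  # 1574 batches of 500
}
```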

As you can see, each sample has a different number of batches. The best workaround I can think of is to flatten everything so that every sample has only 1 batch.

For example:

sample 3-1: 1 batch, each batch of size 500
sample 3-2: 1 batch, each batch of size 500
...
sample 3-1574: 1 batch, each batch of size 500
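
In code, that flattening would look roughly like this (again, names and shapes are illustrative):

```python
import torch

# Illustrative only: flatten each multi-batch sample into independent 1-batch
# samples, e.g. sample_3 with 1574 batches becomes sample_3-1 ... sample_3-1574.
samples = {
    "sample_1": [torch.randn(500, 16) for _ in range(5)],
    "sample_3": [torch.randn(500, 16) for _ in range(1574)],
}

flattened = {
    f"{name}-{i}": [batch]
    for name, batches in samples.items()
    for i, batch in enumerate(batches, start=1)
}

print(len(flattened))  # 5 + 1574 independent 1-batch samples
```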

But I don't want to do that, because the batches within a sample may be related to each other; if I split them apart, performance may suffer.

So my question is: is there a way to use a single GPU with multiple processes?

Here is a similar method I found:

Multiprocessing best practices — PyTorch master documentation

But it does not work for me: when I try to run it in a Jupyter notebook, it always shows a CUDA initialization error message.
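
For reference, this is roughly the single-GPU, multi-process setup I have in mind, based on that docs page; the model, worker function, and sample ids below are placeholders, not my real code. As far as I understand, it would have to run as a plain `.py` script with the `spawn` start method rather than inside a notebook, since CUDA cannot be re-initialized in forked child processes:

```python
import torch
import torch.multiprocessing as mp


def worker(rank, model, sample_ids):
    # Placeholder: each process handles its own subset of samples on the same GPU.
    device = torch.device("cuda:0")
    for _ in sample_ids:
        x = torch.randn(500, 16, device=device)  # stand-in for one 500-item batch
        _ = model(x)
    print(f"worker {rank} finished {len(sample_ids)} samples")


if __name__ == "__main__":
    # "spawn" is needed so each child process can initialize CUDA.
    mp.set_start_method("spawn", force=True)

    model = torch.nn.Linear(16, 4).to("cuda:0")
    model.share_memory()  # make parameters shareable across processes (as in the docs' Hogwild example)

    sample_ids = list(range(8))          # placeholder sample identifiers
    workers = []
    for rank in range(2):                # two processes, one GPU
        p = mp.Process(target=worker, args=(rank, model, sample_ids[rank::2]))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()
```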