After I initialize my model, I want to train it separately on different datasets (X1, Y1), (X2, Y2), …, (X50, Y50), and so on. My naive way to do it is to train it on (X1, Y1), save the weights, then re-initialize and train it on (X2, Y2), save those weights, and repeat for each dataset (Xi, Yi).
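For reference, here is a minimal sketch of that naive loop. The `make_model` helper, the toy random data, and the Adam settings are hypothetical stand-ins, not my real setup:

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)


def make_model():
    # Hypothetical stand-in for the real model; a fresh call re-initializes the weights.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))


# Hypothetical toy data; in practice these are the real (Xi, Yi) pairs.
datasets = [(torch.randn(64, 4), torch.randn(64, 1)) for _ in range(3)]
loss_fn = nn.MSELoss()
saved_weights = []  # could also be written to disk with torch.save(state_dict, path)

for x, y in datasets:
    model = make_model()  # re-initialize for each dataset
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    for epoch in range(50):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    # Keep an independent copy of the trained parameters Pi.
    saved_weights.append(copy.deepcopy(model.state_dict()))
```

Each pass through the outer loop is completely independent of the others, which is why it feels like it should be possible to run them side by side.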

But the model is exactly the same each time, so I feel like there should be some way to take advantage of PyTorch’s optimized GPU-parallelized autograd to train for all these different labels Yi at the same time, rather than doing it one at a time. How can I most efficiently use PyTorch for this task?

I am trying to train it separately. So I have one set of parameters P1 trained on (X1, Y1), another set of parameters P2 trained on (X2, Y2), and so on.

Well, you can do it the normal way and just set the batch size to 1 (so the network is fed only one feature/target pair at a time). Then you create another for loop that trains on that single data point for n cycles, write a function that saves each set of weights to a different file in a folder, and write another function that resets the weights for the next data point.

If you want to keep track of which weights belong to which data point, then don’t shuffle your dataset.

As I said, I am already training on one dataset (Xi, Yi) at a time, saving the parameters Pi, re-initializing, and doing it again for (Xi+1, Yi+1), and so on.

My question was: since it’s the same model, loss function, and optimizer each time, is there an efficient way for me to train on these different datasets at the same time, the way I train on multiple examples within one dataset at the same time via minibatches? Except, instead of averaging the gradients like in a minibatch, I want to keep multiple sets of parameters Pi, one for each dataset (Xi, Yi), and only update Pi with the gradients from the examples in the minibatch that come from (Xi, Yi).

Hmmm🤔
If you are looking for a more efficient way to do this, well… I mean, you can try what I said in my previous answer, because that will also save and update parameters Pi for (Xi, Yi), Pi+1 for (Xi+1, Yi+1), and so on.

What I can suggest you do to make things faster:
Use the garbage collector to free redundant variables

Share the work among different GPUs (but you said you have only 1, so you should consider a cloud service)

Prototype the implementation in Python and do the main implementation in C++, then you can run inference in Python

You can keep a copy of your model’s weights and your optimizer’s state for each of your datasets, and alternately switch between these sets of weights as you train on each dataset.
But this is a confusing way to train…
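A minimal sketch of that switching approach, assuming a toy `nn.Linear` model, random data, and SGD with momentum (all hypothetical stand-ins for the real setup). It keeps one model and one optimizer object and swaps a saved state in and out for each dataset:

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy datasets; replace with the real (Xi, Yi) pairs.
datasets = [(torch.randn(32, 4), torch.randn(32, 1)) for _ in range(3)]
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()
num_steps = 20

# Every dataset starts from the same initial model and optimizer state.
init_model_state = copy.deepcopy(model.state_dict())
init_optim_state = copy.deepcopy(optimizer.state_dict())
states = [
    {"model": copy.deepcopy(init_model_state), "optim": copy.deepcopy(init_optim_state)}
    for _ in datasets
]

for step in range(num_steps):
    for i, (x, y) in enumerate(datasets):
        # Swap in the parameters Pi and optimizer state that belong to dataset i.
        model.load_state_dict(states[i]["model"])
        optimizer.load_state_dict(states[i]["optim"])

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

        # Store independent copies so the next swap does not overwrite them.
        states[i] = {
            "model": copy.deepcopy(model.state_dict()),
            "optim": copy.deepcopy(optimizer.state_dict()),
        }
```

This only interleaves the training. Each step still processes one dataset, so it is not faster than training them one after another, and the extra copying adds overhead.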

With one GPU, if your script uses <50% of GPU memory, your best bet to increase GPU utilization is launching multiple scripts (processes).

In-process parallelization, using multiple CUDA streams with replicated model modules, may work, but it is hard to write this correctly (in a nutshell, you must avoid blocking CUDA operations [like device-to-host copies and item()], or use Python threads).

And as far as I know, there is no simple built-in way to do tensor-level parallelization over multiple input + parameter sets.