I have the following problem. Imagine that, for whatever reason, you want to divide the batch used to perform a step into smaller ones; this could happen, for example, if you want to test your model with a batch size bigger than your memory capacity, so that you end up with a micro-batching approach.
In my case I am using micro batching for a different reason: I want to separate the gradient contribution associated with each class.
To perform micro batching I define a dataloader for each class and, at each step, I run a for loop that selects a batch from one of them, computes the associated gradient, and copies it into a temp variable.
When all the classes have propagated one of their batches the for loop ends, I update the weights, and I move on to the next step, repeating the same procedure.
```python
for key in set(NetInstance.TrainDL):  # NetInstance is a class where TrainDL (a dict of dataloaders) is defined
    # for each class we select a single batch from its dataloader and repeat the above procedure
    try:
        img, lab = next(ClassesIterables[key])
    except StopIteration:
        # when we have exhausted the dataset we reshuffle and restart with the new sequence
        ClassesIterables[key] = iter(NetInstance.TrainDL[key])
        img, lab = next(ClassesIterables[key])
    img = img.double()
    # load data on the device
    img = img.to(device)
    lab = lab.to(device)
    # ...
    # propagate the batch through the net, backprop to get the gradient,
    # store the gradient in a temp variable
```
Since each iteration in the above for loop is independent from the others, I would like to parallelize the iterations using the multiprocessing module.
To do so I need some variables to be shared between the processes while others have to be private to each process, namely:
- the temp variable where I store the gradients should be shared between the processes (see the sketch right after this list);
- on the other hand, I don't want the gradient computed in one process to interact in any way with the one of a different process (after all, I performed the micro batching exactly to keep the different gradient contributions separate).
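In code, the shared temp variable could be, for instance, a single block from the stdlib multiprocessing.shared_memory with one gradient row per class, so that each process writes only its own row and the contributions never touch. A sketch with made-up sizes and names:

```python
import numpy as np
from multiprocessing import shared_memory

n_classes, n_params = 10, 100_000  # made-up sizes
# one shared float64 block, one gradient row per class
shm = shared_memory.SharedMemory(create=True, size=n_classes * n_params * 8)
grad_buf = np.ndarray((n_classes, n_params), dtype=np.float64, buffer=shm.buf)
grad_buf[:] = 0.0

# inside a worker, attach to the same block by its name and fill one row only:
#   shm = shared_memory.SharedMemory(name=shm_name)
#   buf = np.ndarray((n_classes, n_params), dtype=np.float64, buffer=shm.buf)
#   buf[my_class, :] = flat_grad  # this class's contribution, nothing else
```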
I was therefore planning the following approach:
- spawn different processes at each step, each one taking care of a single iteration of the above for loop; this should automatically create an instance of the variables inside the block in each process (so also of the network and the computed gradients);
- use multiprocessing.shared_memory to store the variables shared by the different processes (e.g. the temp grad copy).
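To make the plan concrete, here is a minimal sketch of what I have in mind, written with torch.multiprocessing (whose shared-memory tensors play the role I described for multiprocessing.shared_memory); the toy model, the worker function and all sizes are made up:

```python
import torch
import torch.multiprocessing as mp

def worker(model, img, lab, shared_grads, key):
    # p.grad is None when the worker starts, so backward() creates a
    # gradient local to this process: the class contributions cannot mix
    loss = torch.nn.functional.cross_entropy(model(img), lab)
    loss.backward()
    # copy this class's gradient into its own slot of the shared buffer
    for i, p in enumerate(model.parameters()):
        shared_grads[key][i].copy_(p.grad)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    model = torch.nn.Linear(10, 3)  # toy stand-in for the real net
    n_classes = 3
    # one shared gradient slot per class, living in shared memory
    shared_grads = {k: [torch.zeros_like(p).share_memory_()
                        for p in model.parameters()]
                    for k in range(n_classes)}
    procs = []
    for k in range(n_classes):
        img = torch.randn(8, 10)   # stand-in for next(ClassesIterables[k])
        lab = torch.full((8,), k)  # a batch made only of class k
        proc = mp.Process(target=worker, args=(model, img, lab, shared_grads, k))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
    # back in the main process: combine the per-class gradients and update
    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            p -= 0.01 * sum(shared_grads[k][i] for k in range(n_classes))
```

Each process writes only into its own slot of shared_grads, so the shared buffer covers the first requirement while the backward passes stay independent.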
My questions are:
- is this a good approach to reach my goal? Is there an easier way?
- are there important caveats I should keep in mind when following this route?
- could branching the main process into many processes, each one processing a small batch, create a problem in terms of required memory (as each process works on its own copy of the model/gradients)?
I found this question:
Gradient disconnected after Multiprocessing pool(starmap)
which partially solved my doubts. In particular, from what I read there:
calling torch.set_grad_enabled(False) before branching the processes will automatically create an independent gradient for each of them; if instead a grad tensor is already attached to the net, it will be shared between the processes. Is that right?
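If that is right, I guess a minimal check would look like this (toy net and made-up names): attach a grad tensor to each parameter before branching, let the child overwrite it, and see whether the main process observes the change:

```python
import torch
import torch.multiprocessing as mp

def child(model):
    # overwrite the gradients from inside the child process
    for p in model.parameters():
        p.grad.fill_(1.0)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    model = torch.nn.Linear(4, 2)  # toy stand-in for the real net
    for p in model.parameters():
        p.grad = torch.zeros_like(p)  # grad tensors exist *before* branching
    model.share_memory()  # move parameters (and their grads) to shared memory
    proc = mp.Process(target=child, args=(model,))
    proc.start()
    proc.join()
    # nonzero sums here would mean the grad tensors were indeed shared
    print([p.grad.sum().item() for p in model.parameters()])
```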