My training step consists of running a model twice, with some NumPy computation in between.
So, the basic workflow is:
1. Fetch a batch of data samples from a DataLoader that wraps a custom dataset class.
2. Run the model: `output = model(input)`.
3. Post-process each data point in the output; here I convert the tensors to NumPy arrays.
4. Request new samples from the custom dataset instance based on the results of the previous step.
5. Run inference again and backpropagate based on this last inference only.
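For concreteness, the loop currently looks roughly like this (a toy model and a placeholder `resample` function stand in for my real code; the names are illustrative, not my actual API):

```python
import torch

# toy stand-ins for the real model and dataset logic (illustrative only)
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def resample(np_point):
    # placeholder for the heavy per-point computation done by the dataset class
    return torch.from_numpy(np_point * 2)

batch = torch.randn(8, 4)                      # step 1: fetch a batch
with torch.no_grad():
    output = model(batch)                      # step 2: first inference

new_samples = []
for point in output:                           # steps 3-4: sequential per-point loop
    np_point = point.cpu().numpy()             # step 3: tensor -> numpy
    new_samples.append(resample(np_point))     # step 4: request a new sample
new_batch = torch.stack(new_samples)

output2 = model(new_batch)                     # step 5: second inference
loss = output2.pow(2).mean()
optimizer.zero_grad()
loss.backward()                                # backprop on the last inference only
optimizer.step()
```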
Steps 3 and 4 are currently sequential: I loop through the batch one point at a time, which is quite time-consuming.
Is there a clean way to parallelize steps 3 and 4 together?
I could parallelize step 3 on its own, but fetching data again from the custom dataset class (step 4) would still be sequential, and that is the slow part because each data point requires heavy computation.
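One idea I am considering is to fuse steps 3 and 4 into a single pure function per data point and map it over the batch with a pool. A minimal sketch (the fused function is placeholder arithmetic standing in for my real processing and resampling):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_and_resample(np_point):
    # steps 3 + 4 fused into one pure function per data point:
    # numpy post-processing followed by the heavy resampling
    # (placeholder arithmetic standing in for the real computation)
    processed = np_point ** 2
    return processed + 1.0

# stand-in for output.cpu().numpy(): one row per point in the batch
output_np = np.random.rand(8, 4).astype(np.float32)

# NumPy releases the GIL inside heavy array ops, so a thread pool can
# overlap the per-point work across the batch.
with ThreadPoolExecutor(max_workers=4) as pool:
    new_points = list(pool.map(process_and_resample, output_np))

new_batch = np.stack(new_points)
```

If the heavy part is pure-Python and GIL-bound rather than NumPy-heavy, swapping in `ProcessPoolExecutor` should be a near drop-in change, at the cost of pickling each point to the workers.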
So, later I tried another idea: I pushed the first inference (step 2) into the `__getitem__` of the dataset class, hoping that the DataLoader workers could then do steps 1 to 4 inside `__getitem__` in parallel across the batch (a separate re-fetch step is no longer needed).
But the issue is that I cannot keep the model on the GPU anymore. Because of the way multiprocessing works in Python, CUDA tensors cannot be used in a forked subprocess; the interpreter raises a CUDA initialization error. So one way forward is to keep two copies of the model: a CPU copy injected into the dataset for the first inference, and a GPU copy for the second inference and backpropagation. But that means the CPU and GPU models would need to be synchronized after every backpropagation step. Am I thinking about this the right way?
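If I do go the two-copies route, I believe the sync step itself would just be copying the trained weights back into the CPU model after each optimizer step, along these lines (toy model; the real question is whether this design is sound, not the copy itself):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# trained copy (on GPU when available) and the CPU copy that would be
# injected into the dataset's __getitem__ for the first inference
train_model = torch.nn.Linear(4, 2).to(device)
cpu_model = torch.nn.Linear(4, 2)          # stays on CPU, inference only
cpu_model.eval()

optimizer = torch.optim.SGD(train_model.parameters(), lr=0.1)

# ... one training step on the GPU copy ...
loss = train_model(torch.randn(8, 4, device=device)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# sync: copy the updated weights back into the CPU model
cpu_model.load_state_dict(
    {k: v.cpu() for k, v in train_model.state_dict().items()}
)
```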
Can anyone suggest something in this regard? Any hints or ideas would be appreciated.
I have also tried forcing the Python multiprocessing library to use 'spawn' instead of 'fork'/'forkserver' so that the CUDA context error is avoided, but then a large amount of data has to be serialized and deserialized for each worker, which is not feasible in my case.
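For reference, the way I request spawn workers is per-loader via `multiprocessing_context` rather than a global `set_start_method` call; the serialization cost comes from the dataset object being pickled to every spawned worker, so the dataset below is deliberately kept small (toy code, only showing the configuration):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class HeavyDataset(Dataset):
    """Stand-in for my real dataset. With the 'spawn' start method,
    whatever state this object holds is pickled to every worker,
    which is where the serialization cost comes from."""
    def __init__(self):
        self.table = torch.arange(16.0)   # keep worker-side state small

    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return self.table[idx]

# 'spawn' workers start a fresh interpreter, so they never inherit the
# parent's CUDA context -- but the dataset must survive pickling.
loader = DataLoader(
    HeavyDataset(),
    batch_size=4,
    num_workers=2,
    multiprocessing_context="spawn",
)
```

Large tensors could instead be placed in shared memory (`tensor.share_memory_()`) or loaded lazily inside each worker, so they are not part of the pickled payload.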