Asynchronous computation on CPU while training on CUDA?

Hi, I am trying to train a relatively large model on the GPU, and at the end of each epoch I do some memory-intensive computation which cannot be done on the GPU. Currently I move the model to the CPU at the end of each epoch, do the computation, and then move it back to the GPU. But the conversion and computation take a relatively long time, and there is really no need to wait for the computation to finish before training the next epoch. So I am wondering whether it is possible to copy the model to the CPU and let it run the computation asynchronously, without interrupting the training on the GPU?
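
Here is a minimal sketch of what I am doing now (`heavy_cpu_computation` and the `nn.Linear` model stand in for my actual computation and architecture):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(1024, 1024).to(device)  # stand-in for my real model

def heavy_cpu_computation(m):
    # stand-in for the memory-intensive per-epoch computation
    ...

for epoch in range(3):
    # ... training steps on the GPU ...
    model = model.cpu()            # synchronizes and moves weights to host
    heavy_cpu_computation(model)   # training is stalled until this returns
    model = model.to(device)       # move back before the next epoch
```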

Thank you!

If the CPU computation uses the model, you would have to wait for the synchronizing to('cpu') operation, since you would have to push the model to the CPU and back to the GPU again.
On the other hand, GPU operations are already executed asynchronously, so if the CPU operation has no dependency on the model etc., it would already be executed while the GPU is busy with the training.
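
As a toy illustration (a minimal sketch, not your training loop): the matmuls below are only queued on the GPU, so independent CPU work placed after the loop would overlap with them:

```python
import time
import torch

device = torch.device("cuda")
a = torch.randn(4096, 4096, device=device)

start = time.time()
for _ in range(100):
    b = a @ a  # kernels are queued asynchronously; each call returns immediately
print(f"after launch: {time.time() - start:.3f}s")  # small: GPU still working

# independent CPU work here would overlap with the queued GPU kernels

torch.cuda.synchronize()  # block until the GPU has actually finished
print(f"after sync:   {time.time() - start:.3f}s")  # includes the GPU runtime
```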

But the “main” thread is still busy in this case, i.e. it won’t multiplex the CPU and GPU “command streams”. And I don’t think the blocking points for GPU synchronization are “cooperative” with CPU ops (which are themselves blocking). So, I believe, an additional thread or process is still needed for “off-model” CPU computations.

@ptrblck @googlebot Thank you both for your comments! Is it possible to make a deep CPU copy of the model at the end of each epoch, and let the original model keep training on the GPU while this CPU copy does the other computation at the same time?

I don’t see any obstacles to that, as long as you clone the model tensors cleanly. I’d try something like model2.load_state_dict({k: v.detach().cpu().clone() for k, v in model.state_dict().items()}) (clone() may be unnecessary, as cpu() already copies when the tensor is on the GPU).
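
Something like this (a rough sketch; nn.Linear stands in for the real architecture, and model2 is a second instance of it kept on the CPU):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(1024, 1024).to(device)  # stand-in for the real model

# create the CPU replica once (same architecture)
model2 = nn.Linear(1024, 1024)

# at the end of each epoch: snapshot the GPU weights onto the CPU copy
model2.load_state_dict(
    {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
)
# `model` stays on the GPU and can keep training; `model2` is independent
```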


Right, the main thread is busy with the CPU computation, so GPU work won’t proceed on its own. Do you have an idea of what might be a good way to start a separate thread for the CPU computation (so there is no need to wait for it while the GPU is doing its computation)?

At this point it is not related to PyTorch; you should just use any convenient mechanism for concurrent execution, plus some re-synchronization point in the main thread if necessary. I’d suggest Python’s threading if your jobs take at most a few seconds; for longer jobs (minutes), it may be worth launching a subprocess. If you use the JIT, you may also check torch.jit.fork.
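
For example, a rough sketch with threading (heavy_cpu_computation and the nn.Linear models are placeholders for the actual work):

```python
import threading
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(1024, 1024).to(device)   # stand-in for the real model
model2 = nn.Linear(1024, 1024)             # CPU replica of the same architecture

def heavy_cpu_computation(m):
    # placeholder for the memory-intensive per-epoch work
    ...

worker = None
for epoch in range(3):
    # ... train `model` on the GPU for one epoch ...

    if worker is not None:
        worker.join()  # re-synchronization point: wait for last epoch's CPU job
    model2.load_state_dict(
        {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    )
    worker = threading.Thread(target=heavy_cpu_computation, args=(model2,))
    worker.start()     # runs concurrently with the next epoch's GPU work

if worker is not None:
    worker.join()      # don't exit before the final CPU job finishes
```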