I’m doing something a little out of the ordinary with multi-GPU + multi-process training and ran into what appears to be a race. I’m not sure whether it’s a bug or an expected corner case that just isn’t supported.
So, text summary:
dataset -> loader -> model J (GPU 0) -> tensor y_ja (GPU 0) -> tensor y_jb (GPU 1) -> MP queue -> model K (GPU 1) -> tensor y_k (GPU 1)
I’ve got a model J with input, params, and output on GPU 0. It’s running in a process spawned from the main process using torch.multiprocessing. The output tensors from this model are copied into a larger, zero-initialized tensor (1…n times larger) on GPU 1 before being placed in a Queue. The data is assigned using slicing ops (i.e. y_jb[idx_o:idx_o+step] = y_ja[idx_i:idx_i+step]).
So, essentially, right before being placed on the MP Queue, a copy of the data from GPU 0 -> GPU 1 is initiated. The main process dequeues the data and runs it through model K, which has its input, params, and output on GPU 1.
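For reference, here’s a minimal sketch of the shape of the setup (the model definitions, sizes, and names here are made up for illustration; the real code feeds model J from a dataset/loader):

```python
import torch
import torch.multiprocessing as mp

def producer(queue, n):
    # Hypothetical stand-in for model J, living entirely on GPU 0.
    model_j = torch.nn.Linear(64, 64).to("cuda:0")
    with torch.no_grad():
        for _ in range(10):
            x = torch.randn(8, 64, device="cuda:0")
            y_ja = model_j(x)  # output on GPU 0
            # Larger, zero-initialized destination tensor on GPU 1.
            y_jb = torch.zeros(8 * n, 64, device="cuda:1")
            for i in range(n):
                # Cross-device slice assignment: GPU 0 -> GPU 1 copy.
                y_jb[i * 8:(i + 1) * 8] = y_ja
            queue.put(y_jb)  # CUDA tensor handle is shared via IPC

def main():
    mp.set_start_method("spawn", force=True)
    queue = mp.Queue()
    p = mp.Process(target=producer, args=(queue, 4))
    p.start()
    # Hypothetical stand-in for model K, living entirely on GPU 1.
    model_k = torch.nn.Linear(64, 64).to("cuda:1")
    with torch.no_grad():
        for _ in range(10):
            y_jb = queue.get()     # this is where I sometimes see blocks of zeros
            y_k = model_k(y_jb)
    p.join()

if __name__ == "__main__":
    main()
```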
In this scenario, I’m seeing what looks like a race condition. Once the data is fetched from the Queue in the main process, it has varying blocks of zeros that line up with the slice assignments in the source process. However, the pattern changes from batch to batch, and the frequency also varies with the number of debug prints or other code I place in the source (enqueue-side) process. My hunch is that the GPU -> GPU transfer isn’t necessarily complete before the tensor descriptors, etc. are placed in the queue and received on the other end.
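One way I could imagine testing that hunch is to force the copies to finish on the producer side before enqueueing. A rough sketch, reusing the y_jb/queue names from above (I’m not certain which device’s stream the peer copy is queued on, so this synchronizes both):

```python
import torch

# After the slice assignments into y_jb on GPU 1, and before queue.put():

# Option 1: block the host until all outstanding work on both devices is done.
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

# Option 2: record an event on GPU 1's current stream and wait for it.
done = torch.cuda.Event()
done.record(torch.cuda.current_stream("cuda:1"))
done.synchronize()

queue.put(y_jb)
```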
If I instead place the output from model J into a CPU tensor (GPU 0 -> CPU) -> enqueue -> (CPU -> GPU 1) -> model K, everything is okay. Also, if I remove the multiprocessing and do the GPU 0 -> 1 slice assignment in the same process with no queue, everything is okay.
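For completeness, the CPU-staged path that works looks roughly like this (again with the made-up names from the sketch above):

```python
# Producer side: GPU 0 -> CPU before enqueueing.
y_cpu = y_ja.to("cpu")    # blocking D2H copy, so it has finished before put()
queue.put(y_cpu)          # CPU tensor is moved into shared memory by torch.multiprocessing

# Main process: CPU -> GPU 1, then model K.
y_jb = queue.get().to("cuda:1")
y_k = model_k(y_jb)
```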