Multi-processing and GPU -> GPU tensor slicing race. Expected corner case or bug?

I’m doing something a little out of the ordinary using multi-GPU + multi-process training and ran into what appears to be a race. Uncertain if it’s a bug or an expected corner case that’s just not supported.

So, text summary:
dataset -> loader -> model J (GPU 0) -> tensor y_ja (GPU 0) -> tensor y_jb (GPU 1) -> MP queue -> model K (GPU 1) -> tensor y_k (GPU 1)

I’ve got a model J with its input, params, and output on GPU 0. It runs in a process started from the main process using torch.multiprocessing. The output tensors from this model are copied into a larger, zero-initialized tensor (1…n times larger) on GPU 1 before being placed in a Queue. The data is assigned using slicing ops (i.e. y_jb[idx_o:idx_o+step] = y_ja[idx_i:idx_i+step]).
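Roughly, the producer side looks like this (a sketch only; the names, shapes, and offsets are illustrative, not my exact code):

```python
import torch
import torch.multiprocessing as mp  # queue below is an mp.Queue; process started via mp (spawn)

def producer(model_j, loader, queue, n, step):
    for batch in loader:
        # Model J lives on GPU 0; its output y_ja stays on GPU 0.
        y_ja = model_j(batch.to("cuda:0"))

        # Larger zero-initialized buffer on GPU 1 (1..n times the output size).
        y_jb = torch.zeros(n * y_ja.shape[0], *y_ja.shape[1:], device="cuda:1")

        # Slice assignment initiates a GPU 0 -> GPU 1 copy.
        idx_i, idx_o = 0, 0
        y_jb[idx_o:idx_o + step] = y_ja[idx_i:idx_i + step]

        # Tensor handle goes on the multiprocessing queue for the main process.
        queue.put(y_jb)
```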

So, essentially, right before being placed on the MP Queue, a copy of the data from GPU 0 -> GPU 1 is initiated. The main process dequeues and runs the data through model K, whose input, params, and output are on GPU 1.

In this scenario, I’m seeing what looks like a race condition. Once the data is fetched from the Queue in the main process, it has varying blocks of 0 that line up with the slice assignments in the source process. However, it changes from batch to batch, and the frequency also varies with the number of debug prints or other code I place in the source process (enqueue side). My hunch is that the GPU -> GPU transfer isn’t necessarily complete before the descriptors, etc. are placed in the queue and received on the other end.

If I place the output from model J into a CPU tensor (GPU 0 -> CPU) -> enqueue -> (CPU -> GPU 1) -> model K, everything is okay. Also, if I remove the multi-processing and do the GPU 0 -> GPU 1 slice assignment in the same process with no queue, everything is okay.
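Concretely, the two arrangements that behave correctly look roughly like this (again a sketch, continuing the illustrative names from the snippet above):

```python
# (a) Stage through a CPU tensor around the queue: .cpu() blocks until the
#     GPU 0 -> CPU copy finishes, so the consumer sees complete data.
y_cpu = y_ja.cpu()
queue.put(y_cpu)          # consumer later does y_cpu.to("cuda:1") before model K

# (b) Single process, no queue: the GPU 0 -> GPU 1 slice assignment and the
#     model K forward are issued from the same process, so stream ordering applies.
y_jb[idx_o:idx_o + step] = y_ja[idx_i:idx_i + step]
y_k = model_k(y_jb)
```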

You could try inserting torch.cuda.synchronize() right before the enqueue.
CUDA calls are asynchronous by nature, but as long as all of the work stays in GPU calls, ordering is preserved by CUDA streams. Those stream-ordering guarantees don’t extend across processes (each process has its own CUDA context and streams).
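
Something along these lines on the enqueue side (a sketch reusing the illustrative names from the post above):

```python
# Issue the cross-device slice copy as before.
y_jb[idx_o:idx_o + step] = y_ja[idx_i:idx_i + step]

# Block until the queued GPU work (including the peer copy) has finished,
# so the tensor handed across the process boundary is fully populated.
# You can also scope it, e.g. torch.cuda.synchronize("cuda:1").
torch.cuda.synchronize()

queue.put(y_jb)
```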

Thanks, a synchronize before the enqueue appears to do the trick.