Moving data to GPU in collate_fn fails

Setting cudnn.benchmark (to either True or False) didn’t help.

Setting CUDA_LAUNCH_BLOCKING to 0 or 1 didn’t help either.

Doing the .to call inside the for loop (instead of in the collate function) works, but that’s what I’d like to avoid: I want the worker to have already placed the data on the GPU so that it can be accessed directly.
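For reference, here is a minimal sketch of the variant that works, with the transfer done in the main process rather than in collate_fn. The dataset, batch size, and the helper name iterate_on_device are made up for illustration, and the code falls back to the CPU when no GPU is present. Pinned memory plus non_blocking=True is the usual way to make this copy cheaper, though I have not measured it here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def iterate_on_device(loader, device):
    """Yield batches from `loader`, moved to `device` in the main process.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    for batch in loader:
        # non_blocking=True only has an effect when the source tensor is in
        # pinned memory and the target is a CUDA device.
        yield tuple(t.to(device, non_blocking=True) for t in batch)


use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy data standing in for the real dataset (assumption).
dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))

# The default collate_fn builds CPU tensors; num_workers=0 keeps the sketch
# simple, raise it for real workloads.
loader = DataLoader(dataset, batch_size=4, num_workers=0, pin_memory=use_cuda)

batches = list(iterate_on_device(loader, device))
```

The workers (if any) only ever touch CPU tensors here, so no CUDA context is needed outside the main process.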

Yes, I think the child processes for the batch workers each create a new CUDA context. What should be done is to get the CUDA context of the parent process and initialize the tensor there, but I don’t know how to do that.
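As far as I understand, a CUDA context belongs to a single process, so the closest thing that is commonly suggested is to start the workers with the "spawn" start method, so that each one initializes CUDA cleanly instead of inheriting forked state. Below is a rough sketch of that setup, not a confirmed fix. The names collate_to_device and make_gpu_collating_loader are hypothetical, and the PyTorch docs generally advise against returning CUDA tensors from worker processes.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate


def collate_to_device(samples, device):
    """Collate on the CPU, then move every tensor in the batch to `device`.

    Defined at module level so that spawned workers can pickle it.
    Assumes each sample is a tuple of tensors.
    """
    batch = default_collate(samples)
    return [t.to(device) for t in batch]


def make_gpu_collating_loader(dataset, device, batch_size=4, num_workers=0):
    """Build a DataLoader whose collate_fn moves batches to `device`.

    Hypothetical helper. With num_workers > 0 the workers are started with
    "spawn"; with num_workers == 0 no multiprocessing context may be passed.
    """
    context = "spawn" if num_workers > 0 else None
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        collate_fn=partial(collate_to_device, device=device),
        multiprocessing_context=context,
        # pin_memory stays False: the batches may already live on the GPU.
        pin_memory=False,
    )
```

With num_workers > 0 the loader has to be created and iterated under an `if __name__ == "__main__":` guard, because spawned workers re-import the main module.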

Also, I’ve tried the solutions from here:


However, I got the error:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable