I’m getting the error

```
F tensorflow/stream_executor/cuda/cuda_driver.cc:153] Failed setting context: CUDA_ERROR_NOT_INITIALIZED: initialization error
```

when calling torch_geometric’s `DataLoader` (which inherits from `torch.utils.data.DataLoader`) with `num_workers > 0` while running on CUDA (e.g. `CUDA_VISIBLE_DEVICES=0 python script.py`). After that, the script fails with `DataLoader worker (pid(s) 31890) exited unexpectedly`. Any explanation/solution? I read that the most common workaround is `num_workers = 0`, but I want to parallelize the data loading, since reading the dataset is the current bottleneck (my dataset is saved in TFRecords). Thanks
The error seems to be raised by TensorFlow, and I would guess that your use case tries to re-initialize a CUDA context in the forked worker processes and thus might break. Try the `spawn` start method and it might work.
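A minimal sketch of what I mean (`ToyDataset` is a hypothetical stand-in for your TFRecord dataset):

```python
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):  # hypothetical stand-in for the real dataset
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx

if __name__ == "__main__":
    # 'spawn' starts fresh worker processes instead of forking them,
    # so the parent's CUDA context is not inherited by the workers.
    mp.set_start_method('spawn')
    loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2)
    for batch in loader:
        pass  # training step goes here
```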
Hey @ptrblck, thanks for replying!
Do you mean calling
torch.multiprocessing.set_start_method('spawn') before calling the function?
If I do something like this:
```python
if __name__ == "__main__":
    torch.multiprocessing.set_start_method('spawn')
    training_func()
```
It would throw an error:
```
I tensorflow/stream_executor/cuda/cuda_driver.cc:739] failed to allocate 2.2K (2304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
```
The initial error seems to be solved and TF fails to allocate device memory now.
I guess you might want to change TF’s behavior so it allocates memory as needed instead of grabbing all device memory up front, but the TF discussion board might be a better place to ask, as you would find the experts there.
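In TF 2.x this is typically done by enabling memory growth per GPU (a sketch; in TF 1.x the equivalent would be a `tf.ConfigProto` with `gpu_options.allow_growth = True`):

```python
import tensorflow as tf

# Ask TensorFlow to grow its GPU memory pool on demand instead of
# reserving (almost) the whole device up front. This must run before
# any GPU has been initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```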
Ok, I’m trying to sort that issue out now. If the GPU memory is fully occupied because the model takes a lot of memory, can I still use multiprocessing in the DataLoader? Is there any equivalent to `tf.data.Dataset`, where the fetching and processing happen on the CPU?
`DataLoader` uses `Dataset.__getitem__` to fetch each sample, which runs on the CPU by default unless you explicitly use the GPU. PyTorch will not use your GPU(s) by default.
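A minimal sketch (`CPUDataset` is a hypothetical stand-in for your dataset): all loading and decoding in `__getitem__` stays on the CPU, and nothing touches the GPU until you move the batch yourself in the training loop:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CPUDataset(Dataset):  # hypothetical stand-in for the real dataset
    def __init__(self, n=8):
        self.data = torch.arange(n, dtype=torch.float32)  # lives on the CPU

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # decoding / augmentation would happen here, on the CPU
        return self.data[idx] * 2.0

loader = DataLoader(CPUDataset(), batch_size=4)
for batch in loader:
    assert batch.device.type == "cpu"  # batches arrive on the CPU
    # batch = batch.to("cuda")         # move to the GPU only here
```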