Dataloader num_workers>1 cuda initialization error 3

Hello everyone,

I have implemented a C++ preprocessor with pybind11. The C++ preprocessor is imported by a customized dataset, and one of the preprocessor's member functions is then called in `__getitem__` to generate features/labels for training. It works fine.

Based on this preprocessor, I hoped to re-implement part of the feature/label generation with CUDA kernels, copy the result from GPU to CPU, and then pass it to the Python side. Now "CUDA initialization error 3" shows up (pointing at `cudaMemcpy`/`cudaMemset`/... calls) when I set num_workers > 0. It works fine when num_workers = 0.

Any idea for solving it?

OS: Ubuntu 18.04
PyTorch: 1.11.0
pybind11: 2.10.0
GPU: 1x RTX 3090
CUDA: 11.3


Since you are trying to re-initialize a CUDA context, you would need to use the "spawn" start method as described here.
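In DataLoader terms, the start method can be switched for a single loader via the `multiprocessing_context` argument, or globally via `torch.multiprocessing.set_start_method`. A minimal sketch of the per-loader option (the dataset here is a trivial stand-in, not the pybind11-backed one):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    # Stand-in for the real pybind11-backed dataset
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # In the real dataset this would call the C++/CUDA preprocessor;
        # with "spawn" workers, each worker process gets a fresh CUDA context.
        return torch.tensor(idx)

dataset = ToyDataset()

# Option 1: use "spawn" only for this loader's workers
loader = DataLoader(dataset, batch_size=4, num_workers=2,
                    multiprocessing_context="spawn")

# Option 2 (alternative): set the start method globally, once, at startup:
#   torch.multiprocessing.set_start_method("spawn")
```

Note that the workers are only started when you iterate over the loader, so the start method must be chosen before the first iteration.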


hello @ptrblck
Thanks for the reply.

I am a little bit confused. The DataLoader's num_workers parameter seems to indicate using multiple threads to fetch batches of data. Does using multiprocessing (mp.spawn) mean using processes instead of threads? I.e., using 64 processes (64 DataLoader instances with num_workers=1) instead of 64 threads (1 DataLoader instance with num_workers=64), so that the CUDA kernels (bound via pybind11) would work inside each DataLoader instance?

num_workers>=1 will spawn a new process for each worker, so that each of them can load and process a batch of samples in the background while the main process is busy with the model training.
Using multi-threading is often not a good idea in Python, as you would hit the Global Interpreter Lock and might not be able to preload data in the background.
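The process-not-thread point can be checked with plain `multiprocessing`: each worker reports a PID different from the parent's. This sketch uses the "fork" start method explicitly, which is the Linux default for DataLoader workers and the reason the child inherits the parent's (now-invalid) CUDA state:

```python
import multiprocessing as mp
import os

def report_pid(queue):
    # Runs inside the worker process, analogous to a DataLoader worker
    queue.put(os.getpid())

ctx = mp.get_context("fork")  # Linux default; child inherits the parent's memory (and CUDA state)
queue = ctx.Queue()
workers = [ctx.Process(target=report_pid, args=(queue,)) for _ in range(2)]
for w in workers:
    w.start()
worker_pids = {queue.get() for _ in workers}
for w in workers:
    w.join()

# Two distinct worker processes, neither of which is the parent
assert len(worker_pids) == 2
assert os.getpid() not in worker_pids
```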

Thanks for helping figure out what is happening @ptrblck

I tried DDP with mp.spawn and it works. Basically, I followed the tutorial here and the error disappears.

It is still unclear to me why torch.multiprocessing works. I ran several experiments:

  1. No mp.spawn, one GPU, num_workers=1: the dataset is initialized once, and the error occurs.
  2. DDP + mp.spawn, two GPUs, num_workers=1: the dataset is initialized 4 times, and there is no error.

The key seems to be whether the dataset class is initialized for each process (two main processes + two sub-processes for the two GPUs?). Note: CUDA memory is initialized inside the C++/CUDA preprocessor.
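Since the CUDA memory is initialized inside the preprocessor, one common workaround that keeps the default "fork" workers usable is to not touch CUDA in the dataset's `__init__` at all, and instead create the preprocessor lazily on first use, so the CUDA context is created inside the worker process rather than inherited from the parent. A sketch, with a hypothetical `make_preprocessor` factory standing in for the pybind11 constructor:

```python
class LazyPreprocDataset:
    def __init__(self, make_preprocessor):
        # Do NOT create the CUDA-backed preprocessor here: __init__ runs in the
        # parent process, and a forked worker would inherit a broken CUDA context.
        self._make_preprocessor = make_preprocessor
        self._preproc = None

    def _get_preprocessor(self):
        if self._preproc is None:
            # First call happens inside the worker process, so the CUDA
            # context is created there.
            self._preproc = self._make_preprocessor()
        return self._preproc

    def __getitem__(self, idx):
        return self._get_preprocessor()(idx)

    def __len__(self):
        return 4

# Stand-in for the real pybind11 constructor (hypothetical)
calls = []
def make_preprocessor():
    calls.append("init")          # would initialize CUDA memory in the real code
    return lambda idx: idx * 10   # would run CUDA kernels in the real code

ds = LazyPreprocDataset(make_preprocessor)
assert calls == []        # nothing initialized yet in the "parent"
assert ds[2] == 20        # first access triggers initialization
assert calls == ["init"]  # initialized exactly once
```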

I dug a little into the DataLoader's source code and found that the dataset instance is passed directly to the DataLoader as a parameter; I did not find any function calling the dataset's __init__().

My questions are:

  1. Is the dataset class initialized outside the DataLoader class in mp.spawn mode?
  2. Does "re-initialize a CUDA context", mentioned in the first comment, refer to the CUDA memory initialization on the C++ side?
  3. How does mp.spawn initialize the dataset for each worker, while without mp.spawn the dataset class is initialized only once?
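On question 3: with the default "fork" start method the workers inherit the already-constructed dataset object; with "spawn", the dataset is pickled in the parent and unpickled in each worker, so `__init__` does not run again, but `__getstate__`/`__setstate__` do. That unpickling step is where per-worker state (such as a CUDA handle) can be re-created. A small stand-alone illustration (`FakePreprocessor` is a made-up stand-in, not part of the original code):

```python
import pickle

class FakePreprocessor:
    init_calls = 0

    def __init__(self):
        type(self).init_calls += 1
        self.handle = "cuda-context-in-parent"  # stand-in for real GPU state

    def __getstate__(self):
        # GPU handles cannot cross process boundaries; drop them when pickling
        state = self.__dict__.copy()
        state["handle"] = None
        return state

    def __setstate__(self, state):
        # Runs in the worker after unpickling, instead of __init__
        self.__dict__.update(state)
        self.handle = "cuda-context-in-worker"

parent = FakePreprocessor()
worker_copy = pickle.loads(pickle.dumps(parent))  # what "spawn" does per worker

assert FakePreprocessor.init_calls == 1                # __init__ ran only once
assert worker_copy.handle == "cuda-context-in-worker"  # __setstate__ rebuilt the state
```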

Many thanks for your time.