Dataloader num_workers>1 cuda initialization error 3

Hello everyone,

I have implemented a C++ preprocessor with pybind11. The C++ preprocessor is imported by a customized dataset, and one of the preprocessor's member functions is then called in `__getitem__` to generate features/labels for training. It works fine.

Based on this preprocessor, I hoped to re-implement part of the feature/label generation with CUDA kernels, copy the result from GPU to CPU, and then pass it to the Python side. Now "CUDA initialization error 3" shows up (pointing at `cudaMemcpy`/`cudaMemset`/... calls) when I set num_workers > 0. It works fine when num_workers = 0.

Any idea for solving it?

OS: Ubuntu 18.04
PyTorch: 1.11.0
pybind11: 2.10.0
GPU: 1x RTX 3090
CUDA: 11.3


Since you are trying to re-initialize a CUDA context, you would need to use the "spawn" start method as described here.
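In DataLoader terms, the start method can be switched for a single loader via the `multiprocessing_context` argument, or globally via `torch.multiprocessing.set_start_method`. A minimal sketch of the per-loader option (the dataset here is a trivial stand-in, not the pybind11-backed one):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    # Stand-in for the real pybind11-backed dataset
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # In the real dataset this would call the C++/CUDA preprocessor;
        # with "spawn" workers, each worker process gets a fresh CUDA context.
        return torch.tensor(idx)

dataset = ToyDataset()

# Option 1: use "spawn" only for this loader's workers
loader = DataLoader(dataset, batch_size=4, num_workers=2,
                    multiprocessing_context="spawn")

# Option 2 (alternative): set the start method globally, once, at startup:
#   torch.multiprocessing.set_start_method("spawn")
```

Note that the workers are only started when you iterate over the loader, so the start method must be chosen before the first iteration.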


hello @ptrblck
Thanks for the reply.

I am a little bit confused. The DataLoader's num_workers parameter seems to indicate using multiple threads to fetch batches of data. Does using multiprocessing (mp.spawn) mean using processes instead of threads? I.e., using 64 processes (64 DataLoader instances with num_workers=1) instead of 64 threads (1 DataLoader instance with num_workers=64), so that the CUDA kernels (bound via pybind11) would work inside each DataLoader instance?

num_workers>=1 will spawn a new process for each worker, so that each of them can load and process a batch of samples in the background while the main process is busy with the model training.
Using multi-threading is often not a good idea in Python, as you would hit the Global Interpreter Lock and might not be able to preload data in the background.
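The process-not-thread point can be checked with plain `multiprocessing`: each worker reports a PID different from the parent's. This sketch uses the "fork" start method explicitly, which is the Linux default for DataLoader workers and the reason the child inherits the parent's (now-invalid) CUDA state:

```python
import multiprocessing as mp
import os

def report_pid(queue):
    # Runs inside the worker process, analogous to a DataLoader worker
    queue.put(os.getpid())

ctx = mp.get_context("fork")  # Linux default; child inherits the parent's memory (and CUDA state)
queue = ctx.Queue()
workers = [ctx.Process(target=report_pid, args=(queue,)) for _ in range(2)]
for w in workers:
    w.start()
worker_pids = {queue.get() for _ in workers}
for w in workers:
    w.join()

# Two distinct worker processes, neither of which is the parent
assert len(worker_pids) == 2
assert os.getpid() not in worker_pids
```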

Thanks for helping figure out what is happening @ptrblck

I tried DDP with mp.spawn and it works. Basically, I followed the tutorial here and the error disappears.

It is still unclear to me why torch.multiprocessing works. I ran several experiments:

  1. No mp.spawn, one GPU, num_workers=1: the dataset is initialized once, and the error occurs.
  2. DDP + mp.spawn, two GPUs, num_workers=1: the dataset is initialized 4 times, and there is no error.

The key seems to be whether the dataset class is initialized for each process (two main processes + two sub-processes for the two GPUs?). Note: CUDA memory is initialized inside the C++/CUDA preprocessor.
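Since the CUDA memory is initialized inside the preprocessor, one common workaround that keeps the default "fork" workers usable is to not touch CUDA in the dataset's `__init__` at all, and instead create the preprocessor lazily on first use, so the CUDA context is created inside the worker process rather than inherited from the parent. A sketch, with a hypothetical `make_preprocessor` factory standing in for the pybind11 constructor:

```python
class LazyPreprocDataset:
    def __init__(self, make_preprocessor):
        # Do NOT create the CUDA-backed preprocessor here: __init__ runs in the
        # parent process, and a forked worker would inherit a broken CUDA context.
        self._make_preprocessor = make_preprocessor
        self._preproc = None

    def _get_preprocessor(self):
        if self._preproc is None:
            # First call happens inside the worker process, so the CUDA
            # context is created there.
            self._preproc = self._make_preprocessor()
        return self._preproc

    def __getitem__(self, idx):
        return self._get_preprocessor()(idx)

    def __len__(self):
        return 4

# Stand-in for the real pybind11 constructor (hypothetical)
calls = []
def make_preprocessor():
    calls.append("init")          # would initialize CUDA memory in the real code
    return lambda idx: idx * 10   # would run CUDA kernels in the real code

ds = LazyPreprocDataset(make_preprocessor)
assert calls == []        # nothing initialized yet in the "parent"
assert ds[2] == 20        # first access triggers initialization
assert calls == ["init"]  # initialized exactly once
```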

I dug a little into the DataLoader's source code and found that the dataset instance is passed directly to the DataLoader as a parameter; I did not find any function calling the dataset's __init__().

My questions are:

  1. Is the dataset class initialized outside the DataLoader class in mp.spawn mode?
  2. Does "re-initialize a CUDA context", mentioned in the first comment, refer to the CUDA memory initialization on the C++ side?
  3. How does mp.spawn initialize the dataset for each worker, while without mp.spawn the dataset class is initialized only once?
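On question 3: with the default "fork" start method the workers inherit the already-constructed dataset object; with "spawn", the dataset is pickled in the parent and unpickled in each worker, so `__init__` does not run again, but `__getstate__`/`__setstate__` do. That unpickling step is where per-worker state (such as a CUDA handle) can be re-created. A small stand-alone illustration (`FakePreprocessor` is a made-up stand-in, not part of the original code):

```python
import pickle

class FakePreprocessor:
    init_calls = 0

    def __init__(self):
        type(self).init_calls += 1
        self.handle = "cuda-context-in-parent"  # stand-in for real GPU state

    def __getstate__(self):
        # GPU handles cannot cross process boundaries; drop them when pickling
        state = self.__dict__.copy()
        state["handle"] = None
        return state

    def __setstate__(self, state):
        # Runs in the worker after unpickling, instead of __init__
        self.__dict__.update(state)
        self.handle = "cuda-context-in-worker"

parent = FakePreprocessor()
worker_copy = pickle.loads(pickle.dumps(parent))  # what "spawn" does per worker

assert FakePreprocessor.init_calls == 1                # __init__ ran only once
assert worker_copy.handle == "cuda-context-in-worker"  # __setstate__ rebuilt the state
```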

Many thanks for your time.