[W CudaIPCTypes.cpp:22] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

I got this warning at the end of each epoch when using multiple GPUs:
[W CudaIPCTypes.cpp:22] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
But it doesn’t seem to affect training, since the results are just as good.
I would still like to know what the probable cause is and how to solve it :grinning:

I simply use model = torch.nn.DataParallel(model) to enable multiple GPUs.
Besides that, I also added torch.multiprocessing.set_start_method('spawn', force=True) in my code; I don’t know whether it has any effect on this.
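Roughly, the relevant part of my setup looks like this (the model here is just a placeholder for my real one):

import torch
import torch.nn as nn

# placeholder model; my actual model is omitted here
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# force the spawn start method for multiprocessing (not sure it matters)
torch.multiprocessing.set_start_method('spawn', force=True)

# wrap the model so batches are split across all visible GPUs
model = nn.DataParallel(model).cuda()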

Thank you for the answer in advance.

Are you seeing the same message if you remove the multiprocessing code (I assume this is triggering it)?

Sorry for my late response, and thank you for the reply!
I pinned down the trigger of this warning: it happens when I set num_workers > 0 in torch.utils.data.DataLoader, rather than being caused by nn.DataParallel().
Any idea how to fix it? Could it be because I set torch.multiprocessing.set_start_method('spawn', force=True)?

I have run into the same problem when setting num_workers to more than 0. @ptrblck, any solution?

Unfortunately, I don’t have any suggestions, as I don’t fully understand the use case of using torch.multiprocessing together with multiple workers (which will themselves use multiple processes), so could you explain the use case a bit more?

Hi @ptrblck, I came across the same issue and was intrigued by your remark:

For one, isn’t it good to be able to load/process the data in a separate process, apart from the training process(es), so as not to tax them with extra processing load? Further, although it’s not my use case, loading the data might involve heavy preprocessing, warranting even more than one DataLoader worker per training process, e.g. in some vision applications.

Maybe you have a completely different idea about this; could you give us your insight into this topic?

Thanks in advance.

I totally agree with you and also think that multiple processes are useful by themselves, should be used for data loading and processing, and would yield a speedup in the overall training pipeline.
This can be easily achieved with the num_workers argument of the DataLoader.
My comment might not have been clear enough: I was wondering about the use case of using the multiprocessing package manually in the training script in addition to multiple workers (which will again spawn multiple processes, so you would end up with a “nested” multiprocessing workload), as I would imagine that the simpler approach of setting num_workers>=1 should already do the job.
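In other words, I would expect something along these lines to already cover it, without touching torch.multiprocessing manually (the dataset here is just a placeholder):

import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset standing in for the real one
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# the DataLoader creates and manages the worker processes for you
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for data, target in loader:
    data, target = data.cuda(), target.cuda()
    # forward/backward pass, optimizer step, ...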

Ah, ok, sorry, I didn’t notice the part about using the multiprocessing package separately. That is probably because I had the same issue just using DDP and DataLoaders with num_workers > 0. I get these errors just before my script exits; I am already a good citizen and use dist.barrier() to wait for all training processes to complete before exiting.
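The shutdown of each DDP process roughly looks like this (a simplified sketch; the training code is omitted):

import torch.distributed as dist

# ... per-process training loop runs here ...

# wait until every training process has finished before exiting
dist.barrier()
# standard cleanup of the process group
dist.destroy_process_group()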

When I set num_workers to zero, the script exits smoothly without any errors. In the past I have often used multiple DataLoader workers in a DDP context and never had this issue. I’m wondering whether this is something that sneaked into a recent PyTorch release or is related to a recent CUDA version.

My setup:

NVIDIA-SMI 470.86    Driver Version: 470.86    CUDA Version: 11.4
$ pip show torch
Name: torch
Version: 1.10.0

This is on a 4x V100 machine with Ubuntu.

It might also be something in my script, of course.

Ok, I dove a bit more into this and found the issue. In this particular instance I am using a model library where the author had unfortunately added the batch collate function as a method of the model class. Because of this, the model instance (self) is also transported to the DataLoader workers.

When the training script exits, the DataLoader worker processes are killed too, without properly releasing the tensors that were copied to the workers.

I refactored all the batch collating code out of the model class, making it independent of the model instance. This resolved the issue; exiting the script is very smooth now!

So, if anyone gets this error, it could be because you are copying the model over to the DataLoader workers without knowing it.
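To make that concrete, the pattern was roughly like this (heavily simplified; not the actual library code):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 8))  # placeholder dataset

# Before: the collate function is a method, so `self` (the whole model,
# including any CUDA tensors it owns) travels to every DataLoader worker
# along with the bound method.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)

    def collate(self, samples):
        return torch.stack([s[0] for s in samples])

model = MyModel().cuda()
bad_loader = DataLoader(dataset, batch_size=4, num_workers=2,
                        collate_fn=model.collate)

# After: a free function with no reference to the model, so only the dataset
# (and this function) is copied to the workers.
def collate(samples):
    return torch.stack([s[0] for s in samples])

good_loader = DataLoader(dataset, batch_size=4, num_workers=2,
                         collate_fn=collate)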


Hi @visionscaper, I’m encountering the same issue. I’m currently using PyTorch Lightning to write my model, and when I set num_workers > 0 in the DataLoader, the warning occurs. I’m a beginner in PyTorch. Could you give me an example of how you refactored your batch collating code out of the model class? I don’t know how to do it.

Hi @sjtuhl, if you have a similar issue to mine, your model is somehow linked to your dataset, collate_fn, or another object given to a DataLoader. Because of this, when you set num_workers > 0, the model is copied to the DataLoader workers. For instance, if your dataset is an attribute of your model instance (which is just bad design), not only is your dataset copied to the workers, but everything it is “attached” to as well.
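A generic sketch of what I mean (not your code, just an illustration):

from torch.utils.data import Dataset

# Anti-pattern: the dataset keeps a reference back to the model, so when the
# DataLoader workers receive the dataset, the model (and any CUDA tensors it
# owns) comes along with it.
class LinkedDataset(Dataset):
    def __init__(self, data, model):
        self.data = data
        self.model = model  # this back-reference is the problem

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Better: the dataset only holds the (CPU) data it actually needs.
class PlainDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]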

I hope this makes sense to you. I can’t really help you without inspecting your code.

Hi @visionscaper, I have the same issue as you. My collate_fn is written inside the dataset class, and there are some tensors my collate_fn needs that are stored there. I wonder if there is a way to avoid moving the collate_fn out and still escape the error?
:hugs:

This warning is caused by tensors (or other objects) on the CUDA device that are not properly cleaned up (released) before the process they belong to is terminated.

As it was in my case, and seems to be the case for almost everyone else, the culprit behind this warning is moving objects to, or owning objects on, the CUDA device from inside one of the DataLoader worker processes, with the worker process then being terminated before those objects are released from CUDA.

Moving tensors to CUDA inside a collate_fn that is passed to a DataLoader with num_workers > 0 - which just means that a child process will be created for each worker - is an example of this, and is exactly what I was doing. Then, in my training loop, after all the training samples had been iterated through, my DataLoader worker processes would be terminated and I would get this warning.

As tempting as it is to move batches to CUDA inside the collate_fn, and though it may even provide some performance gains, the potential for a memory leak or some other resource management issue, along with the annoying warning, made me rethink my strategy. I ended up relocating this logic to the first line of my training loop (in the main process, keeping the collation on the CPU), where I move each batch to CUDA before the forward pass - which I think is generally recommended anyway - and I have never looked back since.
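To make this concrete, the kind of change I mean looks roughly like this (the dataset and names are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda')
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))  # placeholder

# the collate_fn stays on the CPU, so no CUDA tensors are created in the workers
def collate(samples):
    data = torch.stack([s[0] for s in samples])
    target = torch.stack([s[1] for s in samples])
    return data, target

loader = DataLoader(dataset, batch_size=32, num_workers=4, collate_fn=collate)

for data, target in loader:
    # first line of the training loop: move the batch to CUDA in the main process
    data, target = data.to(device), target.to(device)
    # forward pass, loss, backward, optimizer step ...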

Alternatively, if you really wanted to, you could keep objects on the CUDA device inside your DataLoader worker processes by setting persistent_workers=True in the DataLoader constructor, and you would not encounter this warning until the DataLoader is entirely deallocated. You may run into other issues, though, such as running out of GPU memory, but for some use cases this could be beneficial.
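For completeness, that option looks like this (reusing the names from the snippet above):

# workers stay alive between epochs, so whatever they hold on the GPU is not
# torn down until the DataLoader itself is deallocated
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    collate_fn=collate, persistent_workers=True)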

Hope this helps.
