I got this warning at the end of each epoch when using multiple GPUs:
[W CudaIPCTypes.cpp:22] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
It doesn’t seem to affect training, since the results are still good, but I would like to know the probable cause and how to fix it.
I simply use model = torch.nn.DataParallel(model) to enable multi-GPU training.
I also added torch.multiprocessing.set_start_method('spawn', force=True) to my code. I don’t know whether it has any effect on this.
Thank you for the answer in advance.
Are you seeing the same message if you remove the multiprocessing code (I assume this is triggering it)?
Sorry for my late response and thank you for the reply!
I pinned down the trigger of this warning. It happens when I set num_workers > 0 in torch.utils.data.DataLoader; it is not caused by nn.DataParallel().
Any idea how to fix it? Is it probably due to that I set
I have run into the same problem when setting the number of workers to more than 0. @ptrblck, any solution?
Unfortunately, I don’t have suggestions, as I don’t fully understand the use case of using torch.multiprocessing as well as multiple workers (which itself will use multiple processes), so could you explain the use case a bit more?
Hi @ptrblck, I came across the same issue and was intrigued by your remark:
For one, isn’t it good to be able to process/load the data using a separate process, apart from the training process(es), in order to not tax them with extra processing load? Further, although it’s not my use case, loading the data might involve some heavy preprocessing, warranting even more than one data loader worker per training process, e.g. in some vision applications.
Maybe you have a completely different idea about this and can give us your insight into this topic.
Thanks in advance.
I totally agree with you and also think that multiple processes by themselves are useful, should be used for data loading and processing, and would yield a speedup in the overall training pipeline.
This can be easily achieved with the num_workers argument of the
My comment might not have been clear enough, but I was wondering about the use case for using the multiprocessing package manually in the training script as well as multiple workers (which will again spawn multiple processes, so you would end up with a “nested” multiprocessing workload), as I would imagine that the simpler approach of setting num_workers>=1 should already do the job.
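A minimal sketch of that simpler approach, assuming a toy in-memory dataset (the helper name build_loader and the tensor shapes are made up for illustration). The DataLoader starts its own worker processes, so the training script does not need to touch torch.multiprocessing itself:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(num_workers=2):
    """Build a DataLoader over a toy dataset (hypothetical shapes).

    With num_workers > 0 the DataLoader launches and manages the worker
    processes itself; no manual multiprocessing setup is done here.
    """
    features = torch.randn(64, 8)          # 64 samples, 8 features each
    labels = torch.randint(0, 2, (64,))    # binary labels
    dataset = TensorDataset(features, labels)
    return DataLoader(
        dataset,
        batch_size=16,
        shuffle=True,
        num_workers=num_workers,
    )
```

In a training script this would be used as `for features, labels in build_loader(num_workers=2): ...`, ideally under an `if __name__ == "__main__":` guard so that worker start-up also works with the spawn start method.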
Ah, ok, sorry, I didn’t notice the part about using the multiprocessing package separately. I think this is because I had the same issue just using DDP and DataLoaders with num_workers > 0. I get these errors just before my script exits, even though I am already a good citizen and use dist.barrier() to wait for all training processes to complete before exiting.
When I set num_workers to zero, the script exits smoothly without any errors. In the past I have often used multiple DataLoader workers in a DDP context and never had this issue. I’m wondering whether this is something that sneaked into a recent PyTorch release or is related to a recent CUDA version.
| NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4 |
$ pip show torch
This is on a 4x V100 machine with Ubuntu.
It might also be in my script of course.
Ok, I dug a bit more into this and found the issue. In this particular instance I am using a model library where the author unfortunately had added the batch collate function as a method of the model class. Because of this, the model's self is also transported to the DataLoader workers.
When the training script exits, the DataLoader worker processes are killed too, without properly dealing with the tensors that were copied to the workers.
I refactored all the batch collating code out of the model class and thus made it independent of the self of the model instance. This resolved the issue; exiting the script is very smooth now!
So, if anyone gets this error, it could be because you are copying the model over to the DataLoader workers without knowing it.
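A rough sketch of what such a refactor could look like. This is not the code of the library mentioned above; ToyDataset, ToyModel, collate_batch and make_loader are hypothetical names used only for illustration. The point is that the collate function lives at module level and never references a model, so passing it as collate_fn does not drag a model instance along to the workers:

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset of variable-length integer sequences."""

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return torch.tensor(self.sequences[index], dtype=torch.long)


def collate_batch(samples):
    """Module-level collate function: pad sequences to a common length.

    It holds no reference to any model instance.
    """
    return pad_sequence(samples, batch_first=True, padding_value=0)


class ToyModel(nn.Module):
    """Hypothetical model; note that it has no collate method."""

    def __init__(self, vocab_size=10, dim=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):
        # Average the token embeddings of each sequence.
        return self.embedding(token_ids).mean(dim=1)


def make_loader(dataset, num_workers=2):
    # Before the refactor this would have been collate_fn=model.collate,
    # a bound method that carries the whole model with it.
    return DataLoader(
        dataset,
        batch_size=2,
        collate_fn=collate_batch,
        num_workers=num_workers,
    )
```

If the collate logic needs settings that used to live on the model (a padding index, for example), one option is to pass just those values, e.g. via functools.partial, rather than the model itself.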
Hi @visionscaper, I’m encountering the same issue. I’m currently using PyTorch Lightning to write my model. When I set num_workers > 0 in the DataLoader, the warning occurs. I’m a beginner in PyTorch. Could you give me an example of how you refactored your batch collating code out of the model class? I don’t know how to do it.
Hi @sjtuhl, if you have a similar issue to mine, somehow your model is linked to your collate_fn or another object given to a DataLoader. Because of this, when you set num_workers > 0, the model is copied to the DataLoader workers. For instance, if your dataset is a property of your model class instance (which is just bad design), not only is your dataset copied to the workers, but everything it is “attached” to as well.
I hope this makes sense to you. I can’t really help you without inspecting your code.
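One quick way to check for this kind of hidden link, sketched here with plain Python objects rather than real PyTorch classes (FakeModel and bound_instance are hypothetical names). A bound method keeps a reference to its instance in its __self__ attribute, so handing model.collate to a DataLoader hands over the model as well, while a module-level function has no such reference:

```python
class FakeModel:
    """Stand-in for a model that owns large parameters (hypothetical)."""

    def __init__(self):
        self.weights = [0.0] * 1000

    def collate(self, samples):
        # Collate logic written as a method: it is tied to this instance.
        return list(samples)


def collate(samples):
    # The same logic as a module-level function: tied to nothing.
    return list(samples)


def bound_instance(fn):
    """Return the object a bound method is tied to, or None otherwise."""
    return getattr(fn, "__self__", None)


model = FakeModel()

# The bound method carries the model along with it.
print(bound_instance(model.collate) is model)

# The plain function carries no instance at all.
print(bound_instance(collate) is None)
```

Running the same bound_instance check on whatever you pass as collate_fn is a cheap first diagnostic; it will not catch every case, for example a dataset object that itself stores a reference to the model.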