Hi All,
I am trying to train SSD from this public repo on the PASCAL VOC dataset, following the instructions in the repo.
With num_workers = 8 and batch_size = 128 it works fine, but when I increase num_workers to 12 it throws the following error.
I read online that running out of memory might be the cause, but I have plenty of RAM available. Is anyone else facing a similar issue? Can anyone help me solve it?
Loading base network...
Initializing weights...
Loading Dataset...
Training SSD on VOC0712
Traceback (most recent call last):
  File "train.py", line 232, in <module>
    train()
  File "train.py", line 171, in train
    images, targets = next(batch_iterator)
  File "/home/XXXX/.virtualenvs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/home/XXXX/.virtualenvs/pytorch/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 160, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
System configuration: 4 Nvidia 1080 Ti GPUs, 384 GB RAM, 40 CPU cores.
Environment: PyTorch 0.1.12_2, CUDA 8.0.61 with patch 2, cuDNN 6.0
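Since the traceback fails inside recvfds while a worker is passing a file descriptor back (rebuild_storage_fd), one thing that may be worth checking besides RAM is the per-process limit on open files. This is only a guess at the cause, not a confirmed diagnosis. Below is a minimal sketch, assuming Linux and using only Python's standard resource module; the helper names fd_limits and raise_soft_limit and the target value 4096 are made up for illustration.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft open-file limit to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Does nothing if the soft
    limit is already unlimited or already at least `target`.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("open-file limits before: soft=%s hard=%s" % (soft, hard))
    # 4096 is an arbitrary illustrative value, not a recommendation.
    print("soft limit after:", raise_soft_limit(4096))
```

This would have to run at the top of train.py, before the DataLoader workers are started, so that they inherit the new limit; running ulimit -n in the shell before launching the script should have a similar effect. PyTorch also documents torch.multiprocessing.set_sharing_strategy('file_system') as an alternative way of sharing tensors between processes; whether that helps with this version is untested here.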
I have seen this, but couldn’t find a solution there.
TIA