Traceback (most recent call last):
File "process.py", line 73, in <module>
eval('../amazon/densenet169_new',128,256,2)
File "/home/zhenghuabin/jianglibin/amazon/eval.py", line 49, in eval
for step, (data, target) in enumerate(validate_loader):
File "/home/zhenghuabin/anaconda3/envs/py35/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 206, in __next__
idx, batch = self.data_queue.get()
File "/home/zhenghuabin/anaconda3/envs/py35/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
File "/home/zhenghuabin/anaconda3/envs/py35/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
fd = df.detach()
File "/home/zhenghuabin/anaconda3/envs/py35/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/zhenghuabin/anaconda3/envs/py35/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
return recvfds(s, 1)[0]
File "/home/zhenghuabin/anaconda3/envs/py35/lib/python3.5/multiprocessing/reduction.py", line 160, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
Here is my code:
dset_validate = AmazonDateset_validate(validate_index)
validate_loader = DataLoader(dset_validate, batch_size=batch_size, num_workers=4)
for step, (data, target) in enumerate(validate_loader):
It always fails at the same point (after finishing some particular epoch). How can I solve it?
This problem only happens on one particular machine: the same code works fine on my other servers. I can get rid of the error by setting num_workers=0, but that makes my training too slow.
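For reference, the workaround is just the same DataLoader call as above with the worker processes disabled:

# Workaround: load batches in the main process instead of worker subprocesses.
# The error goes away, but loading becomes much slower.
validate_loader = DataLoader(dset_validate, batch_size=batch_size, num_workers=0)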
Most of the solutions here deal with the symptoms, but not the cause.
This link explains the cause (a memory leak when storing too much data coming from a DataLoader) and how to solve it.
I copied the code from the aforementioned link:
from copy import deepcopy
from torch.utils.data import DataLoader

pred_list = []
target_list = []
# long version
for inputs, targets in DataLoader(dataset, num_workers=6, batch_size=64):
    pred_list.append(model.predict_on_batch(inputs))  # make model prediction
    targets_copy = deepcopy(targets)  # copy the tensor so the worker's shared storage can be released
    target_list.append(targets_copy)
    del inputs
    del targets
With larger test sets, not using multiprocessing (num_workers=0) can be a lot slower.
Btorb’s solution is the better one: the essential part is to deepcopy the targets coming out of the DataLoader, so no reference to the original worker tensors is kept.
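Applied to the evaluation loop from the question, a minimal sketch of the same idea (assuming model, dset_validate and batch_size are defined as in the question; the variable names and the detach/cpu calls are only illustrative):

from copy import deepcopy
from torch.utils.data import DataLoader

validate_loader = DataLoader(dset_validate, batch_size=batch_size, num_workers=4)

pred_list = []
target_list = []
for step, (data, target) in enumerate(validate_loader):
    output = model(data)                     # forward pass on the batch
    pred_list.append(output.detach().cpu())  # detach so the graph (and the input batch) is not kept alive
    target_list.append(deepcopy(target))     # copy instead of keeping the worker's shared tensor
    del data, target                         # drop the references to the original batch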