PyTorch 0.4.1 | DataLoader worker is killed by signal

Jiahang_WANG · August 2, 2019, 1:14pm

While training the model, everything is fine for the first several epochs. But after running for several more epochs, a DataLoader worker is killed. The error shows ‘DataLoader worker is killed by signal: Bus error’.

However, ‘nvidia-smi -l’ shows that there is definitely enough GPU memory.

Besides, I need to do a lot of data processing, including opening several images, resizing them, and embedding some data. The error often appears right after validation finishes. I don’t know the real cause and would really appreciate your help.

Thanks in advance.

ptrblck · August 2, 2019, 1:43pm

This error might be thrown if you don’t have enough shared memory while using multiple workers.
Are you using a Docker container? If so, pass --ipc=host when launching it.
Otherwise, try to increase the shared memory on your system.
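
To see how much shared memory is actually available, one option is to inspect the filesystem that backs it. The following is only a minimal sketch: it assumes a Linux system where shared memory is mounted at /dev/shm, and the helper name `shm_usage_mib` is made up for illustration.

```python
import os
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (total, used, free) for the filesystem at `path`, in MiB.

    Assumption: on Linux, /dev/shm is the tmpfs mount that backs
    POSIX shared memory. Other systems may use a different location.
    """
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.used // mib, usage.free // mib


if __name__ == "__main__":
    # Only report when the assumed mount point exists on this machine.
    if os.path.isdir("/dev/shm"):
        total, used, free = shm_usage_mib()
        print(f"/dev/shm: total={total} MiB, used={used} MiB, free={free} MiB")
    else:
        print("/dev/shm not found; shared memory may be mounted elsewhere")
```

If the free value shrinks steadily while training runs and is close to zero when the bus error appears, that would be consistent with the shared-memory explanation above. Running the check before and after validation may help narrow down where the usage grows.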

Jiahang_WANG · August 2, 2019, 2:49pm

Thanks for your reply. I’m not using Docker. Would calling torch.cuda.empty_cache after validation help?

Besides, the current memory usage is as follows:

              total        used        free      shared  buff/cache   available
Mem:         257662       41849      166048       44896       49764      169666
Swap:         16383        3492       12891
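
For the shared-memory question, the `shared` column is the figure of interest in this listing. As a rough sketch, the numbers can be parsed and the shared fraction computed as below. The output is assumed to come from the Linux `free` command, the units are not stated in the post, and the helper name `parse_free` is hypothetical.

```python
# Memory listing copied from the post above (units not stated there).
FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:         257662       41849      166048       44896       49764      169666
Swap:         16383        3492       12891
"""


def parse_free(text):
    """Parse `free`-style output into {row_label: {column: value}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    table = {}
    for ln in lines[1:]:
        label, *values = ln.split()
        # zip stops at the shorter sequence, so the Swap row
        # only receives the total, used, and free columns.
        table[label.rstrip(":")] = dict(zip(header, map(int, values)))
    return table


mem = parse_free(FREE_OUTPUT)["Mem"]
shared_fraction = mem["shared"] / mem["total"]
print(f"shared: {mem['shared']} of {mem['total']} ({shared_fraction:.1%})")
```

A single snapshot like this does not show whether shared usage is growing. Taking the same reading at several points during training would be more informative.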