Training process is killed during the first epoch

lin4mation · September 18, 2019, 3:35pm

Hi,

During the training of the first epoch, the program got “killed” after 957/2354 batches and there was no other message.

The dataloader was using the default num_workers=0.
The code ran fine on a smaller dataset.
I’m using a GPU.
I do call optimizer.zero_grad() at the beginning of each batch within the dataloader loop.
Is this a GPU memory issue?
Do I need to call torch.cuda.empty_cache() after each batch?

Here are the versions I am using.

# Name                    Version                   Build  Channel
pytorch                   1.2.0           py3.6_cuda9.2.148_cudnn7.6.2_0    pytorch
torchvision               0.4.0                 py36_cu92    pytorch

Please let me know if you have any suggestions. Thank you.

albanD · September 18, 2019, 3:43pm

Hi,

Do you run out of RAM during your training? This message usually comes from the system running out of memory and starting to kill processes (or from you running kill -9 PID).
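
One way to test the out-of-memory hypothesis is to log the peak resident memory of the training process every few batches and see whether it keeps climbing before the kill. Below is a minimal sketch using only the Python standard library. The helper names `peak_rss_mib` and `log_memory` and the logging interval are hypothetical and not part of the original training script.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(batch_idx: int, every: int = 100) -> None:
    """Print the peak RSS once every `every` batches (hypothetical helper)."""
    if batch_idx % every == 0:
        print(f"batch {batch_idx}: peak RSS {peak_rss_mib():.1f} MiB", flush=True)
```

Calling `log_memory(batch_idx)` inside the dataloader loop should show whether host memory grows steadily with the batch index. Note that this only measures the main process. With `num_workers=0`, as in the question, data loading happens in that same process, so it is covered.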

lin4mation · September 18, 2019, 3:53pm

Hi,
I don’t think that’s the case: the machine has 1 TB of RAM in total, so it’s unlikely to run out. I didn’t kill any process myself either.

albanD · September 18, 2019, 3:57pm

Are there any limits on how much RAM a process owned by your user is allowed to use on your machine (from cgroups on Linux or similar mechanisms)?

lin4mation · September 18, 2019, 3:59pm

No. I am the only user and there is no memory limit for a process.

albanD · September 18, 2019, 5:41pm

How do you check for the absence of limits on processes?

lin4mation · September 18, 2019, 6:13pm

Does this mean there is no limit on the maximum memory?
$ ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4127735
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4127735
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
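
The memory-related rows of this output can also be read from inside the training process itself. `ulimit -a` reports per-process resource limits (rlimits) only, so "unlimited" here does not by itself rule out a cgroup limit or the kernel killing the process when the whole machine runs low on memory. A minimal sketch, assuming a Unix system where Python's `resource` module is available; the names `MEMORY_LIMITS`, `describe` and `memory_rlimits` are made up for illustration.

```python
import resource

# Labels as printed by `ulimit -a`, mapped to the matching rlimit constants.
MEMORY_LIMITS = {
    "data seg size (-d)": resource.RLIMIT_DATA,
    "max memory size (-m)": resource.RLIMIT_RSS,
    "virtual memory (-v)": resource.RLIMIT_AS,
}


def describe(value: int) -> str:
    """Render an rlimit value; the resource module reports limits in bytes."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def memory_rlimits() -> dict:
    """Soft and hard memory-related rlimits of the current process."""
    report = {}
    for label, which in MEMORY_LIMITS.items():
        soft, hard = resource.getrlimit(which)
        report[label] = (describe(soft), describe(hard))
    return report


if __name__ == "__main__":
    for label, (soft, hard) in memory_rlimits().items():
        print(f"{label}: soft={soft}, hard={hard}")
```

Running this at the start of the training script confirms that the limits seen in the shell are the ones the Python process actually inherits.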

albanD · September 18, 2019, 7:14pm

You can try following the advice here to pinpoint the root cause.

imaginist · July 29, 2020, 1:30am

Hi, I got the same problem as you. Have you solved this issue now?

Saida2020 · December 1, 2021, 8:41am

Hello, I am getting the same problem in epoch 2:
RuntimeError: DataLoader worker (pid 26810) is killed by signal: Killed.
Does anyone have suggestions?