During the first training epoch, the program was “killed” after 957 of 2354 batches, with no other message.

The DataLoader was using the default num_workers=0.
The code ran fine on a smaller dataset.
I’m training on a GPU.
I call optimizer.zero_grad() at the beginning of each batch inside the DataLoader loop.
Is this a GPU memory issue?
Do I need to call torch.cuda.empty_cache() after each batch?

Here are the versions I am using.

# Name                    Version                   Build  Channel
pytorch                   1.2.0           py3.6_cuda9.2.148_cudnn7.6.2_0    pytorch
torchvision               0.4.0                 py36_cu92    pytorch

Please let me know if you have any suggestions. Thank you.

Do you run out of RAM during training? This message usually comes from the system running out of memory and starting to kill processes (or from you running kill -9 PID).
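
One rough way to check this is to log the host memory of the training process every few batches and see whether it keeps growing. The sketch below uses only the Python standard library; `peak_rss_mb` and `log_memory` are hypothetical helper names, and the loop is a stand-in for the real DataLoader loop, not the original training code.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(batch_idx, every=100):
    """Print peak host memory every `every` batches.

    Returns the logged value in MiB, or None when nothing was logged.
    """
    if batch_idx % every != 0:
        return None
    mb = peak_rss_mb()
    print(f"batch {batch_idx}: peak RSS {mb:.1f} MiB")
    return mb


if __name__ == "__main__":
    held = []
    for batch_idx in range(300):  # stand-in for the DataLoader loop
        # Simulates memory that is kept alive across batches.
        held.append(bytearray(100_000))
        log_memory(batch_idx)
```

If the logged value climbs steadily until the point where the process dies, host RAM (not GPU memory) is the more likely culprit. Note that this reports the peak, not the current usage, so it never decreases.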

I don’t think that’s the case: the machine has 1 TB of RAM in total, so it is unlikely to run out. I didn’t kill any process myself either.

Are there any limits on how much RAM a process owned by your user is allowed to use on your machine (from cgroups on Linux or similar mechanisms)?

No. I am the only user and there is no memory limit for a process.

How do you check for the absence of limits on processes?

Does this mean there is no limit on the maximum memory?
$ ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4127735
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4127735
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
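
Note that `ulimit -a` only lists per-process resource limits; memory limits imposed through cgroups do not show up in it. A minimal sketch for checking those as well, assuming the usual mount points for cgroup v2 and v1 (the paths can differ per system, a process in a nested cgroup may have its limit in a subdirectory named in /proc/self/cgroup, and `parse_limit` and `cgroup_memory_limit` are hypothetical helper names):

```python
from pathlib import Path

# Common locations of the memory limit file (assumed, may differ per system):
# cgroup v2 first, then cgroup v1.
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_limit(raw):
    """Return the limit in bytes, or None if the value means 'unlimited'."""
    raw = raw.strip()
    if raw == "max":  # cgroup v2 spelling of 'no limit'
        return None
    value = int(raw)
    # cgroup v1 reports a huge number (close to 2**63) when no limit is set.
    return None if value >= 2**60 else value


def cgroup_memory_limit(files=CGROUP_LIMIT_FILES):
    """Return the first readable cgroup memory limit in bytes.

    Returns None if the limit is 'unlimited' or if no limit file was found.
    """
    for path in files:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


if __name__ == "__main__":
    limit = cgroup_memory_limit()
    if limit is None:
        print("no cgroup memory limit found")
    else:
        print(f"cgroup memory limit: {limit / 2**30:.1f} GiB")
```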

You can try following the advice here to pinpoint the root cause.


Hi, I have the same problem. Have you solved it yet?

Hello, I am getting the same problem in epoch 2:
RuntimeError: DataLoader worker (pid 26810) is killed by signal: Killed.
Does anyone have suggestions?