Why is PyTorch getting killed during training on larger datasets on AWS EC2 instances?

I’m kind of new to training models, so sorry if this is a blatantly bad question. We are training a semantic segmentation model (PIDNet) on AWS EC2 instances using PyTorch. Our default parameter values are num_workers=8 and batch_size=8. When our dataset exceeds 10,000 images, PyTorch gets killed with these parameters, without any error message. The AWS instance also shuts down, so I have to reboot it.

Training works with num_workers=4 and batch_size=4 when our datasets contain 10,000-14,000 images. However, when a dataset exceeds 14,000 images, the Python script gets killed again, with no errors or warnings whatsoever.

We have plenty of storage, so that should not be the issue. GPU memory is also sufficient, as I check with the nvidia-smi -l 10 command. What could be the issue?
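For context, we use the standard torch.utils.data.DataLoader. Below is a minimal sketch of the setup; the dataset class and tensor shapes are placeholders rather than our real pipeline:

import torch
from torch.utils.data import DataLoader, Dataset

class DummySegmentationDataset(Dataset):
    """Placeholder dataset: returns one fake image/mask pair per index."""
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        image = torch.zeros(3, 1024, 1024)                # assumed input size
        mask = torch.zeros(1024, 1024, dtype=torch.long)  # assumed label map
        return image, mask

if __name__ == "__main__":
    loader = DataLoader(
        DummySegmentationDataset(10_000),
        batch_size=8,    # the default that starts failing past ~10,000 images
        num_workers=8,   # each worker process prefetches batches via shared memory
        shuffle=True,
        pin_memory=True,
    )
    images, masks = next(iter(loader))  # pull one batch, just to exercise the workers
    print(images.shape, masks.shape)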

What about RAM?

How large is the image dataset?

Are you using Docker? If so, check shm_size in Advanced configuration | GitLab Docs.

After checking, it gets killed when the RAM is full.

Yes, I am using Docker. After checking, it gets killed when the system RAM is full.
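Since the process disappears without a Python traceback once RAM fills up, it looks like the kernel OOM killer. For anyone hitting the same thing, this is roughly how I watched host RAM next to nvidia-smi; the psutil-based script below is only an illustration, not part of our training code:

import time
import psutil

def log_memory(interval_s=10):
    """Print used/total system RAM every interval_s seconds, similar to nvidia-smi -l."""
    while True:
        mem = psutil.virtual_memory()
        print(f"RAM used: {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB ({mem.percent:.0f}%)")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_memory()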

Did you set the shm_size variable in the runner?

[runners.docker]
    image = "ruby:3.1"
    gpus = "all"
    shm_size = 2073741824   # shared memory size in bytes (~2 GB)
...

It needs a pretty high value.
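As a rough back-of-the-envelope check of what the DataLoader workers can need at once, multiply workers × prefetched batches × batch size × bytes per sample. The shapes and dtypes below are assumptions, so plug in your own:

num_workers = 8
prefetch_factor = 2                # DataLoader default: batches each worker keeps ready
batch_size = 8

image_bytes = 3 * 1024 * 1024 * 4  # assumed 3x1024x1024 float32 image
mask_bytes = 1024 * 1024 * 8       # assumed 1024x1024 int64 mask
sample_bytes = image_bytes + mask_bytes

total = num_workers * prefetch_factor * batch_size * sample_bytes
print(f"~{total / 1e9:.1f} GB of prefetched batches")  # ~2.7 GB with these numbers

With shapes like these it already lands around 2-3 GB, which is why Docker's default 64 MB of shared memory is nowhere near enough for multi-worker loading.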

Yes, I also encountered this when trying to run GitHub - lllyasviel/FramePack: Lets make video diffusion practical!

Well, I could not find a solution other than buying more RAM :sweat_smile: