Why is PyTorch getting killed during training on larger datasets on AWS EC2 instances?

I’m kind of new to training models, so sorry if this is a blatantly bad question. We are training a semantic segmentation model (PIDNet) on AWS EC2 instances using PyTorch. Our default parameter values are num_workers=8 and batch_size=8. When our dataset exceeds 10,000 images, PyTorch gets killed with these parameters, without any error message. The AWS instance also shuts down, so I have to reboot it.

Training works with num_workers=4 and batch_size=4 when our datasets contain 10,000-14,000 images. However, when a dataset exceeds 14,000 images, the Python script gets killed again, with no errors or warnings whatsoever.

We have plenty of storage, so that should not be the issue. GPU memory is also sufficient, as I check with the nvidia-smi -l 10 command. What could be the issue?
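For context, we use the standard torch.utils.data.DataLoader. Below is a minimal sketch of the setup; the dataset class and tensor shapes are placeholders rather than our real pipeline:

import torch
from torch.utils.data import DataLoader, Dataset

class DummySegmentationDataset(Dataset):
    """Placeholder dataset: returns one fake image/mask pair per index."""
    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        image = torch.zeros(3, 1024, 1024)                # assumed input size
        mask = torch.zeros(1024, 1024, dtype=torch.long)  # assumed label map
        return image, mask

if __name__ == "__main__":
    loader = DataLoader(
        DummySegmentationDataset(10_000),
        batch_size=8,    # the default that starts failing past ~10,000 images
        num_workers=8,   # each worker process prefetches batches via shared memory
        shuffle=True,
        pin_memory=True,
    )
    images, masks = next(iter(loader))  # pull one batch, just to exercise the workers
    print(images.shape, masks.shape)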

What about RAM?

How large is the image dataset?

Are you using Docker? If so, check shm_size in Advanced configuration | GitLab Docs.

After checking, it gets killed when the RAM is full.

Yes, I am using Docker. After checking, it gets killed when the system RAM is full.
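Since the process disappears without a Python traceback once RAM fills up, it looks like the kernel OOM killer. For anyone hitting the same thing, this is roughly how I watched host RAM next to nvidia-smi; the psutil-based script below is only an illustration, not part of our training code:

import time
import psutil

def log_memory(interval_s=10):
    """Print used/total system RAM every interval_s seconds, similar to nvidia-smi -l."""
    while True:
        mem = psutil.virtual_memory()
        print(f"RAM used: {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB ({mem.percent:.0f}%)")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_memory()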

Did you set the shm_size variable in the runner?

[runners.docker]
    image = "ruby:3.1"
    gpus = "all"
    shm_size = 2073741824   # shared memory size in bytes (~2 GB)
...

It needs a pretty high value.
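As a rough back-of-the-envelope check of what the DataLoader workers can need at once, multiply workers × prefetched batches × batch size × bytes per sample. The shapes and dtypes below are assumptions, so plug in your own:

num_workers = 8
prefetch_factor = 2                # DataLoader default: batches each worker keeps ready
batch_size = 8

image_bytes = 3 * 1024 * 1024 * 4  # assumed 3x1024x1024 float32 image
mask_bytes = 1024 * 1024 * 8       # assumed 1024x1024 int64 mask
sample_bytes = image_bytes + mask_bytes

total = num_workers * prefetch_factor * batch_size * sample_bytes
print(f"~{total / 1e9:.1f} GB of prefetched batches")  # ~2.7 GB with these numbers

With shapes like these it already lands around 2-3 GB, which is why Docker's default 64 MB of shared memory is nowhere near enough for multi-worker loading.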

Yes, I also encountered this when trying to run GitHub - lllyasviel/FramePack: Lets make video diffusion practical!

Well, I could not find a solution other than buying more RAM :sweat_smile: