I’m kind of new to training models, so sorry if this is a blatantly bad question. We are training a semantic segmentation model (PIDNet) on AWS EC2 instances using PyTorch. Our default parameter values are `num_workers=8` and `batch_size=8`. When our dataset exceeds 10,000 images with these parameters, the PyTorch process gets killed without any error message. The AWS instance also shuts down, so I have to reboot it.
Training works with `num_workers=4` and `batch_size=4` when the dataset contains 10,000-14,000 images. However, once the dataset exceeds 14,000 images, the Python script gets killed again, with no errors or warnings whatsoever.
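For context, this is roughly how the loader is set up. The dataset, tensor shapes, and `pin_memory` setting below are illustrative stand-ins, not our actual pipeline; only the `num_workers`/`batch_size` values are the ones we vary:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset (hypothetical shapes); the real one is 10,000+
# segmentation image/mask pairs loaded from disk.
images = torch.zeros(32, 3, 64, 64)
masks = torch.zeros(32, 64, 64, dtype=torch.long)
dataset = TensorDataset(images, masks)

# Roughly how the loader is configured; we vary num_workers and
# batch_size (8/8 by default, 4/4 for the larger datasets).
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,    # kept small here; 4 or 8 in practice
    shuffle=True,
    pin_memory=True,  # assumption: pinned memory for GPU transfer
)

n_batches = sum(1 for _ in loader)
print(n_batches)  # 32 images / batch_size 4 = 8 batches
```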
We have plenty of storage, so that should not be the issue. GPU memory is also sufficient, which I verify with the `nvidia-smi -l 10` command. What could be the issue?