Training stops after first epoch

I am working on the Compute Canada platform. Initially I had PyTorch 1.10 with CUDA 11.4 installed in the torch environment, and I was able to run the RetinaNet model for 200 epochs on a dataset of 465 training images. However, when I then ran the model for another 50 epochs I got a CUDA OOM error on the same cloud cluster. The tech team asked me to switch to PyTorch 1.9 and CUDA 10.1. The CUDA OOM error went away, but now training does not go beyond the first epoch and no error message is displayed; the job ends in a timeout because it never moves past the first epoch. I am using the following resources:
no. of GPUs: 2
no. of workers: 6
memory: 64G
no. of nodes: 1
total images: 465 (batch size reduced to 4)
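For reference, those numbers correspond to a DataLoader/model configuration along the lines of the sketch below. This is a simplified stand-in, not my exact script: the dummy dataset, the num_classes value, and the DataParallel wrapper are placeholders for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision.models.detection import retinanet_resnet50_fpn


class DummyDetectionDataset(Dataset):
    """Placeholder standing in for the real 465-image dataset
    (assumes a torchvision-style detection dataset returning (image, target) pairs)."""

    def __len__(self):
        return 465

    def __getitem__(self, idx):
        image = torch.rand(3, 512, 512)
        target = {
            "boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]]),
            "labels": torch.tensor([1], dtype=torch.int64),
        }
        return image, target


def collate_fn(batch):
    # Detection targets are variable-length dicts, so keep images/targets as tuples
    return tuple(zip(*batch))


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# num_classes is a placeholder; DataParallel is just one way to use the 2 GPUs on the node
model = retinanet_resnet50_fpn(pretrained=False, num_classes=2)
model = torch.nn.DataParallel(model)
model.to(device)

loader = DataLoader(
    DummyDetectionDataset(),
    batch_size=4,      # reduced batch size
    num_workers=6,     # 6 DataLoader workers
    shuffle=True,
    collate_fn=collate_fn,
)
```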

I have been stuck here for a long time and have not found any solution. Any help will be highly appreciated.

Thanks in advance.

@Eshta could you please share the code and environment (PyTorch version / libraries)? Also, why are you using 6 workers with 2 GPUs?