Training stuck after first epoch


I am using two GPUs and six number_workers to train my model, but it hangs after the first epoch, as shown in the screenshot below. How can I solve this issue? Many thanks.

Maybe you can put a break in your training loop to cut the first epoch short after a few batches, and then check whether the second epoch runs correctly.
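To make that concrete, here is a minimal sketch of the early-break idea. It is not your actual training code: `run_epochs`, `train_step`, and `max_batches` are hypothetical names, and `loader` stands for whatever iterable you loop over in each epoch.

```python
def run_epochs(loader, train_step, num_epochs=2, max_batches=3):
    """Process only a few batches per epoch so the epoch boundary is reached quickly.

    `loader` is any re-iterable object (hypothetical stand-in for your dataloader),
    `train_step` is a hypothetical callable that handles one batch.
    Returns the number of batches processed in each epoch.
    """
    processed = []
    for epoch in range(num_epochs):
        count = 0
        for i, batch in enumerate(loader):
            if i >= max_batches:
                break  # leave the epoch early instead of waiting for all batches
            train_step(batch)
            count += 1
        # flush so the message appears even if the next epoch hangs
        print(f"epoch {epoch} finished after {count} batches", flush=True)
        processed.append(count)
    return processed


if __name__ == "__main__":
    toy_loader = [[1, 2], [3, 4], [5, 6], [7, 8]]  # toy data, not a real dataset
    run_epochs(toy_loader, train_step=lambda batch: sum(batch))
```

If the message for the second epoch never appears, the hang is at the epoch boundary itself and does not depend on how long the first epoch runs.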

In my experience, this looks like a problem in your data processing (dataset, dataloader, or sampler). You can check these parts, or post some demo code that reproduces the problem.
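One way to check the data side in isolation is to iterate the loader a couple of times without touching the model. The sketch below is only an illustration: `time_passes` is a hypothetical helper and `loader` is any re-iterable object. If you are on a PyTorch-style `DataLoader` (an assumption, since the framework was not stated), building it with `num_workers=0` for this check can also help rule out the worker processes.

```python
import time


def time_passes(loader, passes=2):
    """Iterate `loader` several times with no model involved.

    Returns a list of (batch_count, seconds) tuples, one per pass.
    `loader` is a hypothetical stand-in for your dataloader.
    """
    results = []
    for p in range(passes):
        start = time.perf_counter()
        count = 0
        for _ in loader:
            count += 1
        elapsed = time.perf_counter() - start
        # flush so the message appears even if the next pass hangs
        print(f"pass {p}: {count} batches in {elapsed:.3f}s", flush=True)
        results.append((count, elapsed))
    return results


if __name__ == "__main__":
    time_passes([[1, 2], [3, 4], [5, 6]])  # toy data, not a real dataset
```

If the second pass stalls here as well, the problem is in the data pipeline and not in the model or the multi-GPU setup.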


When I step through my code line by line in the debugger, it keeps showing errors raised from files in site-packages.
But when I run the original code, it trains successfully. It's a bit odd, because I only added a new robot environment.

Anyway, it is finally training. Thanks for your help!

Best regards