Training a classification model on 8 K80 GPUs always stops after epoch 1, with no exception logged

Hey guys, I tried to fine-tune a BERT classification model on 1.06 million examples, but training always gets stuck after it finishes the first epoch, with no exception logged. Can anybody see what the problem is, or suggest any debugging ideas? Thanks a lot!

Could you remove the data loading and check whether your training routine still gets stuck when training on random data?
If that’s not the case, could you iterate over the DataLoader alone, without the model training, and see whether this causes the issue?
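
To try the second check, something like the sketch below might help. It only walks over a loader for a couple of epochs with no model involved and prints progress, so a stall at the epoch boundary becomes visible. `iterate_epochs` and `fake_loader` are hypothetical names made up for this example; in your script you would pass your real DataLoader instead of the stand-in list.

```python
import time


def iterate_epochs(loader, num_epochs=2, log_every=1000):
    """Iterate over `loader` with no model and return the batch count per epoch."""
    counts = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        n_batches = 0
        for n_batches, _batch in enumerate(loader, start=1):
            # Periodic progress output shows the last batch reached before a hang.
            if n_batches % log_every == 0:
                print(f"epoch {epoch}: {n_batches} batches", flush=True)
        elapsed = time.perf_counter() - start
        print(f"epoch {epoch} done: {n_batches} batches in {elapsed:.2f}s", flush=True)
        counts.append(n_batches)
    return counts


if __name__ == "__main__":
    # Stand-in for a real DataLoader (assumption): any re-iterable of batches works.
    fake_loader = [list(range(4)) for _ in range(10)]
    iterate_epochs(fake_loader, num_epochs=2, log_every=5)
```

If this loop finishes both epochs cleanly with the real DataLoader, the data pipeline alone is probably not the cause, and the random-data training run is the next thing to look at.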