I actually found a solution to my problem of the workers freezing at the end of training. I realised that the data partition from which mini-batches were created was imbalanced between workers. Because of this, one worker had more mini-batches than the others and blocked on gradient syncing, waiting for peers that had already finished their epoch, which is a deadlock.
So one needs to ensure every worker runs the same number of mini-batches, e.g. by adjusting the batch size (or the partition sizes) so the data divides evenly across workers.
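A minimal sketch of why the counts diverge and one way to cap them (the sample counts and helper names here are hypothetical, not from any particular framework):

```python
import math

def batches_per_worker(partition_sizes, batch_size):
    """Mini-batches each worker will produce (last batch may be partial)."""
    return [math.ceil(n / batch_size) for n in partition_sizes]

def max_common_batches(partition_sizes, batch_size):
    """Largest batch count every worker can reach. Iterating only this many
    steps per epoch means every gradient all-reduce has all workers present,
    so no worker blocks waiting for peers that already finished."""
    return min(batches_per_worker(partition_sizes, batch_size))

# Hypothetical partition: worker 0 holds 1000 samples, worker 1 holds 1040.
sizes = [1000, 1040]
print(batches_per_worker(sizes, 32))   # [32, 33] -> worker 1 would deadlock
print(max_common_batches(sizes, 32))   # 32: cap both workers at 32 steps
```

Alternatively, pick a batch size that divides each partition evenly, or rebalance the partitions themselves, so no batches have to be dropped.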