I think the problem is that the batch size has to increase to use multiple GPUs efficiently, but the outputs grow as well.
replicas … batch_size / GPUs * input_size
output_device … batch_size * GPUs * output_size
However, in the current DataParallel implementation the output_device must also hold a replica.
Therefore I tried rewriting DataParallel so that the output_device does not hold a replica.
In my case, this let me increase the batch size until the output_device GPU uses 30GB and each replica GPU uses 20GB.
But I'm not sure this is correct.
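The idea described above can be sketched with the primitives in torch.nn.parallel. This is only a rough illustration, not the actual rewrite: the function name forward_with_separate_output and its arguments are made up, and it assumes the module already lives on the first compute GPU and that the output GPU is not in the list of compute GPUs.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather


def forward_with_separate_output(module, inputs, compute_ids, output_id):
    """Run one forward pass on compute_ids and gather results on output_id.

    Hypothetical helper: assumes `module` is on compute_ids[0] and that
    output_id is a CUDA device index that is not in compute_ids.
    """
    # Copy the module to every compute GPU (DataParallel also does this per call).
    replicas = replicate(module, compute_ids)
    # Split the batch along dim 0, one chunk per compute GPU.
    chunks = scatter(inputs, compute_ids)
    # A small batch may produce fewer chunks than GPUs.
    replicas = replicas[:len(chunks)]
    # Run each replica on its own chunk in parallel threads.
    outputs = parallel_apply(replicas, chunks)
    # Concatenate all outputs on the dedicated output GPU.
    return gather(outputs, output_id)
```

With three GPUs this could be called as forward_with_separate_output(model.cuda(0), x, [0, 1], 2), so that GPU 2 only stores the gathered outputs and no replica.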
I know that latency still needs to be hidden to make the process more efficient.
The best solution would be to change to concurrent processing.
I have been using PyTorch since last week, having switched from TensorFlow 2.
I would like to know the best practice in PyTorch.