I noticed in the source that DataParallel is implemented with threading.
I replied to my question in https://github.com/pytorch/pytorch/issues/3917
Does the Global Interpreter Lock (GIL) make DataParallel slower?
I modified my code to use multiprocessing and got a 4x speedup with 4 GPUs. I wonder whether the threading is what makes DataParallel slower.
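
To make the question concrete, here is a minimal sketch of the effect I suspect, using only the standard library and no PyTorch. The names `busy_python_work` and `run_parallel` are hypothetical helpers made up for this illustration; the pure-Python loop is only a stand-in for whatever Python-side work each replica does. GPU kernels themselves normally run without holding the GIL, so this models only the interpreter overhead, not the real DataParallel code path.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def busy_python_work(n):
    """Pure-Python loop that holds the GIL while it runs
    (assumed stand-in for per-replica Python overhead)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_parallel(executor_cls, n_workers, n):
    """Run busy_python_work once per worker with the given executor class.

    Returns (results, elapsed_seconds).
    """
    start = time.perf_counter()
    with executor_cls(max_workers=n_workers) as pool:
        results = list(pool.map(busy_python_work, [n] * n_workers))
    return results, time.perf_counter() - start
```

Calling `run_parallel(ThreadPoolExecutor, 4, 2_000_000)` and `run_parallel(ProcessPoolExecutor, 4, 2_000_000)` from inside an `if __name__ == "__main__":` guard and comparing the elapsed times should show whether threads are being serialized on your machine. On a standard CPython build with the GIL I would expect the threaded run to take roughly as long as doing the work sequentially, but the exact numbers depend on the interpreter and hardware.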