I noticed that in the source of DataParallel, it is implemented with threading:
I also raised this question in https://github.com/pytorch/pytorch/issues/3917
Is it the Global Interpreter Lock that makes DataParallel slower?
I modified my code to use multiprocessing instead and got a 4x speedup with 4 GPUs. I wonder: is the threading what makes DataParallel slower?
I guess it will depend on the size of the net you are using.
If it is very small, then you will spend most of the time executing Python code, so threading will slow you down.
Otherwise, most of the time will be spent running kernels on the GPU anyway, so there is no need for multiple processes. In that case threads are used, as they are much cheaper to create.
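The GIL effect described here can be seen with a pure-Python sketch (no PyTorch involved; this assumes a standard CPython build, where the GIL serializes pure-Python bytecode, so two threads running a CPU-bound loop take about as long as running it twice sequentially):

```python
import threading
import time

def busy(n):
    # Pure-Python loop: it holds the GIL while running, so two threads
    # executing it cannot make progress at the same time.
    s = 0
    for i in range(n):
        s += i
    return s

N = 2_000_000

# Run the workload twice sequentially.
t0 = time.perf_counter()
busy(N)
busy(N)
serial = time.perf_counter() - t0

# Run the same two workloads on two threads.
t0 = time.perf_counter()
threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - t0

# On a standard CPython build the threaded time is roughly the same as
# the sequential time, not half of it.
print(f"serial: {serial:.2f}s, threaded: {threaded:.2f}s")
```

When the forward pass is dominated by GPU kernels instead of Python bytecode, the GIL is released while the kernels run, which is why threading is usually cheap enough for DataParallel.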
I tested with CondenseNet:
One batch of 32 images (224x224) takes almost 80ms.
Is CondenseNet too small for this? What do you think?
Because most of the time is spent in CUDA kernels, which don’t hold the GIL anyway, and transmitting CUDA tensors across processes is very tricky.
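For reference, a minimal sketch of the threaded API under discussion (assuming a recent PyTorch; the tiny `nn.Linear` model is just a placeholder, and on a CPU-only machine `nn.DataParallel` simply falls back to a plain forward pass):

```python
import torch
import torch.nn as nn

# Placeholder model; in the thread above this would be CondenseNet.
model = nn.Linear(10, 2)

# nn.DataParallel replicates the module across the visible GPUs and runs
# one thread per replica; with no GPUs it just calls the module directly.
dp_model = nn.DataParallel(model)

out = dp_model(torch.randn(4, 10))
print(out.shape)  # torch.Size([4, 2])
```

If the multiprocessing route is really needed, `torch.nn.parallel.DistributedDataParallel` with one process per GPU is the supported way to avoid the GIL entirely, at the cost of more setup.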