I tried the ImageNet example with ResNet152 on 8 GPUs, but it is much slower than fb.resnet.torch (1.5 s vs. 0.8 s per iteration).
The `replicate` step in `DataParallel` could be the bottleneck: it accounts for about half of the forward time. The `Broadcast` function is called once for every parameter and buffer, while in fb.resnet.torch the parameters are flattened first, so the broadcast is called only once.
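To illustrate why flattening helps, here is a toy latency/bandwidth cost model (the constants and tensor counts are illustrative assumptions, not measured values): each broadcast call pays a fixed launch latency plus a per-byte bandwidth cost, so issuing one call per tensor multiplies the latency term by the number of tensors.

```python
# Toy cost model of broadcasting parameters to GPU replicas.
# LATENCY_PER_CALL and BYTES_PER_SECOND are assumed, not measured.
LATENCY_PER_CALL = 50e-6      # fixed overhead per broadcast launch (s)
BYTES_PER_SECOND = 10e9       # assumed interconnect bandwidth (B/s)

def broadcast_cost(tensor_sizes_bytes, flattened):
    """Estimated time to broadcast all tensors to one replica."""
    total_bytes = sum(tensor_sizes_bytes)
    calls = 1 if flattened else len(tensor_sizes_bytes)
    return calls * LATENCY_PER_CALL + total_bytes / BYTES_PER_SECOND

# ResNet-152 has on the order of 900 parameter/buffer tensors, ~240 MB total.
sizes = [240 * 2**20 // 900] * 900
per_tensor = broadcast_cost(sizes, flattened=False)
flat = broadcast_cost(sizes, flattened=True)
# Under these assumptions the per-call latency dominates the
# per-tensor scheme, while the flattened scheme pays it only once.
```

With these (assumed) numbers the per-tensor scheme spends 45 ms on launch latency alone, versus 50 µs for the flattened one, which matches the intuition behind flattening in fb.resnet.torch.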
It’s elegant to implement Broadcast as an Op/Function. I wonder if it is possible to overlap the communication with computation during the forward/backward passes? Or is it necessary to flatten the parameters to improve efficiency?
We’re still working on that. It’s true that `replicate` can add some overhead for larger networks, and that’s why we also have a `DataParallel` module. The overhead of separate broadcasts isn’t very large from what we’ve seen. Nevertheless, if it turns out to be high, we’re going to overlap the transfers with the computation, so the overhead should end up even smaller than if we synced flattened parameters.
We don’t support flattening the parameters. It’s quite complex and bug-prone to do that correctly while maintaining flexibility.
The main problem is that with 8 GPUs we’re a bit limited by Python’s GIL (only one thread can execute Python code at a time). That’s why we’re working on moving more of the logic commonly used in vision networks, as well as some other code, into our C backends, so it can proceed without blocking other threads.
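To see why moving work into C backends helps, here is a small illustration: `time.sleep` releases the GIL, standing in for a C call that does its work outside the interpreter lock, so two threads overlap instead of serializing (pure-Python loops would not).

```python
import threading
import time

# time.sleep releases the GIL, acting as a stand-in for a C backend
# call that runs without holding the interpreter lock.
def c_backend_stub(results, idx):
    time.sleep(0.2)        # "work" done with the GIL released
    results[idx] = idx * 2

results = [None, None]
start = time.perf_counter()
threads = [threading.Thread(target=c_backend_stub, args=(results, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# Because the stub releases the GIL, the two threads overlap:
# elapsed is close to 0.2 s rather than the 0.4 s a serial run would take.
```

If the stub instead spun in a pure-Python loop, the GIL would serialize the threads and the wall time would be roughly the sum, which is the situation being addressed by moving logic into the C backends.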