I am currently using Keras with a simplified interface for the synchronous version of data parallelism.
I would like to know whether PyTorch has a similar feature. Since PyTorch advertises its speed, it really ought to support multi-GPU training on a single node.
Oh, that’s super nice, I’ll have to give it a try later. So basically I just need to wrap my model in torch.nn.DataParallel and it’s good to go? Neat.
PyTorch is fully equipped to use multiple GPUs efficiently for accelerated deep learning.
We integrate efficient multi-GPU collectives such as NVIDIA NCCL to make sure you get maximal multi-GPU performance.
When using two GPUs, the training log shows:
Train Epoch: 0 [900/18745 (0.048)]; Acc: 0.929; time cost: 0.722
When using one GPU, the training log shows:
Train Epoch: 0 [15800/18745 (0.843)]; Acc: 0.905; time cost: 0.461
The code is as follows:
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
model = torch.nn.DataParallel(model).cuda()
x = x.cuda(async=True)    # makes no difference whether async=True is passed or not
yt = yt.cuda(async=True)  # x is the input batch, yt the target labels
output = model(x)
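As an aside, CUDA calls are asynchronous, so wall-clock numbers like the time cost above are only trustworthy if the GPU is synchronized before reading the clock. A minimal timing sketch, assuming a CrossEntropyLoss criterion and an SGD optimizer as in the ImageNet example (the learning rate here is a placeholder):

import time
import torch

criterion = torch.nn.CrossEntropyLoss()                  # as in the ImageNet example
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr is a placeholder value

torch.cuda.synchronize()      # flush pending kernels before starting the clock
start = time.time()

output = model(x)             # forward pass; DataParallel splits x across GPUs
loss = criterion(output, yt)
loss.backward()               # gradients are reduced back onto the default GPU
optimizer.step()

torch.cuda.synchronize()      # wait until the step has actually finished
print('time cost: {:.3f}'.format(time.time() - start))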
When using two GPUs, nvidia-smi reports the following:
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:06:00.0     Off |                    0 |
|  0%   56C    P0    74W / 250W |   2440MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           Off  | 0000:87:00.0     Off |                    0 |
|  0%   37C    P0    87W / 250W |   1854MiB / 11519MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16788    C   python                                        1874MiB |
|    0     56331    C   python                                         298MiB |
|    0     58531    C   python                                         207MiB |
|    1     16788    C   python                                        1797MiB |
+-----------------------------------------------------------------------------+
When using one GPU, nvidia-smi reports the following:
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:06:00.0     Off |                    0 |
|  0%   71C    P0   233W / 250W |   3878MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           Off  | 0000:87:00.0     Off |                    0 |
|  0%   26C    P8    18W / 250W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     33037    C   python                                        3312MiB |
|    0     56331    C   python                                         298MiB |
|    0     58531    C   python                                         207MiB |
+-----------------------------------------------------------------------------+
I don’t know what code you are using to benchmark that, but the numbers seem quite off. Multi-GPU on 2 GPUs should be pretty much the same as with Lua Torch right now (which is fast).
Parts of the code were taken from the PyTorch ImageNet training example.
The speed on PyTorch is indeed similar to that on Torch.
What I don’t understand is why two GPUs run slower than one GPU.
If you have very small batches, or a model that can’t fully utilize even a single GPU, using many GPUs will only add communication overhead, without benefits.
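To make the trade-off concrete: DataParallel scatters each input batch across the replicas, so the per-GPU batch shrinks as GPUs are added. A rough sketch of what I mean (the batch size of 32 is just an illustrative number):

import torch
import torchvision.models as models

model = torch.nn.DataParallel(models.resnet18(pretrained=True)).cuda()

# A batch of 32 on 2 GPUs is scattered into two chunks of 16, one per replica.
# Each step then pays for the scatter, the replicated forward/backward, and a
# gradient reduction back onto GPU 0. If 16 samples cannot saturate one M40,
# that communication overhead dominates and 2 GPUs end up slower than 1.
x = torch.randn(32, 3, 224, 224).cuda()
output = model(x)  # outputs are gathered back onto the default GPU

The usual fix is to grow the total batch size with the number of GPUs, so that each replica keeps a full per-GPU batch.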
If I define some math operations on Tensors and Variables that are not in torch.nn, can they be run on multiple GPUs? (I think this is supported in TensorFlow.) @smth
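For concreteness, here is the kind of thing I mean (CustomMath is a made-up module just to illustrate the question):

import torch
import torch.nn as nn

class CustomMath(nn.Module):
    # Plain tensor math, none of it from torch.nn layers.
    def forward(self, x):
        return torch.sqrt(x.pow(2).sum(dim=1) + 1e-8)

# DataParallel replicates the module, so whatever its forward does, including
# raw tensor math like this, runs on each GPU on that GPU's slice of the batch.
model = nn.DataParallel(CustomMath()).cuda()
y = model(torch.randn(8, 4).cuda())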