Does it support multiple GPU cards on a single node?

Hi,

I am currently using Keras with its simplified interface for synchronous data parallelism.

I would like to know whether PyTorch has a similar feature. PyTorch seems to advertise its speed, so it really ought to support multi-GPU on a single node.

Thanks,
Shawn

Hi Shawn,

Yes, we support multi-GPU on a single machine.

Check out our examples:

https://github.com/pytorch/examples/tree/master/imagenet
https://github.com/pytorch/examples/tree/master/dcgan

Also check out the corresponding documentation:

http://pytorch.org/docs/nn.html#multi-gpu-layers

Oh, that’s super nice; I’ll have to give it a try later. So basically I just need to wrap torch.nn.DataParallel around my model and it’s good to go!? Neat.
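
Something like this, I assume (a minimal sketch; resnet18 and the input shape are just placeholders, and it assumes at least one CUDA GPU is available):

import torch
import torch.nn as nn
import torchvision.models as models

# Any nn.Module should work here; resnet18 is only a placeholder.
model = models.resnet18()

# DataParallel splits each input batch along dim 0 across the visible
# GPUs, runs the replicas in parallel, and gathers the outputs.
model = nn.DataParallel(model).cuda()

x = torch.randn(32, 3, 224, 224).cuda()  # dummy batch
output = model(x)                        # training code is otherwise unchanged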

PS: maybe it’s worth mentioning multi-GPU support in the README (e.g., as a subsection in pytorch/pytorch). For instance, as Adam Paszke wrote in apaszke/pytorch-dist:

Multi-GPU ready

PyTorch is fully powered to efficiently use Multiple GPUs for accelerated deep learning.
We integrate efficient multi-gpu collectives such as NVIDIA NCCL to make sure that you get the maximal Multi-GPU performance.

@rasbt thanks, we’ll add that in the next week or two; we plan to thoroughly benchmark ourselves first and then add it.

Hmm, have you figured out how to use DataParallel? I can’t for the life of me get it to work! :-/

@Kalamaya have a look at the examples when in doubt. The ImageNet example and the DCGAN example both show how it is used.

When using two GPUs, the speed is:
Train Epoch: 0 [900/18745 (0.048)]; Acc: 0.929; time cost: 0.722

When using one GPU, the speed is:
Train Epoch: 0 [15800/18745 (0.843)]; Acc: 0.905; time cost: 0.461

The code is as follows:

import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
model = torch.nn.DataParallel(model).cuda()
# x and yt come from the data loader; there is no difference
# whether or not we include async=True
x = x.cuda(async=True)
yt = yt.cuda(async=True)
output = model(x)

When using two GPUs, the nvidia-smi output is recorded as follows:

+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:06:00.0     Off |                    0 |
|  0%   56C    P0    74W / 250W |   2440MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           Off  | 0000:87:00.0     Off |                    0 |
|  0%   37C    P0    87W / 250W |   1854MiB / 11519MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16788    C   python                                        1874MiB |
|    0     56331    C   python                                         298MiB |
|    0     58531    C   python                                         207MiB |
|    1     16788    C   python                                        1797MiB |
+-----------------------------------------------------------------------------+

When using one GPU, the nvidia-smi output is recorded as follows:

+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:06:00.0     Off |                    0 |
|  0%   71C    P0   233W / 250W |   3878MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           Off  | 0000:87:00.0     Off |                    0 |
|  0%   26C    P8    18W / 250W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     33037    C   python                                        3312MiB |
|    0     56331    C   python                                         298MiB |
|    0     58531    C   python                                         207MiB |
+-----------------------------------------------------------------------------+

How can we improve the efficiency using two GPUs?

I don’t know what code you are using to benchmark that, but the numbers seem quite off. Multi-GPU performance on 2 GPUs should currently be pretty much the same as with Lua Torch (which is fast).

Parts of the code were taken from the ImageNet training example in PyTorch.
The speed of PyTorch is indeed similar to that of Torch.
What I wonder is why two GPUs run slower than one GPU?

If you have very small batches or a model that can’t even fully utilize a single GPU, using many GPUs will only add communication overhead, without benefits.
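
A rough way to see this yourself (just a sketch, not a rigorous benchmark; the model and batch sizes are arbitrary, and it assumes two visible GPUs):

import time
import torch
import torch.nn as nn
import torchvision.models as models

def avg_step_time(model, batch_size, n_iters=20):
    # Time forward + backward; synchronize so CUDA's asynchronous
    # kernel launches don't distort the measurement.
    x = torch.randn(batch_size, 3, 224, 224).cuda()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model(x).sum().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters

single = models.resnet18().cuda()
multi = nn.DataParallel(models.resnet18()).cuda()

# With a tiny batch, the scatter/gather and replica synchronization
# dominate and DataParallel can be slower; with a large batch it should win.
for bs in (8, 256):
    print(bs, avg_step_time(single, bs), avg_step_time(multi, bs))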

Got it! Thank you very much! :)

If I define some math operations on Tensors and Variables that are not in torch.nn, can they be performed on multiple GPUs (I think this is supported in TensorFlow)? @smth
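
To make the question concrete, here is a sketch of the kind of thing I mean; my assumption (not confirmed here) is that wrapping the math in an nn.Module’s forward would let DataParallel replicate it like any layer:

import torch
import torch.nn as nn

class CustomMath(nn.Module):
    # Plain tensor math only; nothing from torch.nn is used inside.
    def forward(self, x):
        return torch.exp(-x.pow(2)) + torch.sin(x)

model = nn.DataParallel(CustomMath()).cuda()  # assumes CUDA GPUs are available
x = torch.randn(64, 128).cuda()
y = model(x)  # the batch is split across GPUs; the math runs on each replica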

Can you add some comments to these examples?