I am currently using Keras with a simplified interface for the synchronous version of data parallelism.
I would like to know whether PyTorch has a similar feature. Since PyTorch advertises its speed, it really ought to support multi-GPU training on a single node.
Oh, that’s super nice, I’ll have to give it a try later. So basically I just need to wrap my model in torch.nn.DataParallel and it’s good to go? Neat.
PyTorch is fully equipped to use multiple GPUs efficiently for accelerated deep learning.
We integrate efficient multi-GPU collectives such as NVIDIA NCCL to make sure you get maximal multi-GPU performance.
When using two GPUs, the training log shows:
Train Epoch: 0 [900/18745 (0.048)]; Acc: 0.929; time cost: 0.722
When using one GPU, the training log shows:
Train Epoch: 0 [15800/18745 (0.843)]; Acc: 0.905; time cost: 0.461
The code is as follows:
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True)
model = torch.nn.DataParallel(model).cuda()
x = x.cuda(async=True)    # makes no difference whether async=True is passed or not
yt = yt.cuda(async=True)  # x is the input batch, yt the target labels
output = model(x)
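As an aside, CUDA calls are asynchronous, so wall-clock numbers like the time cost above are only trustworthy if the GPU is synchronized before reading the clock. A minimal timing sketch, assuming a CrossEntropyLoss criterion and an SGD optimizer as in the ImageNet example (the learning rate here is a placeholder):

import time
import torch

criterion = torch.nn.CrossEntropyLoss()                  # as in the ImageNet example
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr is a placeholder value

torch.cuda.synchronize()      # flush pending kernels before starting the clock
start = time.time()

output = model(x)             # forward pass; DataParallel splits x across GPUs
loss = criterion(output, yt)
loss.backward()               # gradients are reduced back onto the default GPU
optimizer.step()

torch.cuda.synchronize()      # wait until the step has actually finished
print('time cost: {:.3f}'.format(time.time() - start))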
When using two GPUs, nvidia-smi reports the following:
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:06:00.0     Off |                    0 |
|  0%   56C    P0    74W / 250W |   2440MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           Off  | 0000:87:00.0     Off |                    0 |
|  0%   37C    P0    87W / 250W |   1854MiB / 11519MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16788    C   python                                        1874MiB |
|    0     56331    C   python                                         298MiB |
|    0     58531    C   python                                         207MiB |
|    1     16788    C   python                                        1797MiB |
+-----------------------------------------------------------------------------+
When using one GPU, nvidia-smi reports the following:
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 0000:06:00.0     Off |                    0 |
|  0%   71C    P0   233W / 250W |   3878MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           Off  | 0000:87:00.0     Off |                    0 |
|  0%   26C    P8    18W / 250W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     33037    C   python                                        3312MiB |
|    0     56331    C   python                                         298MiB |
|    0     58531    C   python                                         207MiB |
+-----------------------------------------------------------------------------+
I don’t know what code you are using to benchmark that, but the numbers seem quite off. Multi-GPU on 2 GPUs should be pretty much the same as with Lua Torch right now (which is fast).
Parts of the code were taken from the PyTorch ImageNet training example.
The speed on PyTorch is indeed similar to that on Torch.
What I don’t understand is why two GPUs run slower than one GPU.
If you have very small batches, or a model that can’t fully utilize even a single GPU, using many GPUs will only add communication overhead, without benefits.
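To make the trade-off concrete: DataParallel scatters each input batch across the replicas, so the per-GPU batch shrinks as GPUs are added. A rough sketch of what I mean (the batch size of 32 is just an illustrative number):

import torch
import torchvision.models as models

model = torch.nn.DataParallel(models.resnet18(pretrained=True)).cuda()

# A batch of 32 on 2 GPUs is scattered into two chunks of 16, one per replica.
# Each step then pays for the scatter, the replicated forward/backward, and a
# gradient reduction back onto GPU 0. If 16 samples cannot saturate one M40,
# that communication overhead dominates and 2 GPUs end up slower than 1.
x = torch.randn(32, 3, 224, 224).cuda()
output = model(x)  # outputs are gathered back onto the default GPU

The usual fix is to grow the total batch size with the number of GPUs, so that each replica keeps a full per-GPU batch.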
If I define some math operations on Tensors and Variables that are not in torch.nn, can they be run on multiple GPUs? (I think this is supported in TensorFlow.) @smth
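For concreteness, here is the kind of thing I mean (CustomMath is a made-up module just to illustrate the question):

import torch
import torch.nn as nn

class CustomMath(nn.Module):
    # Plain tensor math, none of it from torch.nn layers.
    def forward(self, x):
        return torch.sqrt(x.pow(2).sum(dim=1) + 1e-8)

# DataParallel replicates the module, so whatever its forward does, including
# raw tensor math like this, runs on each GPU on that GPU's slice of the batch.
model = nn.DataParallel(CustomMath()).cuda()
y = model(torch.randn(8, 4).cuda())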