How to specify GPU usage?

I am training different models on different GPUs.

I have 4 GPUs, indexed 0, 1, 2, 3.

I tried this:

model = torch.nn.DataParallel(model, device_ids=[0,1]).cuda()

But the actual process uses GPUs 2 and 3 instead.

And if I use:

model = torch.nn.DataParallel(model, device_ids=[1]).cuda()

I will get the error:

RuntimeError: Assertion `THCTensor_(checkGPU)(state, 4, r_, t, m1, m2)' failed. at /data/users/soumith/miniconda2/conda-bld/pytorch-cuda80-0.1.8_1486039719409/work/torch/lib/THC/generic/THCTensorMathBlas.cu:230

How do I specify GPU usage by index?


I am using Ubuntu 16.04. My GPU indexing is the same as yours.

If you want to execute xxx.py using only GPUs 0,1 in Ubuntu 16.04, use the following command:

CUDA_VISIBLE_DEVICES=2,3 python xxx.py

with nn.DataParallel in xxx.py.

In addition, I don’t think that DataParallel accepts only one GPU.


Thanks a lot, it works :)
I hope PyTorch can integrate an argument like this to specify GPU usage.

What’s your PyTorch version? It should accept a single GPU. How is it even possible that it uses the last two GPUs if you specify device_ids=[0,1]?

If you run your script with CUDA_VISIBLE_DEVICES=2,3 it will always execute on the last two GPUs, not on the first ones. I can’t see how that helps in this case. CUDA_VISIBLE_DEVICES=0,1 would make more sense.
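To make the masking behaviour concrete, here is a rough sketch in plain Python (no GPU needed). `physical_gpu` is a hypothetical helper, not part of PyTorch or CUDA; it only mimics the renumbering. The devices listed in CUDA_VISIBLE_DEVICES are exposed to the process as 0, 1, ... in the listed order, so device_ids=[0,1] inside the script refers to whatever the mask lists first and second. The sketch assumes the mask contains plain integer indices rather than UUIDs.

```python
def physical_gpu(visible_devices: str, logical_index: int) -> int:
    """Map a device index seen inside the process to the index named in the mask.

    Hypothetical helper for illustration only; it mimics how the CUDA runtime
    renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0.
    """
    ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    if not 0 <= logical_index < len(ids):
        raise IndexError(
            f"device {logical_index} is not visible with mask {visible_devices!r}"
        )
    return ids[logical_index]


# With CUDA_VISIBLE_DEVICES=2,3 the script sees two devices, 0 and 1,
# which correspond to the third and fourth devices in the runtime's numbering.
mapping = [physical_gpu("2,3", i) for i in range(2)]
print(mapping)
```

So with this mask, device_ids=[0,1] in the script lands on the devices the CUDA runtime calls 2 and 3, and asking for device 2 inside the script fails because only two devices are visible.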


I am using pytorch 0.1.9 and Ubuntu 16.04.

When I use CUDA_VISIBLE_DEVICES=2,3 (0,1), ‘nvidia-smi’ tells me that gpus 0,1 (2,3) are used.

I do not know the reason, but the gpu id used in nvidia-smi and the gpu id used in pytorch are reversed.

You can check it if you use Ubuntu 16.04.


I think it is more likely a CUDA/NVIDIA problem.
I have run into this problem before when using Caffe with Tesla K10/K80 GPUs.

@Seungyoung_Park from my experience, it’s usually nvidia-smi that is reversed with everything else.
For example, on my machine, the numbering from pytorch agrees with the numbering of the deviceQuery nvidia sample (and any cuda program for that matter) while nvidia-smi is the only one giving a different numbering.
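If you want to check the numbering on your own machine, a small diagnostic like the one below might help. It is only a sketch: `cuda_devices` is a name I made up, while `torch.cuda.device_count` and `torch.cuda.get_device_name` are existing PyTorch calls. The import is guarded so the script still runs (and reports nothing) on a machine without PyTorch or without a GPU. Compare its output with what `nvidia-smi -L` prints.

```python
def cuda_devices():
    """Return (index, name) pairs in the order PyTorch numbers the GPUs.

    Returns an empty list when PyTorch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    return [
        (i, torch.cuda.get_device_name(i))
        for i in range(torch.cuda.device_count())
    ]


devices = cuda_devices()
for index, name in devices:
    print(f"PyTorch device {index}: {name}")
```

This only tells the two numberings apart when the GPUs are different models; on a machine with identical cards the names will not help.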


Thanks a lot for your answers.
My PyTorch version is 0.1.8.
There may be a numbering problem with the GPU devices, but it does not affect our usage.
My problem was about how to allocate GPUs, and now everything is fine :)

I’m curious about this as well. Can you currently use fractional GPU usage as in tensorflow? The tf equivalent is something like this:

    with tf.device(FLAGS.device):
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.device_percentage)
        sess_cfg = tf.ConfigProto(allow_soft_placement=FLAGS.allow_soft_placement,
                                  gpu_options=gpu_options)

How does one use GPUs if one has a custom NN class (that inherits from torch.nn.Module)?

For example, I know that using the easy example from (http://pytorch.org/tutorials/beginner/pytorch_with_examples.html) one can just change the type of the tensors being created:

dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

However, when using things like torch.nn.Linear and also Variable, how does one make sure to use GPUs?

Also, do I really have to track how GPUs are assigned? I am fine with torch just doing its stuff automagically.

In particular I would love to see how:

http://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_module.html#sphx-glr-beginner-examples-nn-two-layer-net-module-py

is made into a GPU version of it.

Related SO question: https://stackoverflow.com/questions/45553613/how-does-one-make-sure-that-everything-is-running-on-gpu-automatically-in-pytorc
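For what it's worth, here is a minimal sketch of how a custom module can be moved to the GPU. It assumes a reasonably recent PyTorch (0.4 or later), where `torch.device` and `.to()` exist and tensors no longer need to be wrapped in Variable. The `TwoLayerNet` class is modelled on the tutorial linked above, and the layer sizes are just example values. The code falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn


class TwoLayerNet(nn.Module):
    """Small custom module, modelled on the two-layer net from the tutorial."""

    def __init__(self, d_in, hidden, d_out):
        super().__init__()
        self.linear1 = nn.Linear(d_in, hidden)
        self.linear2 = nn.Linear(hidden, d_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)


# Pick the first visible GPU if there is one, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# .to(device) moves every parameter registered in the module, including those
# of the nested nn.Linear layers, so no per-layer bookkeeping is needed.
model = TwoLayerNet(1000, 100, 10).to(device)

# The inputs must live on the same device as the model.
x = torch.randn(64, 1000, device=device)
y = model(x)
print(y.shape, y.device)
```

PyTorch does not move anything automatically: the model and its inputs both have to be placed on the device explicitly, but one `.to(device)` call on the top-level module is enough for all of its submodules.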

Dear All,

I have installed the Nvidia CUDA 9.0 toolkit with cuDNN on my Ubuntu machine.
I have also installed PyTorch. When I try to check GPU usage by running the code below:

import torch
print(torch.rand(2,3).cuda())

I am getting the below error:


RuntimeError Traceback (most recent call last)
in ()
----> 1 print(torch.rand(2,3).cuda())

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py in cuda(self, device, async)
67 else:
68 new_type = getattr(torch.cuda, self.__class__.__name__)
---> 69 return new_type(self.size()).copy_(self, async)
70
71

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/__init__.py in _lazy_new(cls, *args, **kwargs)
385 # We need this method only for lazy init, so we can remove it
386 del _CudaBase.__new__
--> 387 return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
388
389

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCStorage.cu:58

I think pytorch is not communicating with the Nvidia GPU, please advise.

Regards
Saurabh Jha

This error might occur after you installed CUDA etc. without restarting your machine.
Have you rebooted after the driver installation?

Yes, you are correct. It was fine after I restarted the machine.

For a Unix command-line solution you can also do:

export CUDA_VISIBLE_DEVICES=$i

though of course that only works if the scripts are independent and stuff like that…otherwise the other solutions here are probably better…

CUDA_VISIBLE_DEVICES=$i python main.py
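The same idea can be driven from Python if you prefer that to a shell loop. This is just a sketch: `launch_per_gpu` is a hypothetical helper, and the child command here is a stand-in that only prints the mask it received. In real use you would pass something like [sys.executable, "main.py"] instead. As with the shell version, it only makes sense when the runs are independent of each other.

```python
import os
import subprocess
import sys


def launch_per_gpu(gpu_ids, argv):
    """Start one copy of `argv` per GPU id, each seeing only that GPU.

    Hypothetical helper: every child gets its own CUDA_VISIBLE_DEVICES value,
    so inside each child the single visible GPU is device 0.
    """
    procs = []
    for gpu in gpu_ids:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(
            subprocess.Popen(argv, env=env, stdout=subprocess.PIPE, text=True)
        )
    # Wait for all children and collect what each one printed.
    return [p.communicate()[0].strip() for p in procs]


# Stand-in for "python main.py": a child that reports the mask it was given.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
outputs = launch_per_gpu([0, 1], child)
print(outputs)
```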

hi, do you have the answer?

Is there anyone who knows that…

When I add the code below to my Python file (main.py),

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

it doesn’t behave the same way as CUDA_VISIBLE_DEVICES=0 python main.py does.

The former doesn’t restrict which GPUs are used, but the latter works well.

It seems strange to me.

Thanks ahead.

I wouldn’t recommend the first approach, since you would have to make sure these lines of code are executed before any other library that might take the GPU is imported. If some script imports PyTorch and these lines are executed afterwards, they won’t have any effect anymore.

The second approach makes sure to mask the devices before running the Python script.
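If you still want to set the mask from inside Python, a cautious pattern is to do it at the very top of the entry script and to check whether PyTorch was already imported. The sketch below is only an illustration: `mask_gpus` is a made-up helper, and treating "torch is already imported" as "too late" is a conservative assumption on my part, since what actually matters is whether the CUDA runtime has been initialized yet.

```python
import os
import sys


def mask_gpus(visible, order="PCI_BUS_ID"):
    """Set the CUDA masking variables for the current process.

    Returns True if torch had not been imported yet when the variables were
    set, False otherwise (in which case the mask may not take effect).
    Hypothetical helper for illustration.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_DEVICE_ORDER"] = order
    os.environ["CUDA_VISIBLE_DEVICES"] = visible
    return not torch_already_imported


# Call this before anything that imports torch.
applied_early = mask_gpus("0")
if not applied_early:
    print("warning: torch was imported before the GPU mask was set")

# import torch  # only import PyTorch (or libraries that use it) after this point
```

Setting the variable on the command line avoids this ordering problem entirely, which is why it is the safer choice.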


Totally understand thanks!!

  1. Try CUDA_VISIBLE_DEVICES=0,1,2,3 python xxx.py to specify the GPUs.
  2. Add os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" to your Python code.