Can I specify a GPU other than GPU 0?

There are 4 GPUs on my machine. GPUs 0 and 1 are running someone else's code with nearly full memory usage.

So I want to use GPUs 2 and 3. But when I run my code, it still reports:

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 10.92 GiB total capacity; 10.21 GiB already allocated; 89.50 MiB free; 9.64 MiB cached)

Here is my DataParallel code:

os.environ['CUDA_DEVICE_ORDER']='PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES']='2,3'
#this function is listed below
model=DataParallelModel(model,cuda=0,device_ids=[0,1],output_device=0)
device=torch.device('cuda:0')

# This DataParallelModel function wraps the DataParallel class
def DataParallelModel(model,**kwargs):
        if 'device_ids' in kwargs.keys():
            device_ids = kwargs['device_ids']
        else:
            device_ids = None
        if 'output_device' in kwargs.keys():
            output_device = kwargs['output_device']
        else:
            output_device = None
        if 'cuda' in kwargs.keys():
            cudaID = kwargs['cuda']
            device=torch.device('cuda:{}'.format(cudaID))
            model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).to(device)
        else:
            model = torch.nn.DataParallel(model, device_ids=device_ids, output_device=output_device).cuda()
        return model

These environment-variable lines do work; they make my code see only GPUs 2 and 3. But it still fails with the error shown above.
My PyTorch version is the newly released 1.0; I upgraded yesterday morning.

If you don't need to specify a GPU ordering for any special reason, just use

model=DataParallelModel(model)

It's agnostic to device IDs.
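For reference, a minimal sketch of that device-agnostic version (the nn.Linear model is just a placeholder; the point is that with CUDA_VISIBLE_DEVICES='2,3' the process only sees two devices, which it numbers cuda:0 and cuda:1):

import os
# must be set before the first CUDA call in the process
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
# device_ids defaults to all visible devices, i.e. [0, 1] here (physical GPUs 2 and 3)
model = torch.nn.DataParallel(model).cuda()
print(torch.cuda.device_count())  # expected: 2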

I modified it, but it still doesn't work.
I changed every "to(device)" to "cuda()" and used model=DataParallelModel(model).
It reports:

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 10.92 GiB total capacity; 10.21 GiB already allocated; 89.50 MiB free; 9.64 MiB cached)

If that's the case, are you sure you are choosing the correct devices? You can check whether you can manually allocate tensors:
a = torch.rand(100).cuda(idx), manually changing idx,

and check whether the idx matches the nvidia-smi order, or in your case the PCI order.
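For example, something along these lines (torch.cuda.get_device_name is only there to label the output):

import torch

# allocate a small tensor on each visible device and compare the reported
# name/index with what nvidia-smi shows
for idx in range(torch.cuda.device_count()):
    a = torch.rand(100).cuda(idx)
    print(idx, torch.cuda.get_device_name(idx), a.device)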

The result is as follows:
Without os.environ['CUDA_DEVICE_ORDER']='PCI_BUS_ID', torch.cuda uses the same device order as nvidia-smi.
With this environment variable set, the CUDA order and the nvidia-smi order differ:

cuda(0)----gpu2
cuda(1)----gpu3
cuda(2)----gpu0
cuda(3)----gpu1

and the same error is reported.

By "the same error" I mean: I commented out the PCI_BUS_ID line so the CUDA order matches the nvidia-smi order, left the other code unchanged, and torch.rand did create tensors on the specified GPUs. But the same error occurred when I ran the training code.

Well, I'm afraid that if you still have this issue then it's a bug, because CUDA_VISIBLE_DEVICES is very standard.

Do you consider it a machine bug, a system bug, or a code bug?
I'm using Ubuntu 16.04 with CUDA 9.0, cuDNN 7.0.5, and PyTorch 1.0.
As mentioned before, the environment variables do make my code see only the selected GPUs and exclude the others, and then I can use .cuda() or .to() to move data onto the visible GPUs and run on them. Is my description of this process correct? I don't know where to start debugging, so I need to confirm that this workflow is right.

That's theoretically correct. If you set CUDA_VISIBLE_DEVICES, PyTorch is unable to see the other GPUs.

Therefore, if you are sure you are not making a mistake when assigning CUDA_VISIBLE_DEVICES, you are facing either a bug or a broken library setup.

Can you check whether, after setting CUDA_VISIBLE_DEVICES, you can no longer access the other GPUs?
You can use torch.cuda.device_count().
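A quick sketch of that check (assuming the variable is set before the first CUDA call; the exact error message may differ):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'  # must happen before CUDA is initialized

import torch

print(torch.cuda.device_count())  # should print 2, not 4

# indices outside the visible range should not be reachable
try:
    torch.rand(10).cuda(2)
except RuntimeError as e:
    print('device 2 is hidden as expected:', e)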

I've debugged it and found the culprit. I'm posting it here hoping to get a clear explanation from you, because I'm really puzzled about why it happens.
It's my forward pass. I create 8 residual blocks named res1 to res8, and in the forward pass I use:

# residual block
        def make_resblock(in_channels, out_channels, kernel_size, stride, padding):
            return nn.Sequential(
                nn.ReflectionPad2d(padding),
                nn.Conv2d(in_channels, out_channels, kernel_size, stride),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(),
                nn.ReflectionPad2d(padding),
                nn.Conv2d(out_channels, out_channels, kernel_size, stride),  # second conv maps out_channels -> out_channels
                nn.BatchNorm2d(out_channels),
                nn.ReLU()
            )
        self.res1 = make_resblock(256, 256, 3, 1, 1)
        # res2 to res8 are defined the same way

# this part is in the forward pass
        for i in range(1, 9):
            res = input                                       # keep the block input for the skip connection
            input = getattr(self, 'res{}'.format(i))(input)
            input = res + input                               # residual addition

When I comment out this part, the code runs correctly.
But I'm puzzled: why? Is there something wrong?
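For reference, the same structure written as a small self-contained module with an nn.ModuleList; this is only a sketch with the channel count and block count assumed from the post above, and it has the same memory footprint as the loop version, it just makes the block registration and the loop explicit:

import torch
import torch.nn as nn

class ResStack(nn.Module):
    def __init__(self, channels=256, n_blocks=8):
        super(ResStack, self).__init__()
        def make_resblock(c, k, s, p):
            return nn.Sequential(
                nn.ReflectionPad2d(p),
                nn.Conv2d(c, c, k, s),
                nn.BatchNorm2d(c),
                nn.ReLU(),
                nn.ReflectionPad2d(p),
                nn.Conv2d(c, c, k, s),
                nn.BatchNorm2d(c),
                nn.ReLU()
            )
        # ModuleList registers all the blocks as submodules
        self.blocks = nn.ModuleList(
            [make_resblock(channels, 3, 1, 1) for _ in range(n_blocks)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual connection
        return x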

The environment variables are set correctly, so it must be my code's fault, but I don't know the reason.

Because maybe you are making the net small enough to fit in the leftover memory of the in-use GPU. That's why I'm telling you to check whether CUDA_VISIBLE_DEVICES works.

If it works, you have just assigned the devices wrongly.

I've checked:

  • CUDA_VISIBLE_DEVICES works fine; it excludes the other GPUs. I confirmed this by trying to create tensors on the excluded GPU indices and with the torch.cuda APIs.
  • The usage of the in-use GPUs stays unchanged when I run the code with the residual for loop commented out (it runs in that case), and the usage of the two GPUs I exposed increases from 0 to about 5 GB; data flows into both of them.

I also rewrote the for loop as separate lines using only res1 and res2, and it works fine, but with the full for-loop code it fails.

I tested again: I can use at most 3 residual blocks. With more, it reports an error with exactly the same numbers and wording.

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 10.92 GiB total capacity; 9.63 GiB already allocated; 223.50 MiB free; 458.66 MiB cached)

No matter whether I use 4, 5, 6, 7, or 8 blocks in the forward pass, it reports this error, and the numbers are exactly the same.

Well, so the logical conclusion is that the problem you reported is not related to the other GPUs; rather, the model you are using plus the input size is simply bigger than the GPU's capacity.
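One way to confirm this is to watch the allocator counters around a single forward/backward pass; `model` and `batch` below are placeholders for whatever the training script actually uses:

import torch

out = model(batch.cuda())   # `model` and `batch` stand in for the real training objects
loss = out.mean()           # dummy scalar just so backward can run
loss.backward()

# per-device counters, in MiB (device 0 here is physical GPU 2 because of CUDA_VISIBLE_DEVICES)
print(torch.cuda.memory_allocated(0) / 1024**2, 'MiB currently allocated')
print(torch.cuda.max_memory_allocated(0) / 1024**2, 'MiB peak')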
