Use memory of all GPUs


I have a server equipped with 4 Titan X GPUs. The problem is: when I run my code, it reports “cuda runtime error (2) : out of memory”. The output of gpustat tells me that only GPU 0’s memory is being used.

The corresponding TensorFlow code can use all the memory from the 4 GPUs, which is 48 GB in my case.

Can anyone help?



You most certainly want to use a DataParallel wrapper around your network.
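For reference, here is a minimal sketch of what that wrapping looks like — the `nn.Sequential` model and its layer sizes are just stand-ins for your own network:

```python
import torch
import torch.nn as nn

# Stand-in model; replace with your own network.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on each GPU and splits every
    # input batch along dim 0, one chunk per device.
    model = nn.DataParallel(model).cuda()

out = model(torch.randn(4, 3, 16, 16))  # input is [batch, C, H, W]
```

On a single-GPU or CPU-only machine the `if` branch is skipped and the model runs unwrapped, so the same script works everywhere.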


Yes, either model or data parallelism. See also Model parallelism in Multi-GPUs: forward/backward graph

Hi albanD

I just noticed there is a need to use DataParallel. However, simply wrapping my network with it causes trouble:

File "…/torch/nn/parallel/", line 67, in parallel_apply
    raise output
IndexError: index 5 is out of range for dimension 0 (of size 5)

How I wrap it:
net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3]).cuda()

I am tracing back through the code. Please let me know if you have any idea.


Maybe the extra .cuda() on the DataParallel wrapper is causing the problem.
Take a look at the CUDA semantics and DataParallel docs.

Have you tried it this way?

device = "cuda" if torch.cuda.is_available() else "cpu"
model = net()
if torch.cuda.device_count() > 1:
    # device_ids defaults to all available GPUs
    model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model.to(device)

It reports the same error, even with the code snippet you provided. Thanks anyway.

I suspect the reason is that I use a for loop in the code. My input format is [batch, time, height, width, channel], and my code loops over the time axis.

FYI, code runs fine on CPU.


Can you show me your code?

Thanks for following up with me.

First, the code takes up so much memory because I save the result of every epoch into a list and forget to release it, like below:

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3)   # channel sizes are placeholders
        self.conv2 = nn.Conv2d(16, 32, 3)
        self.conv3 = nn.Conv2d(32, 64, 3)
        self.feature_list = []

    def forward(self, X):
        x = self.conv3(self.conv2(self.conv1(X)))
        self.feature_list.append(x)  # results accumulate here and are never freed
        return self.feature_list
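For what it's worth, one way to plug that kind of leak is to clear the list on each forward pass and store detached tensors, so the autograd graph can be freed. A sketch (the single conv layer and its sizes are made up):

```python
import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)  # placeholder layer
        self.feature_list = []

    def forward(self, X):
        self.feature_list.clear()             # drop results from the previous step
        x = self.conv(X)
        self.feature_list.append(x.detach())  # detach so the graph isn't kept alive
        return x

net = Network()
out = net(torch.randn(2, 3, 16, 16))
```

Without the `detach()`, every stored tensor keeps its whole computation graph alive, which is often the real source of "out of memory" across epochs.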

As for the DataParallel issue: as I suspected, my code has a for loop over the temporal/time axis, which I put in the first axis, like [time, batch, channel, height, width]. So with 20 time steps and 4 GPUs, DataParallel splits my input into 4 shares, each of length 5. Thus when I try to index the 6th element, it reports an index-out-of-range error.
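To illustrate the split: DataParallel scatters the input along dim 0, which `torch.chunk` mimics below. The shapes are made up to match the 20-step / 4-GPU case:

```python
import torch

x = torch.randn(20, 8, 3, 32, 32)  # [time, batch, C, H, W]

# DataParallel splits along dim 0: each of the 4 replicas sees only
# 5 time steps, so indexing time step 5 (the 6th) raises IndexError.
shards = torch.chunk(x, 4, dim=0)

# With batch first, the batch is split instead and every replica
# keeps the full 20-step time axis.
x_bf = x.permute(1, 0, 2, 3, 4)    # [batch, time, C, H, W]
shards_bf = torch.chunk(x_bf, 4, dim=0)
```

So moving the loop axis out of dim 0 (batch first) is exactly why the error goes away.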

Thanks a lot


Great, I couldn’t see how you released the memory, but very nice to know you found the problem. :slight_smile: