CUDA: Out of memory error when using multi-gpu

Hi all,
I am trying to fine-tune the BART model from transformers for language generation on a custom dataset (30K examples of length 256, <5 MB on disk).

I have followed the data parallelism guide. Here are the relevant parts of my code:

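import torch
import torch.nn as nn

args.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Wrap the model for multi-GPU use; cuda:0 stays the default device
if args.n_gpu > 1:
    model = nn.DataParallel(model)
model.to(args.device)

# Training
# Total batch size: DataParallel splits each batch across the GPUs
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

for step, batch in enumerate(epoch_iterator):
    model.train()
    batch = tuple(t.to(args.device) for t in batch)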

I am facing a CUDA: Out of memory issue when using a batch size (per gpu) of 4 on 2 gpus. However training works fine on a single GPU. I am trying to train on 2 Titan-X gpus with 12GB memory.

This is the error message

RuntimeError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 11.93 GiB total capacity; 10.84 GiB already allocated; 289.81 MiB free; 277.07 MiB cached) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:267)

If it helps, I am using AdamW optimizer, with linear warmup.
I have tried setting CUDA_VISIBLE_DEVICES, which gives the same error.

Am I missing something?

nn.DataParallel will use more memory on the default device as described here.
We generally recommend using nn.DistributedDataParallel with a single process per GPU to get the best performance.

@ptrblck More memory compared to the other GPUs, or more memory compared to using only 1 GPU?
When I run my code with 1 GPU and batch size 16, it works. But when I run the same code with the same batch size using 2 GPUs (with equal memory), I get an out-of-memory error, and on GPU 1 rather than GPU 0, which is strange because my default device is GPU 0.
My issue looks similar to one discussed here:

The default device should use more memory than the other GPUs, since e.g. you would increase the batch size.
Could you post your model definition so that we could take a look, please?
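To see how memory is actually distributed across devices, something like the following could be called after a training step. This is a sketch: `gpu_memory_report` is a hypothetical helper name, and the numbers only cover tensors allocated by PyTorch in the current process, so they will be lower than what `nvidia-smi` shows.

```python
import torch


def gpu_memory_report():
    """Return {device index: (allocated MiB, peak allocated MiB)} for every visible GPU."""
    report = {}
    if not torch.cuda.is_available():
        return report  # nothing to report on a CPU-only machine
    for idx in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(idx) / 2**20
        peak = torch.cuda.max_memory_allocated(idx) / 2**20
        report[idx] = (round(allocated, 1), round(peak, 1))
    return report


if __name__ == "__main__":
    for idx, (allocated, peak) in gpu_memory_report().items():
        print(f"cuda:{idx}: {allocated} MiB allocated, {peak} MiB peak")
```

With nn.DataParallel, the entry for the default device would be expected to show the highest peak.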

I made another post here

Here is my encoder

class Encoder(nn.Module):
  def __init__(self, input_channels, args):
    super().__init__()
    self.feature_size = args.fMRI_feature_size
    self.hidden_size = self.feature_size
    self.downsample = not args.no_downsample
    self.input_channels = input_channels
    self.two_d = args.fMRI_twoD
    self.end_with_relu = args.end_with_relu
    self.args = args
    init_ = lambda m: init(m,
                           nn.init.orthogonal_,
                           lambda x: nn.init.constant_(x, 0),
                           nn.init.calculate_gain('relu'))
    self.flatten = Flatten()

    if self.two_d:
        self.final_conv_size = 128 * 24 * 30
        self.final_conv_shape = (128, 24, 30)
        self.main = nn.Sequential(
            init_(nn.Conv2d(self.input_channels, 32, (9,10), stride=1)),
            nn.ReLU(),
            init_(nn.Conv2d(32, 64, (9,10), stride=1)),
            nn.ReLU(),
            init_(nn.Conv2d(64, 128, (8,9), stride=1)),
            nn.ReLU(),
            init_(nn.Conv2d(128, 128, (7,8), stride=1)),
            nn.ReLU(),
            Flatten(),
            init_(nn.Linear(self.final_conv_size, self.feature_size))
            #nn.ReLU()
        )
    else:
        self.final_conv_size = 10 * 24 * 30 * 12
        self.final_conv_shape = (10, 24, 30, 12)
        self.main = nn.Sequential(
            init_(nn.Conv3d(self.input_channels, 3, (9, 10, 4), stride=(1, 1, 1))),
            nn.ReLU(),
            init_(nn.Conv3d(3, 5, (9, 10, 3), stride=(1, 1, 1))),
            nn.ReLU(),
            init_(nn.Conv3d(5, 8, (8, 9, 3), stride=(1, 1, 1))),
            nn.ReLU(),
            init_(nn.Conv3d(8, 10, (7, 8, 2), stride=(1, 1, 1))),
            nn.ReLU(),
            Flatten(),
            init_(nn.Linear(self.final_conv_size, self.feature_size)),

            #nn.ReLU()
        )
    self.train()

  def forward(self, inputs, fmaps=False):
    f5 = self.main[:6](inputs)
    f7 = self.main[6:8](f5)
    out = self.main[8:](f7)
    if self.end_with_relu:
        assert self.args.method != "vae", "can't end with relu and use vae!"
        out = F.relu(out)
    if fmaps:
        return {
            'f5': f5.permute(0, 2, 3, 1),
            'f7': f7.permute(0, 2, 3, 1),
            'out': out
        }
    return out

I was using Slurm on a server and found out that GPU 0 in nvidia-smi != cuda:0 in the code.
I made the following two changes:

device = torch.device("cuda:" + "1")
encoder = torch.nn.DataParallel(encoder, device_ids=[1, 0])

Now the error says GPU 0 is out of memory,
so 0 in the code is actually 1 in nvidia-smi, and vice versa.
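One possible explanation, which the thread does not confirm, is that the scheduler sets `CUDA_VISIBLE_DEVICES`, or that CUDA enumerates devices in a different order than `nvidia-smi` (setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` before CUDA initializes is commonly used to align the two). The sketch below uses a hypothetical helper, `logical_to_physical`, to show how the `cuda:N` indices in code map onto the entries of `CUDA_VISIBLE_DEVICES`.

```python
import os


def logical_to_physical(visible=None):
    """Map the cuda:N indices seen by the code to the entries of CUDA_VISIBLE_DEVICES."""
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [entry.strip() for entry in visible.split(",") if entry.strip()]
    return {f"cuda:{i}": physical for i, physical in enumerate(ids)}


# Example: a job launched with CUDA_VISIBLE_DEVICES=1,0 sees the two GPUs swapped.
print(logical_to_physical("1,0"))
```

For the value `"1,0"` this prints `{'cuda:0': '1', 'cuda:1': '0'}`. Comparing `torch.cuda.get_device_name(i)` with the `nvidia-smi` listing is another way to check the mapping when the GPUs are different models.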

So the only remaining question is why using 2 GPUs with DataParallel runs out of memory while using 1 GPU with the same data and batch size doesn't.
Does DataParallel require extra memory to manage the parallelism?

nn.DataParallel might create an overhead, so we recommend using DistributedDataParallel with a single process for each device.
This would be the fastest approach without creating the imbalanced memory usage.

Could the PyTorch team fix it then? It's important to have a single-process multi-device solution, especially since multi-GPU nodes keep getting bigger, which makes multi-node training less and less useful. We need something simpler than DDP, please.

DDP is not only used for multi-node training, but is also speeding up single-node multi-GPU workloads.
The current proposal is to deprecate DataParallel and, in this sense, to ramp up the documentation on DDP.

Yes, I know! I'm just suggesting: since single-node multi-device training is the entry point to distributed training, and will likely be enough for most users, why should it be done with DDP (which is so insanely complex to get working) when the torch.nn.DataParallel API is so compact and friendly? I know that DataParallel in its current form performs badly, but from the customer's perspective, this is the experience the PyTorch team should aim for in DDP's future. No torchrun, no MPI, no processes to spawn manually, etc. It's better to have a single-code, single-process experience, like MXNet Gluon offers. Again, I'm not saying it's possible today in PyTorch; I'm just saying this is what you should aim for, so that the data-parallelism experience is better. Think about how easy it is to write a single-file PySpark data processing job and run it over multiple machines. Why aren't things this simple in the DL training world yet? We're clearly lagging behind the big data world in terms of abstracting away complexity.

DDP can solve the imbalanced memory use, but it will use more memory on each GPU for communication, right?
For the case of large samples and a large model (e.g. I can only use batch_size=1), are there any suggestions for parallel training?
Thanks a lot!

No, I don’t think additional data is needed as the gradients are calculated in any case. Why would this be the case?

Because when I ran the same program with DDP, it consumed more GPU memory and threw an OOM error. Without DDP, the memory usage is about 18000M, but with DDP 24000M is not enough.

That sounds like a lot of overhead, but might also depend on the actual model or your setup.
I've checked the memory overhead used for the buckets and see that the memory increases from approx. 13.4GB to 14.1GB for a ResNet, so I'm unsure where your ~33% overhead is coming from.

I compared against code from other people's work, and the problem was caused by not calling torch.cuda.set_device(). After setting this, DDP works fine.
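A sketch of what that fix might look like, assuming the launcher exposes the per-node rank through the `LOCAL_RANK` environment variable (as `torchrun` does). `pin_process_to_gpu` is a hypothetical helper name. A plausible reason for the extra memory, not verified in this thread, is that without this call each process also creates a CUDA context on GPU 0.

```python
import os

import torch


def pin_process_to_gpu(local_rank=None):
    """Make this process use only its own GPU and return the matching device."""
    if local_rank is None:
        # Assumption: the launcher exports LOCAL_RANK for every process.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if not torch.cuda.is_available():
        return torch.device("cpu")  # lets the sketch run on CPU-only machines
    torch.cuda.set_device(local_rank)  # later "cuda" allocations go to this GPU
    return torch.device("cuda", local_rank)
```

The call would go at the very start of each worker, before the process group is initialized and before any tensor or model is moved to the GPU.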

Thanks a lot for your help!