Cross-posting from stackoverflow, because it wasn’t getting much attention there. There’s an open bounty, and if anyone answers over there, I’m happy to award it to you.
The question is: when I use DistributedDataParallel, I see almost exactly double the memory usage on each GPU compared to single-GPU training - it looks like two copies of the model are being stored on every GPU. Why does the model take up twice the space under DDP? Is this intended behavior? Is there a way to avoid the extra memory usage?
Here is a minimal working example.
import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch


def train(rank, gpu_list, train_distributed):
    device_id = gpu_list[rank]

    # Build the model, move it to this process's GPU, and print the
    # allocated memory on that device at each step.
    model = torch.nn.Linear(1000, 1000)
    print(device_id, torch.cuda.memory_allocated(device_id))
    model.to(device_id)
    print(device_id, torch.cuda.memory_allocated(device_id))
    print(device_id, torch.cuda.memory_allocated(device_id))

    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)

    print(device_id, torch.cuda.memory_allocated(device_id))


def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'

    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)


if __name__ == '__main__':
    # First test one GPU
    print("Single GPU")
    train(0, [torch.device(5)], False)

    print("Multi GPU")
    # Then test multiple GPUs
    train_distributed()
Output:
Single GPU
cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
Multi GPU
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704
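For what it's worth, the numbers line up with exactly one and then two full copies of the parameters. A Linear(1000, 1000) has 1,001,000 float32 parameters; assuming the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes (my understanding of its granularity), that gives exactly the 4,004,352 bytes reported before wrapping, and the DDP case ends at exactly twice that:

# Back-of-the-envelope check of the reported numbers. The 512-byte
# rounding is an assumption about the caching allocator's granularity.
def round_up(nbytes, align=512):
    return ((nbytes + align - 1) // align) * align

weight = round_up(1000 * 1000 * 4)   # float32 weight matrix -> 4,000,256 bytes
bias = round_up(1000 * 4)            # float32 bias vector   ->     4,096 bytes
print(weight + bias)                 # 4004352  (single-GPU / pre-DDP value)
print(2 * (weight + bias))           # 8008704  (value after wrapping in DDP)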
I also tried rewriting this snippet to use the command-line launcher, torch.distributed.launch, and saw the same issue.
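For reference, the launch-based version looked roughly like the following (a sketch rather than the exact script I ran; the file name and argument parsing are illustrative, and it uses local_rank directly as the device index instead of the 5/6 mapping above). It was started with python -m torch.distributed.launch --nproc_per_node=2 ddp_memory.py:

# Sketch of the torch.distributed.launch variant; the launcher sets
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE and passes --local_rank.
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

device_id = torch.device(args.local_rank)
torch.cuda.set_device(device_id)

model = torch.nn.Linear(1000, 1000).to(device_id)
print(device_id, torch.cuda.memory_allocated(device_id))

# init_process_group reads the env:// settings written by the launcher
dist.init_process_group("gloo")
model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
print(device_id, torch.cuda.memory_allocated(device_id))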