Cross-posting from stackoverflow, because it wasn’t getting much attention there. There’s an open bounty, and if anyone answers over there, I’m happy to award it to you.
The question is: when I use DistributedDataParallel, I see almost exactly double the memory usage on each GPU compared to single-GPU training - it looks like two copies of the model are being stored on every GPU. Why does the model take up twice the space under DDP? Is this intended behavior? Is there a way to avoid the extra memory usage?
Here is a minimal working example.
import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch


def train(rank, gpu_list, train_distributed):
    device_id = gpu_list[rank]

    # Build the model, move it to this process's GPU, and print the
    # allocated memory on that device at each step.
    model = torch.nn.Linear(1000, 1000)
    print(device_id, torch.cuda.memory_allocated(device_id))
    model.to(device_id)
    print(device_id, torch.cuda.memory_allocated(device_id))
    print(device_id, torch.cuda.memory_allocated(device_id))

    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)

    print(device_id, torch.cuda.memory_allocated(device_id))


def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'

    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)


if __name__ == '__main__':
    # First test one GPU
    print("Single GPU")
    train(0, [torch.device(5)], False)

    print("Multi GPU")
    # Then test multiple GPUs
    train_distributed()
Output:
Single GPU
cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
Multi GPU
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704
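For what it's worth, the numbers line up with exactly one and then two full copies of the parameters. A Linear(1000, 1000) has 1,001,000 float32 parameters; assuming the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes (my understanding of its granularity), that gives exactly the 4,004,352 bytes reported before wrapping, and the DDP case ends at exactly twice that:

# Back-of-the-envelope check of the reported numbers. The 512-byte
# rounding is an assumption about the caching allocator's granularity.
def round_up(nbytes, align=512):
    return ((nbytes + align - 1) // align) * align

weight = round_up(1000 * 1000 * 4)   # float32 weight matrix -> 4,000,256 bytes
bias = round_up(1000 * 4)            # float32 bias vector   ->     4,096 bytes
print(weight + bias)                 # 4004352  (single-GPU / pre-DDP value)
print(2 * (weight + bias))           # 8008704  (value after wrapping in DDP)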
I also tried rewriting this snippet to use the command-line launcher, torch.distributed.launch, and saw the same issue.
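For reference, the launch-based version looked roughly like the following (a sketch rather than the exact script I ran; the file name and argument parsing are illustrative, and it uses local_rank directly as the device index instead of the 5/6 mapping above). It was started with python -m torch.distributed.launch --nproc_per_node=2 ddp_memory.py:

# Sketch of the torch.distributed.launch variant; the launcher sets
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE and passes --local_rank.
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

device_id = torch.device(args.local_rank)
torch.cuda.set_device(device_id)

model = torch.nn.Linear(1000, 1000).to(device_id)
print(device_id, torch.cuda.memory_allocated(device_id))

# init_process_group reads the env:// settings written by the launcher
dist.init_process_group("gloo")
model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
print(device_id, torch.cuda.memory_allocated(device_id))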