Alternating Parameters in DDP

Hi Guys,

My question and context might be a bit long, but I think it will help a lot of people trying to train large models with limited GPU memory under a DDP setup. After trying a lot of things (flash-attention, fp16, etc.), I am now trying to leverage the fact that setting requires_grad=False saves GPU memory.

The most intuitive description of what I want is:

  • I have a neural network with two layers layer1 and layer2.
  • I use two separate optimizers: opt1 for layer1 and opt2 for layer2.
  • During the training process, I alternate between the two optimizers and set the layers with requires_grad=False.

The following code works perfectly well in the single-GPU case and squeezes the training just under the GPU memory limit:

# initialization
net = ...
opt1, opt2 = ..., ...

# training
for step, data in enumerate(data_loader):
    # alternating the parameter sets
    if step % 2 == 0:
        for param in net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer2.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt1
    else:
        for param in net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2
    
    loss = net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()
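As a sanity check of the property I am relying on, here is a toy single-process sketch (a tiny two-layer model, not my real net) showing that frozen parameters never receive a .grad, which is where the memory saving comes from:

```python
import torch

# Toy stand-in for the real net: two layers, freeze layer1 for this "step".
net = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 1))
layer1, layer2 = net[0], net[1]

for p in layer1.parameters():
    p.requires_grad = False  # freeze layer1 this step

loss = net(torch.randn(2, 4)).sum()
loss.backward()

# frozen params get no gradient tensors at all; active ones do
print(all(p.grad is None for p in layer1.parameters()))      # True
print(all(p.grad is not None for p in layer2.parameters()))  # True
```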

Then I try to switch to multi-GPU training with DDP as below. Please note that I re-wrap the model in DDP after each alternation of the parameter set so that DDP can correctly register and reduce the gradients.

# initialization
net = ...
ddp_net = DDP(net)
opt1, opt2 = ..., ...

for step, data in enumerate(data_loader):
    unwrap_net = unwrap(ddp_net) # remove the ddp wrapper
    # alternating the parameter sets
    if step % 2 == 0:
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt1
    else:
        for param in unwrap_net.layer2.parameters():
            param.requires_grad = True
            param.grad = torch.zeros_like(param.data)
        for param in unwrap_net.layer1.parameters():
            param.requires_grad = False
            del param.grad
        opt = opt2
    
    new_ddp_net = DDP(unwrap_net)
    del ddp_net
    ddp_net = new_ddp_net
    torch.cuda.empty_cache()
    
    loss = ddp_net(data)
    loss.backward()
    opt.step()
    opt.zero_grad()

Here’s the question: I found that [alternating the parameter sets] occupies more GPU memory than [training only layer1 or layer2 alone with DDP]. Although the difference is just 1–2GB, I can no longer squeeze the large model into my GPU ( :sob: :sob: :sob:). So I am wondering:

  • Is there anything I am doing wrong that causes this issue?
  • Are there better approaches to alternating the parameter sets under a DDP setting?
  • Considering that I am using Hugging Face’s accelerate, could my observed issue actually come from accelerate rather than PyTorch itself?

If you have read through here, my friend, I am truly grateful for your patience and help! May the force of optimization be with you!

Best,

Ziqi

hi Ziqi,

I’m not sure whether you’ve solved this problem since March 2024 :joy: but it’s actually a good question that I’m also curious about. So I asked ChatGPT; below is its answer:

Quick summary — root causes and high-level fixes

Your extra ~1–2GB of GPU memory mostly comes from three places:

  1. DDP internal state / communication buffers — wrapping a model in DistributedDataParallel registers reduction hooks and allocates buckets/communication buffers for the parameter set at wrap time. Frequently unwrapping / re-wrapping, or modifying parameters after wrapping, can leave extra allocations or mismatched hooks.

  2. Optimizer state kept on GPU — Adam/momentum/etc. states (exp_avg, exp_avg_sq, momentum buffers) live in GPU memory. Holding two optimizers or optimizer states for different parameter groups simultaneously multiplies GPU memory usage.

  3. find_unused_parameters=True and unused-parameter handling — if parameters are sometimes unused in a forward, DDP bookkeeping (or enabling find_unused_parameters) causes extra memory and CPU/GPU overhead.

Short fixes: don’t change model parameters (or requires_grad) after wrapping DDP; if you must alternate parameter groups, free or offload optimizer state for frozen groups; consider FSDP/ZeRO/optimizer-sharding for a robust long-term solution.


Why exactly you see +1–2GB

  • DDP registers hooks & buckets at wrap time. The DDP wrapper assumes the model’s parameter set is stable after wrapping — altering parameters or frequently re-wrapping can create extra internal allocations, mismatched hooks, or temporary buffers.

  • Optimizer state is expensive. Each parameter that’s optimized often has one or more state tensors. If you keep two optimizers (or two sets of optimizer states) on GPU, both sets consume GPU memory concurrently. In DDP this effect is per-process, so memory pressure increases quickly.

  • Unused-parameter detection has a cost. If some parameters are not touched in a forward, DDP either errors out or requires find_unused_parameters=True, which triggers additional bookkeeping and memory usage.
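To make the optimizer-state point concrete, here is a small single-process sketch (the helper optimizer_state_bytes is illustrative, not a PyTorch API) that measures how much memory Adam state adds on top of the parameters themselves:

```python
import torch

# Illustrative helper (not a PyTorch API): total bytes held by an optimizer's
# state tensors (exp_avg, exp_avg_sq, ... for Adam).
def optimizer_state_bytes(opt):
    total = 0
    for state in opt.state.values():
        for v in state.values():
            if isinstance(v, torch.Tensor):
                total += v.numel() * v.element_size()
    return total

model = torch.nn.Linear(1000, 1000)          # ~1M parameters
opt = torch.optim.Adam(model.parameters())
model(torch.randn(8, 1000)).sum().backward()
opt.step()                                   # Adam lazily allocates its state here

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(optimizer_state_bytes(opt) / param_bytes)  # roughly 2x the parameter memory
```

For Adam the state is roughly twice the parameter memory, per process — which is why keeping two full optimizers resident on GPU hurts so quickly.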


Immediate, practical ways to reduce GPU memory (ranked)

  1. Release / move optimizer state for frozen params — when you freeze a parameter group, move its optimizer state tensors to CPU (or delete them). This often yields the biggest immediate memory win.

  2. Avoid frequent re-wrap/unwrapping of DDP — if you must reconstruct DDP, be sure to del old objects, gc.collect(), call torch.distributed.barrier() and torch.cuda.empty_cache() (note: empty_cache() releases PyTorch cache, not necessarily OS memory). But prefer avoiding re-wraps at runtime.

  3. Adopt FSDP / ZeRO / optimizer sharding — these solutions shard parameters/gradients/optimizer state across ranks or offload state to CPU; they are the correct long-term approach for large-model memory problems.

  4. If you can’t avoid unused parameters: use find_unused_parameters=True only when necessary, accepting the performance and memory overhead.

  5. When reconstructing DDP, fully destroy old optimizer/state — don’t retain old optimizer objects with GPU tensors.


Code snippets you can try right away

A. Move optimizer state for frozen params to CPU (or delete it):

import torch

# opt: current optimizer
# params_to_free: iterable of parameter objects (same objects used as keys in opt.state)
def offload_optimizer_state_to_cpu(opt, params_to_free):
    params_to_free = set(params_to_free)
    for p in list(opt.state.keys()):
        if p in params_to_free:
            state = opt.state[p]
            for k, v in list(state.items()):
                if isinstance(v, torch.Tensor):
                    # move tensor to CPU to free GPU memory
                    state[k] = v.detach().to('cpu')
            # alternatively, to delete the state entirely:
            # del opt.state[p]

Note: if you later re-enable training for those params you must reinitialize their optimizer state (or accept fresh state).
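If you move (rather than delete) the state, a mirror helper can bring it back just before you unfreeze the group — again a sketch with my own names, not a PyTorch API:

```python
import torch

# Illustrative mirror of offload_optimizer_state_to_cpu: move the saved state
# tensors back onto each parameter's own device before the group is trained again.
def restore_optimizer_state(opt, params_to_restore):
    params_to_restore = set(params_to_restore)
    for p, state in opt.state.items():
        if p in params_to_restore:
            for k, v in state.items():
                if isinstance(v, torch.Tensor):
                    state[k] = v.to(p.device, non_blocking=True)

# tiny CPU demo: the round trip is the same on GPU, where it actually frees memory
model = torch.nn.Linear(3, 3)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(1, 3)).sum().backward()
opt.step()  # allocate Adam state
restore_optimizer_state(opt, model.parameters())
```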

B. Properly destroy and rebuild DDP and optimizer (if you must):

import gc

# before switching groups:
del ddp_model
del optimizer
gc.collect()
torch.distributed.barrier()
torch.cuda.empty_cache()

# change requires_grad on the underlying model
for p in model.layer1.parameters():
    p.requires_grad = True
for p in model.layer2.parameters():
    p.requires_grad = False

# rebuild DDP and optimizer
ddp_model = torch.nn.parallel.DistributedDataParallel(model, ...)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, ddp_model.parameters()), lr=...)

Warning: reconstructing frequently is costly; avoid this in the training inner loop.

C. Longer-term: use FSDP / optimizer-sharding (conceptual)
Use PyTorch FSDP or ZeRO-like sharding (or fairscale/accelerate integrations) to shard parameters, gradients, and optimizer state across ranks so single-GPU memory requirements drop dramatically.
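With Hugging Face accelerate, FSDP is usually enabled through the config file rather than code. A sketch of the relevant fragment — key names vary between accelerate versions, so run `accelerate config` to generate the exact ones for yours:

```yaml
# Illustrative accelerate config fragment (verify keys against your version)
distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD   # shard params, grads, and optimizer state
  fsdp_offload_params: true            # optionally offload shards to CPU
```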


Debugging tips to find where memory goes

  • Print torch.cuda.memory_summary(device) to see allocated vs reserved memory.

  • Check nvidia-smi to inspect per-process GPU memory usage.

  • After each major step (wrap, optimizer creation, freeze/unfreeze), call gc.collect() and torch.cuda.empty_cache() and observe changes.

  • Make sure no other Python objects hold references to model tensors (lists, logging buffers, checkpoints, etc.) which prevent freeing.
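The tips above can be wrapped into a tiny helper (my own, not a PyTorch API) that you call after each major step and diff the readings:

```python
import gc
import torch

# Illustrative debugging helper: report allocated vs reserved CUDA memory with a
# tag, so you can diff readings after each major step (DDP wrap, optimizer
# creation, freeze/unfreeze). Falls back gracefully on CPU-only machines.
def log_cuda_memory(tag):
    gc.collect()
    if not torch.cuda.is_available():
        msg = f"{tag}: CUDA not available"
    else:
        torch.cuda.empty_cache()
        alloc = torch.cuda.memory_allocated() / 2**20
        reserved = torch.cuda.memory_reserved() / 2**20
        msg = f"{tag}: allocated={alloc:.1f} MiB, reserved={reserved:.1f} MiB"
    print(msg)
    return msg

log_cuda_memory("after DDP wrap")
```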


Final recommendation

  • Short-term: Inspect optimizer state and offload or delete state tensors for frozen parameter groups. Avoid holding two full optimizer states on GPU simultaneously.

  • Medium/long-term: Move to FSDP / ZeRO / optimizer-sharding if memory is a recurring bottleneck — these are the robust solutions designed for alternating parameter usage or very large models.

If you want, I can rewrite a concrete portion of your training loop to (1) offload optimizer state when freezing, or (2) show a minimal FSDP example (with accelerate or raw PyTorch FSDP). Paste the relevant code and I’ll convert it into a drop-in change.