Gradient accumulation with Accelerate

    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps) 
    effective_batch_size = args.batch_size // args.gradient_accumulation_steps

    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=effective_batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        drop_last=True,
    )

    for epoch in range(init_epoch, args.num_epoch + 1):
        #model.train()
        for iteration, (x, y) in enumerate(data_loader):
            x_0 = x.to(device, dtype=dtype, non_blocking=True)
            y = None if not use_label else y.to(device, non_blocking=True)
            #model.zero_grad()
            if is_latent_data:
                z_0 = x_0 * args.scale_factor
            else:
                z_0 = first_stage_model.encode(x_0).latent_dist.sample().mul_(args.scale_factor)
            # sample t
            t = torch.rand((z_0.size(0),), dtype=dtype, device=device)
            t = t.view(-1, 1, 1, 1)
            z_1 = torch.randn_like(z_0)
            # 1 is real noise, 0 is real data
            z_t = (1 - t) * z_0 + (1e-5 + (1 - 1e-5) * t) * z_1
            u = (1 - 1e-5) * z_1 - z_0
            # estimate velocity
            v = model(t.squeeze(), z_t, y)
            loss = F.mse_loss(v, u)
            with accelerator.accumulate(model):
                loss = loss.mean()
                accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            model.zero_grad()
            global_step += 1
            log_steps += 1
            optimizer.zero_grad()

It seems that I haven’t successfully invoked gradient accumulation. If I understand correctly, lowering the per-step batch size from 128 to 32 and setting gradient accumulation to 4 should keep the effective batch size at 128 while letting a typical GPU, including a Colab GPU, run the training.
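
For reference, the arithmetic I have in mind (assuming args.batch_size = 128 and args.gradient_accumulation_steps = 4) is:

    per_step_batch = 128 // 4              # 32 samples held in memory per forward/backward pass
    effective_batch = per_step_batch * 4   # 128 samples should contribute to each optimizer.step()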

However, if training still runs out of memory, that would indicate that gradient accumulation hasn’t been invoked correctly. My program does run successfully, but I believe that is only because of this line: effective_batch_size = args.batch_size // args.gradient_accumulation_steps. By itself it just shrinks the batch size handed to the DataLoader, which is not the same thing as gradient accumulation.
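
For comparison, my understanding of the setup described in the Accelerate gradient-accumulation docs looks roughly like this (a sketch adapted to my variable names; the accelerator.prepare() call is something my current script does not have, so I am assuming it is part of what is missing):

    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
    micro_batch_size = args.batch_size // args.gradient_accumulation_steps  # e.g. 128 // 4 = 32
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=micro_batch_size,  # the DataLoader yields micro-batches
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        drop_last=True,
    )
    # model, optimizer, dataloader, and scheduler are wrapped so Accelerate can manage them
    model, optimizer, data_loader, scheduler = accelerator.prepare(
        model, optimizer, data_loader, scheduler
    )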

I’m unsure whether that batch-size division is necessary, or how to modify the code so that gradient accumulation is actually invoked. Some have suggested that I comment out model.zero_grad() inside the for loop, since it clears all accumulated gradients; however, commenting it out doesn’t seem to have any effect.
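
And this is the training-loop pattern I believe the docs intend, with optimizer.step(), scheduler.step(), and optimizer.zero_grad() all inside the accumulate() context (again just a sketch using my variable names, not tested):

    for iteration, (x, y) in enumerate(data_loader):
        with accelerator.accumulate(model):
            # ... compute z_t, u, and the velocity estimate v exactly as above ...
            loss = F.mse_loss(v, u)
            accelerator.backward(loss)   # per the docs, also scales the loss for accumulation
            optimizer.step()             # skipped by Accelerate on accumulation-only iterations
                                         # (when the optimizer has been prepared)
            scheduler.step()
            optimizer.zero_grad()

Is this structure (everything inside the with block, and the DataLoader built with the micro-batch size) the correct way to get real gradient accumulation, or is something else missing?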

ref: