Gradient accumulation with Accelerate

    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps) 
    effective_batch_size = args.batch_size // args.gradient_accumulation_steps

    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=effective_batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        drop_last=True,
    )

    for epoch in range(init_epoch, args.num_epoch + 1):
        #model.train()
        for iteration, (x, y) in enumerate(data_loader):
            x_0 = x.to(device, dtype=dtype, non_blocking=True)
            y = None if not use_label else y.to(device, non_blocking=True)
            #model.zero_grad()
            if is_latent_data:
                z_0 = x_0 * args.scale_factor
            else:
                z_0 = first_stage_model.encode(x_0).latent_dist.sample().mul_(args.scale_factor)
            # sample t
            t = torch.rand((z_0.size(0),), dtype=dtype, device=device)
            t = t.view(-1, 1, 1, 1)
            z_1 = torch.randn_like(z_0)
            # 1 is real noise, 0 is real data
            z_t = (1 - t) * z_0 + (1e-5 + (1 - 1e-5) * t) * z_1
            u = (1 - 1e-5) * z_1 - z_0
            # estimate velocity
            v = model(t.squeeze(), z_t, y)
            loss = F.mse_loss(v, u)
            with accelerator.accumulate(model):
                loss = loss.mean()
                accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            model.zero_grad()
            global_step += 1
            log_steps += 1
            optimizer.zero_grad()

It seems that I haven’t successfully invoked gradient accumulation. If I understand correctly, lowering the per-step batch size from 128 to 32 and setting gradient accumulation to 4 should keep the effective batch size at 128 while letting a typical GPU, including a Colab GPU, run the training.
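
For reference, the arithmetic I have in mind (assuming args.batch_size = 128 and args.gradient_accumulation_steps = 4) is:

    per_step_batch = 128 // 4              # 32 samples held in memory per forward/backward pass
    effective_batch = per_step_batch * 4   # 128 samples should contribute to each optimizer.step()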

However, if training still runs out of memory, that would indicate that gradient accumulation hasn’t been invoked correctly. My program does run successfully, but I believe that is only because of this line: effective_batch_size = args.batch_size // args.gradient_accumulation_steps. By itself it just shrinks the batch size handed to the DataLoader, which is not the same thing as gradient accumulation.
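
For comparison, my understanding of the setup described in the Accelerate gradient-accumulation docs looks roughly like this (a sketch adapted to my variable names; the accelerator.prepare() call is something my current script does not have, so I am assuming it is part of what is missing):

    accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
    micro_batch_size = args.batch_size // args.gradient_accumulation_steps  # e.g. 128 // 4 = 32
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=micro_batch_size,  # the DataLoader yields micro-batches
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        drop_last=True,
    )
    # model, optimizer, dataloader, and scheduler are wrapped so Accelerate can manage them
    model, optimizer, data_loader, scheduler = accelerator.prepare(
        model, optimizer, data_loader, scheduler
    )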

I’m unsure whether that batch-size division is necessary, or how to modify the code so that gradient accumulation is actually invoked. Some have suggested that I comment out model.zero_grad() inside the for loop, since it clears all accumulated gradients; however, commenting it out doesn’t seem to have any effect.
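
And this is the training-loop pattern I believe the docs intend, with optimizer.step(), scheduler.step(), and optimizer.zero_grad() all inside the accumulate() context (again just a sketch using my variable names, not tested):

    for iteration, (x, y) in enumerate(data_loader):
        with accelerator.accumulate(model):
            # ... compute z_t, u, and the velocity estimate v exactly as above ...
            loss = F.mse_loss(v, u)
            accelerator.backward(loss)   # per the docs, also scales the loss for accumulation
            optimizer.step()             # skipped by Accelerate on accumulation-only iterations
                                         # (when the optimizer has been prepared)
            scheduler.step()
            optimizer.zero_grad()

Is this structure (everything inside the with block, and the DataLoader built with the micro-batch size) the correct way to get real gradient accumulation, or is something else missing?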

ref: