Gradient accumulation slowing convergence

Hello,

I use a 3D U-Net (with Conv3d, GroupNorm, and Dropout3d layers) for medical image segmentation; the input size is 256x240x256x1.
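
For reference, here is a simplified sketch of one block of that kind of network (channel counts, group count, and dropout rate are placeholders, not my exact configuration):

# Simplified sketch of one Conv3d + GroupNorm + Dropout3d block;
# the layer sizes below are placeholders, not the exact configuration.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=4, num_channels=16),
    nn.ReLU(inplace=True),
    nn.Dropout3d(p=0.1),
)

x = torch.randn(1, 1, 256, 240, 256)  # (N, C, D, H, W) for one 256x240x256x1 volume
y = block(x)                          # -> (1, 16, 256, 240, 256)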

To train the model, I use mixed precision training with gradient accumulation, following the CUDA Automatic Mixed Precision examples — PyTorch 1.13 documentation:

import torch
from torch import autocast
from torch.cuda.amp import GradScaler

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast(device_type='cuda', dtype=torch.float16):
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
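
For context, the optimizer only steps once per accumulation window, so each update corresponds to a larger effective batch and there are fewer updates per epoch. The numbers below are just an illustration, not my actual settings:

# Illustration with placeholder numbers: gradient accumulation trades
# fewer optimizer steps per epoch for a larger effective batch per step.
micro_batch_size = 1             # volumes per forward/backward pass
iters_to_accumulate = 4          # placeholder accumulation window
batches_per_epoch = 400          # placeholder dataset size

effective_batch_size = micro_batch_size * iters_to_accumulate          # 4
optimizer_steps_per_epoch = batches_per_epoch // iters_to_accumulate   # 100
print(effective_batch_size, optimizer_steps_per_epoch)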

I find that training with gradient accumulation converges more slowly.
Has anyone encountered this situation?