VRAM usage increase with more gradient accumulation steps

What would cause GPU memory usage to increase with more gradient accumulation steps? Essentially, inside the following loop:

# outer loop
	optimizer.zero_grad(set_to_none=True)
	for micro_step in range(gradient_accumulation_steps):
	    batch = next(train_data_loader_iterator)
	    no_sync_enabled = micro_step < gradient_accumulation_steps - 1
	    with no_backward_sync_ctx(enabled=no_sync_enabled): # no-op for a single device
	        loss = model(**batch)
	        loss /= gradient_accumulation_steps
	        loss.backward()
	        total_loss += loss.detach()
	optimizer.step()

I assumed that changing gradient_accumulation_steps should have no impact on memory usage. Gradient accumulation shouldn't create any new tensors, and each parameter's gradient should be accumulated in place (i.e. param.grad += new_grad)?

In practice, however, whether I train on a single GPU, on multiple GPUs with DDP, or with FSDP, I consistently observe higher VRAM usage whenever the number of accumulation steps is greater than 1. Is this expected? And if not, how can I debug this "leak"?

Thanks!