I tried all the suggested solutions (setting static graph, unused parameters), but no luck:
single-GPU training runs fine, but I get this error in multi-GPU training.
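For context, a minimal sketch of what I mean by those settings (assuming a torchrun launch so LOCAL_RANK is set; the model here is just a placeholder):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch so LOCAL_RANK is set in the environment.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 10).to(local_rank)  # placeholder model

# Option 1: let DDP search each iteration for parameters that got no gradient.
ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)

# Option 2 (alternative): declare the graph static so DDP can cache which
# parameters are used; the static_graph argument exists in newer releases.
# ddp_model = DDP(model, device_ids=[local_rank], static_graph=True)
```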
Hello, I am having the same problem with CAGrad. How did you solve it?
Does this work with MMDistributedDataParallel?
I don’t know what MMDistributedDataParallel is or how it differs from DistributedDataParallel.
Does DDP support multiple losses? It seems like calling loss1.backward(retain_graph=True) and then loss2.backward() wouldn’t work because of checkpointing. I’m getting the same RuntimeError.
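A minimal sketch of the usual single-backward workaround, summing the losses so DDP only runs its gradient reduction once per iteration; all names here (ddp_model, criterion1, criterion2, inputs, targets, optimizer) are placeholders from a generic training loop:

```python
# Placeholder names throughout; these come from the surrounding training loop.
out1, out2 = ddp_model(inputs)
loss1 = criterion1(out1, target1)
loss2 = criterion2(out2, target2)

# Summing the losses and calling backward() once lets DDP reduce all
# gradients in a single backward pass, instead of
# loss1.backward(retain_graph=True) followed by loss2.backward().
(loss1 + loss2).backward()
optimizer.step()
optimizer.zero_grad()
```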
I encountered the same issue using the Accelerate library from Hugging Face. Could someone explain the root cause of this issue, please?
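For reference, a minimal sketch of how the find_unused_parameters flag can be forwarded through Accelerate via its DistributedDataParallelKwargs handler (model, optimizer, dataloader are placeholders):

```python
from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# Forward DDP's find_unused_parameters flag through Accelerate's kwargs
# handler; model, optimizer, dataloader are placeholders.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```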
Why would an unused parameter cause the problem only with DDP?
DDP reduces the gradients across all ranks during the backward pass and thus expects valid .grad attributes for all properly registered parameters.
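As a toy sketch of how this comes up, a model with a registered branch that never contributes to the loss leaves those parameters without gradients:

```python
import torch.nn as nn

class TwoHeadModel(nn.Module):
    """Toy model where one head never contributes to the loss."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(10, 10)
        self.head_a = nn.Linear(10, 1)
        self.head_b = nn.Linear(10, 1)  # never used in forward

    def forward(self, x):
        feat = self.backbone(x)
        return self.head_a(feat)  # head_b's parameters get no gradient

# On a single GPU this trains fine, but DDP's reducer waits for head_b's
# gradients during the all-reduce and raises the runtime error unless the
# model is wrapped with find_unused_parameters=True (or head_b is removed).
```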