Hi, I'm finding it quite difficult to use DDP to train a model that has an additional loss computed outside the forward function.
Training Procedure
The model (M, based on ProxylessNAS) has two sets of parameters:
- neural network weights W
- architecture parameters (operator weights) A
The steps to update A are:
- randomly sample a sub-network Msub, with parameters Wsub, based on the probability matrix A
- loss1 = Msub(data)
- loss2 is calculated directly from A, e.g. Latency(A) = 3*A01 + 2*A02
- loss = f(loss1, loss2); loss.backward()
- update A… (see the sketch after this list)
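To make the procedure concrete, here is a minimal, single-process sketch of one such update step. SuperNet, latency_of, and all shapes/coefficients are hypothetical placeholders rather than the actual ProxylessNAS code, and the binary-gate trick that also passes gradients from loss1 to A is omitted for brevity.

```python
# Hypothetical sketch of one update step (names and numbers are made up).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperNet(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        # Candidate operators of a single mixed layer; their weights form W.
        self.ops = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        )
        # Architecture parameters A: one logit per candidate operator.
        self.arch_params = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        # Randomly sample one operator according to softmax(A); only its
        # weights (Wsub) participate in this forward pass.
        probs = F.softmax(self.arch_params, dim=0)
        idx = torch.multinomial(probs, 1).item()
        return self.ops[idx](x)

def latency_of(arch_params):
    # loss2 is computed directly from A, e.g. Latency(A) = 3*A01 + 2*A02
    # (the coefficients below are arbitrary placeholders).
    coeffs = torch.tensor([1.0, 3.0, 2.0], device=arch_params.device)
    return (F.softmax(arch_params, dim=0) * coeffs).sum()

model = SuperNet()
data, target = torch.randn(4, 16, 8, 8), torch.randn(4, 16, 8, 8)

loss1 = F.mse_loss(model(data), target)   # loss1 comes from the sampled sub-network
loss2 = latency_of(model.arch_params)     # loss2 uses A outside of forward()
loss = loss1 + 0.1 * loss2                # loss = f(loss1, loss2)
loss.backward()                           # then update A with its optimizer
```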
The Problem
I set find_unused_parameters=True and it raises an error during backward propagation:
RuntimeError: Expected to mark a variable ready only once. This error is caused by use of a module parameter outside the `forward` function. The return value of the `forward` function is inspected by the distributed data parallel wrapper to figure out if any of the module's parameters went unused. If this is the case, it knows they won't receive gradients in a backward pass. If any of those parameters are then used outside `forward`, this error condition is triggered. You can disable unused parameter detection by passing the keyword argument `find_unused_parameters=False` to `torch.nn.parallel.DistributedDataParallel`.
The problem is that since only a part of the model (Msub) is used in each iteration, DDP won't get gradients for the parameters that do not belong to Wsub. If I set find_unused_parameters=False instead, it crashes in the next forward pass. A minimal sketch of the kind of setup that hits this error is below.
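The following is only a rough reproduction sketch: it reuses the hypothetical SuperNet and latency_of placeholders from the snippet above, uses the gloo backend so it also runs on CPU, and assumes the script is launched with torchrun so the process-group environment variables are already set.

```python
# Minimal reproduction sketch (reuses the hypothetical SuperNet / latency_of
# definitions from the previous snippet; launch with torchrun).
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")

model = SuperNet()
ddp_model = DDP(
    model,
    # True  -> "Expected to mark a variable ready only once" during backward,
    #          because A receives its gradient from loss2, outside forward().
    # False -> crashes on the next iteration, because the weights of the
    #          operators that were not sampled never receive gradients.
    find_unused_parameters=True,
)

data, target = torch.randn(4, 16, 8, 8), torch.randn(4, 16, 8, 8)

loss1 = F.mse_loss(ddp_model(data), target)
loss2 = latency_of(ddp_model.module.arch_params)  # A used outside forward()
(loss1 + 0.1 * loss2).backward()                  # error is raised here
```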
Does anyone have any idea how to solve this problem?