Does torch.distributed support training only part of the model?

Hi, I recently tried to use the torch.distributed package to train my model. My architecture is an encoder-decoder. To speed things up, I pre-computed the features of the input data, but when I trained only the decoder by calling model.module.decoder(input), I got an error like the following:
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]
I wonder if someone can give me some suggestions. Does the torch.distributed package not work properly when part of the model doesn't participate in the computation?
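
Roughly, the failing pattern looks like this (a simplified placeholder, not my real model; the process group is assumed to be initialized already):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-in encoder-decoder; the real architecture and shapes differ.
class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, 64)
        self.decoder = nn.Linear(64, 10)

model = EncoderDecoder().cuda()
ddp_model = DDP(model, device_ids=[0])  # assumes init_process_group() already ran

# Train only the decoder on pre-computed features, bypassing the encoder:
precomputed_features = torch.randn(8, 64).cuda()
out = ddp_model.module.decoder(precomputed_features)
out.sum().backward()  # the TypeError above shows up here
```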

In v1.1, we added a new find_unused_parameters arg to DistributedDataParallel. If some of the model params are not involved in the forward pass, you can set find_unused_parameters to True.
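
A minimal sketch of how that flag is passed, assuming a hypothetical EncoderDecoder module whose forward can skip the encoder, and an already-initialized process group (local_rank and the shapes are placeholders):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical encoder-decoder whose forward can skip the encoder
# when it is fed pre-computed features.
class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, 64)
        self.decoder = nn.Linear(64, 10)

    def forward(self, x, precomputed=False):
        if not precomputed:
            x = self.encoder(x)  # encoder parameters go unused when precomputed=True
        return self.decoder(x)

# Assumes torch.distributed.init_process_group(...) has already been called by
# the launcher; local_rank is a placeholder for the GPU this process owns.
local_rank = 0
model = EncoderDecoder().to(local_rank)

ddp_model = DDP(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    find_unused_parameters=True,  # tolerate parameters skipped in the forward pass
)

# Forward through the DDP wrapper: only the decoder participates, and
# backward no longer errors out on the unused encoder parameters.
precomputed_features = torch.randn(8, 64).to(local_rank)
out = ddp_model(precomputed_features, precomputed=True)
out.sum().backward()
```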
