Hello, I’m working on a triplet model for semantic search, and I’m running into serious problems when training it with the DistributedDataParallel module. Essentially, the model looks like this (the actual model is much more complicated, e.g., it contains a Transformer inside, but for the sake of simplicity here is a simplified scheme):
> import torch.nn as nn
>
> class TripletModel(nn.Module):
>     def __init__(self, encoder):
>         super(TripletModel, self).__init__()
>         self.encoder = encoder  # single encoder shared by all three inputs
>
>     def forward(self, x1, x2, x3):
>         # the shared encoder is called three times in one forward pass
>         anchor_emb = self.encoder(x1)
>         positive_pair_emb = self.encoder(x2)
>         negative_pair_emb = self.encoder(x3)
>         return anchor_emb, positive_pair_emb, negative_pair_emb
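Purely for illustration, the simplified model can be exercised like this (the nn.Linear stand-in for the encoder and the tensor shapes are placeholders, not my real setup):
> import torch
>
> # placeholder encoder standing in for the real Transformer; any module
> # that maps a batch of inputs to embedding vectors would do here
> encoder = nn.Linear(128, 64)
> model = TripletModel(encoder)
>
> x1 = torch.randn(4, 128)  # anchor batch
> x2 = torch.randn(4, 128)  # positive batch
> x3 = torch.randn(4, 128)  # negative batch
> anchor_emb, positive_pair_emb, negative_pair_emb = model(x1, x2, x3)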
Representations for the triplets are created by the same encoder and compared with TripletMarginLoss using cosine distance as the distance metric (a minimal sketch of this loss setup follows the error message below). I wanted to distribute the training across two GPUs (on the same node) using DistributedDataParallel, but during training I receive the following error:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons:
1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 99 with name backbone.encoder.encoder.layer.5.output.LayerNorm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
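For reference, here is the minimal sketch of the loss setup mentioned above. I am assuming nn.TripletMarginWithDistanceLoss (available since PyTorch 1.7) as the way to combine a triplet margin loss with cosine distance; the margin value is an arbitrary placeholder:
> import torch.nn as nn
> import torch.nn.functional as F
>
> # cosine distance = 1 - cosine similarity, computed per embedding pair
> def cosine_distance(a, b):
>     return 1.0 - F.cosine_similarity(a, b)
>
> # margin=0.5 is a placeholder, not my actual hyperparameter
> triplet_loss = nn.TripletMarginWithDistanceLoss(
>     distance_function=cosine_distance, margin=0.5)
>
> loss = triplet_loss(anchor_emb, positive_pair_emb, negative_pair_emb)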
Has anyone worked on a similar problem? I’m training the model using the HuggingFace Trainer API, with gradient checkpointing enabled and the find_unused_parameters parameter of DistributedDataParallel set to False (I also have a check inside the training loop that looks for unused parameters, so there shouldn’t be any). Most of the discussions I’ve found online (e.g., does Gradient checkpointing support multi-gpu ? · Issue #63 · allenai/longformer · GitHub) point to this particular parameter (find_unused_parameters), but it is already set to False, and I’m certain there are no unused parameters.
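To make the setup concrete, here is a rough sketch of my configuration (argument names such as gradient_checkpointing and ddp_find_unused_parameters assume a recent transformers version; output_dir and train_dataset are placeholders):
> from transformers import Trainer, TrainingArguments
>
> training_args = TrainingArguments(
>     output_dir="./output",             # placeholder path
>     gradient_checkpointing=True,       # enabling this triggers the first error
>     ddp_find_unused_parameters=False,  # forwarded to DistributedDataParallel
> )
> trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
> trainer.train()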
Additionally, if I turn off gradient_checkpointing, I receive the following error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [4, 447]] is at version 3; expected version 2 instead.
Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The backtrace points to the token_type_embeddings embedding layer in the RoBERTa model of HuggingFace Transformers (the modeling_roberta.py module):
[...]lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 132, in forward
token_type_embeddings = self.token_type_embeddings(token_type_ids)
[...]
[...]lib/python3.6/site-packages/torch/nn/functional.py", line 2043, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
During training on a single GPU I receive no errors, and training progresses normally.
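For completeness, a typical single-node, two-GPU launch for this kind of Trainer script looks like the following (train.py is a placeholder script name):
> # typical single-node, two-GPU launch; train.py is a placeholder script name
> python -m torch.distributed.launch --nproc_per_node=2 train.py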