Finding the cause of RuntimeError: Expected to mark a variable ready only once

I’m extending a complex detectron2-based model in PyTorch that is already wrapped in DistributedDataParallel with find_unused_parameters set to True.

I’ve added a new layer to the original network that generates some additional output. Initially that layer was frozen (requires_grad = False) and everything worked fine. I later decided to unfreeze it.
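Roughly, the change amounts to this kind of freeze/unfreeze toggle (a minimal sketch; `model` and `new_head` are placeholders for the real detectron2 model and the added layer):

```python
import torch.nn as nn

# Placeholder for the added layer producing the extra output (sizes are made up).
model.new_head = nn.Linear(1024, 8)

# Initially frozen -- training under DDP worked fine:
for p in model.new_head.parameters():
    p.requires_grad = False

# After unfreezing it:
for p in model.new_head.parameters():
    p.requires_grad = True
```

Unfortunately, with the layer unfrozen, training on multiple GPUs fails with this error: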

-- Process 0 terminated with the following error:
Traceback (most recent call last):
...
  File "/xxx/trainer.py", line 585, in some_method:
    losses.backward()
  File "/xxx/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/xxx/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 
1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

Clearly, the error is related to that unfrozen layer, since everything was working before unfreezing it. This means that I need to check every change I made.

I wonder if there’s a way/trick to determine which particular tensor/operation causes this behaviour? In other words - how can I speed up the debugging process?

Asked the same question on SO here.

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

The runtime error lists the possible causes. Have you checked your changes against those two reasons?

Also, did you try the Python pdb debugger?

Yes, I checked all my changes - I’m writing this post because I can’t find the cause in my code. I also tried debugging with VS Code, if that matters.

Hi, can you first try setting the environment variable TORCH_DISTRIBUTED_DEBUG=INFO? If you need more detail, set TORCH_DISTRIBUTED_DEBUG=DETAIL. You will need a nightly build for this environment variable to take effect.
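For reference, this is one way to set it from Python before anything distributed is created (a minimal sketch; exporting the variable in the shell before launching the script works just as well):

```python
import os

# Equivalent to `export TORCH_DISTRIBUTED_DEBUG=DETAIL` in the shell.
# Set it before the process group and the DDP model are created so it takes effect.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # or "INFO" for less verbose output

import torch
import torch.distributed as dist
```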


Thanks - that helped me move forward. It turns out this feature is available in PyTorch 1.9.0 (I was using 1.7.0), so there’s no need for nightly builds.

Now I have this error:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 73 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

With TORCH_DISTRIBUTED_DEBUG set to DETAIL I also get:

Parameter at index 73 with name roi_heads.box_predictor.xxx.bias has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

So, technically, there is a problem with roi_heads.box_predictor.xxx, but I couldn’t find one. However, with version 1.9.0 the console also outputs this warning:

[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

This turned out to be the root cause. Switching find_unused_parameters to False makes training run normally. I’m not sure why that works now, but I don’t mind.


‘Parameter at index 73 with name roi_heads.box_predictor.xxx.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.’

==> Does your program use activation checkpointing somewhere?


I ran into this problem as well - find_unused_parameters=True may cause a parameter to be marked twice. The problem is that in my case find_unused_parameters=True is actually needed.

@Yanli_Zhao Sorry for my ignorance here - why is activation checkpointing related to autograd engine hooks? Thanks!


Hi! Have you solved this?
I also hit the problem described in the main post, and in my case find_unused_parameters=True is necessary, so I’m not sure how to solve it.
Thanks!

I have a setup with two models (f, g) where I’m training f(g). If I keep find_unused_parameters=True for both models, I get an error like yours. When I use find_unused_parameters=True for only the first model, the error disappears.


Hi Saurabh,

I got the same error and I’m using a similar structure. May I know how to set find_unused_parameters=True for only the first model?

Thank you very much and I’m looking forward to the reply.

David

```python
model_1 = DDP(model_1, find_unused_parameters=True, device_ids=[rank])
model_2 = DDP(model_2, find_unused_parameters=False, device_ids=[rank])
```

Hopefully, this works for you.


The traceback makes it clear that this is due to using “multiple checkpoint functions to wrap the same part of your model”, so you can solve the problem by turning off PyTorch’s gradient checkpointing.

Hi, I got the same error. I use the PCGrad algorithm for multi-task learning: I have two tasks, so I need to call backward on two losses, and the algorithm therefore computes two gradients for a shared layer (is that what “marked as ready twice” means?). How can I solve this?

You could try the no_sync context manager, which disables gradient synchronization across DDP processes and accumulates the gradients locally instead.
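A minimal sketch of that pattern for a two-loss setup (not the full PCGrad logic; `ddp_model`, the two loss functions, and the batch keys are placeholders for your own code):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(ddp_model: DDP, batch, optimizer, loss_fn_a, loss_fn_b):
    out = ddp_model(batch["x"])
    loss_a = loss_fn_a(out, batch["y_a"])
    loss_b = loss_fn_b(out, batch["y_b"])

    optimizer.zero_grad()
    # First backward pass runs with gradient synchronization disabled,
    # so DDP's reducer does not mark the shared parameters as ready yet.
    with ddp_model.no_sync():
        loss_a.backward(retain_graph=True)
    # Second backward pass outside no_sync: the locally accumulated
    # gradients are all-reduced exactly once for this iteration.
    loss_b.backward()
    optimizer.step()
```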

I set TORCH_DISTRIBUTED_DEBUG to DETAIL and then this appeared: Parameter at index 1068 with name dishsi.module.classifier.2.actlayer.conv.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

Is the parameter dishsi.module.classifier.2.actlayer.conv.bias being reused? How do I solve this? I haven’t been able to locate it in my model.


Have you dealt with it?

With Hugging Face models you can turn off the checkpointing mechanism by commenting out the call to model.gradient_checkpointing_enable().
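For example (a sketch; the checkpoint name is just a placeholder):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

# model.gradient_checkpointing_enable()  # <- comment this out before wrapping the model in DDP
model.gradient_checkpointing_disable()   # or disable it explicitly if another code path enabled it
```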


I encountered this same issue and was able to solve it by passing use_reentrant=False to torch.utils.checkpoint.checkpoint (I’m using PyTorch 2.0).

use_reentrant is set to True by default, so you need to set it to False manually. Nonetheless, this did solve my issue. Thanks.
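For anyone else hitting this, the flag is passed directly to torch.utils.checkpoint.checkpoint (a sketch; `block` is a placeholder submodule, and which value resolves the error may depend on your PyTorch version and model):

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpoint(block: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Non-reentrant checkpointing (use_reentrant=False) is the variant that
    # generally interacts better with DDP; pass the flag explicitly either way.
    return checkpoint(block, x, use_reentrant=False)
```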

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 206 with name bin_layer.bias has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.



```python
import math
from typing import Optional, Tuple
from transformers import AdamW, get_linear_schedule_with_warmup, AutoConfig
from transformers import BertForPreTraining, BertModel, RobertaModel, AlbertModel, AlbertForMaskedLM, RobertaForMaskedLM
import torch
import torch.nn as nn
import pytorch_lightning as pl
from sklearn.metrics import f1_score
from dataclasses import dataclass

@dataclass
class ModelOutput():
    loss: Optional[torch.FloatTensor] = None
    all_loss: Optional[list] = None
    loss_nums: Optional[list] = None
    prediction_logits: torch.FloatTensor = None
    seq_relationship_logits: torch.FloatTensor = None
    tri_label_logits: torch.FloatTensor = None
    reg_label_logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None

class AlignScore(nn.Module):

    def __init__(self, model='roberta-base',
                 using_pretrained=True,
                   *args, 
                   **kwargs):
        
        super(AlignScore, self).__init__() 
        # self.save_hyperparameters()
                       
        if 'roberta' in model:
            if using_pretrained:
                self.base_model = RobertaModel.from_pretrained(model)
                self.mlm_head = RobertaForMaskedLM.from_pretrained(model).lm_head
            else:
                self.base_model = RobertaModel(AutoConfig.from_pretrained(model))
                self.mlm_head = RobertaForMaskedLM(AutoConfig.from_pretrained(model)).lm_head

        self.bin_layer = nn.Linear(self.base_model.config.hidden_size, 2)


        self.dropout = nn.Dropout(p=0.1)
        self.mlm_loss_factor = 0.5
        self.need_mlm = True
        self.is_finetune = False
        self.ce_loss_fct1 =nn.CrossEntropyLoss(reduction='sum')
        self.ce_loss_fct2 =nn.CrossEntropyLoss(reduction='sum')

    def mse_loss(self, input, target, ignored_index=-100.0, reduction='mean'):
        mask = (target == ignored_index)
        out = (input[~mask]-target[~mask])**2
        if reduction == "mean":
            return out.mean()
        elif reduction == "sum":
            return out.sum()

    def forward(self,batch):
            # print(batch)
            base_model_output = self.base_model(
                input_ids = batch['input_ids'],
                attention_mask = batch['attention_mask'])
        
            prediction_scores = self.mlm_head(base_model_output.last_hidden_state) ## sequence_output for mlm
            seq_relationship_score = self.bin_layer(self.dropout(base_model_output.pooler_output)) ## pooled output for classification


            total_loss = None
            if 'mlm_label' in batch.keys():
                 
                masked_lm_loss = self.ce_loss_fct1(prediction_scores.view(-1, self.base_model.config.vocab_size), batch['mlm_label'].view(-1))
                next_sentence_loss = self.ce_loss_fct2(seq_relationship_score.view(-1, 2), batch['align_label'].view(-1)) / math.log(2)
                masked_lm_loss_num = torch.sum(batch['mlm_label'].view(-1) != -100)
                next_sentence_loss_num = torch.sum(batch['align_label'].view(-1) != -100)

            return ModelOutput(
                loss=total_loss,
                all_loss=[masked_lm_loss, next_sentence_loss, ]  if 'mlm_label' in batch.keys() else None,
                loss_nums=[masked_lm_loss_num, next_sentence_loss_num,] if 'mlm_label' in batch.keys() else None,
                prediction_logits=prediction_scores,
                seq_relationship_logits=seq_relationship_score


            )

import os
os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
from src.summary_dataset import DSTDataSet
from src.summary_align_model import AlignScore

def ddp_setup(rank, world_size):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

import json
from tqdm import tqdm

                           
def calculate_forward_loss(model_output):
    
        losses = model_output.all_loss
        loss_nums = model_output.loss_nums
    
        assert len(loss_nums) == len(losses), 'loss_num should be the same length as losses'
    
        loss_mlm_num = torch.sum(loss_nums[0])
        loss_bin_num = torch.sum(loss_nums[1])

    
        loss_mlm = torch.sum(losses[0]) / loss_mlm_num if loss_mlm_num > 0 else 0.
        loss_bin = torch.sum(losses[1]) / loss_bin_num if loss_bin_num > 0 else 0.
    
        total_loss = 0.5 * loss_mlm + loss_bin 
        
        return total_loss

def train(model,epochs,train_loader,val_loader,ckpt_dir='./'):
    
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
        for epoch in range(1, epochs + 1):
    
            with tqdm(train_loader, unit="batch") as tepoch:
                for i,batch in enumerate(tepoch):
                        
                    tepoch.set_description(f"Epoch {epoch}")
                    optimizer.zero_grad()
                    model_output = model(batch)
                    _loss = calculate_forward_loss(model_output)
                    _loss.backward()
                    # inspect the gradient of the layer reported in the error (go through DDP's .module)
                    print(model.module.bin_layer.bias.grad)

                    import pdb;pdb.set_trace()
                    
                    optimizer.step()
                    tepoch.set_postfix(train_loss=_loss.item())

def main(rank: int, 
         world_size: int, 
         save_every: int, 
         total_epochs: int, 
         batch_size: int):
    
    ddp_setup(rank, world_size)
             
    datapath='_sample_test.json'
    data = json.load(open(datapath))
    train_eval_split = 0.90
    need_mlm = True
    model_name='roberta-base'
    train_ds = DSTDataSet(dataset=data[:int(train_eval_split*len(data))], model_name=model_name, need_mlm=need_mlm)
             
    val_ds = DSTDataSet(dataset=data[int(train_eval_split*len(data)):], model_name=model_name, need_mlm=need_mlm)
             
    train_dl = DataLoader(train_ds, batch_size=batch_size, pin_memory=True,shuffle=False,sampler=DistributedSampler(train_ds))
             
    val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=False,sampler=DistributedSampler(val_ds))
             
    model = AlignScore(model_name).to(rank)
    print(model)
             
    model = DDP(model, device_ids=[rank],find_unused_parameters=True,static_graph=False)
             
    train(model, total_epochs, train_dl, val_dl)
             
    destroy_process_group()


if __name__ == "__main__":

    import argparse
    parser = argparse.ArgumentParser(description='simple distributed training job')
    parser.add_argument('--total_epochs',default=4, type=int, help='Total epochs to train the model')
    parser.add_argument('--save_every',default=1, type=int, help='How often to save a snapshot')
    parser.add_argument('--batch_size', default=4, type=int, help='Input batch size on each device (default: 4)')
    args = parser.parse_args()
    
    world_size = torch.cuda.device_count()
    print("total-gpus:",world_size)
    mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
```