Different training results when training data is passed through the model in eval mode first

Hi, I am experimenting with the training phase. When the dataloader loads a batch for training, I first pass it through the model in eval mode with torch.no_grad() to extract some information (an attention rollout mask). Then I pass the same batch through the model again for the actual training step. But the training results are quite different. When I skip the extra information pass and only feed the data for training, the model trains properly. But with this trick, the model doesn't train well, as if some gradient issue happened… What is the problem?
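For reference, the pattern in isolation looks like this (a minimal sketch with a toy model and placeholder names, not my actual code):

    import torch

    # toy stand-in for one model; Dropout makes the train/eval difference visible
    net = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Dropout(0.5))
    x = torch.randn(4, 8)

    # extra pass: eval mode + no_grad, only to collect some information
    net.eval()
    with torch.no_grad():
        info = net(x)

    # real training pass on the same data afterwards
    net.train()
    out = net(x)
    out.sum().backward()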

main code

    mask = []
    for data in dataloader:

        # move each sub-model's inputs and targets to the GPU
        for i in range(len(model)):
            data[i][0] = data[i][0].to(device, non_blocking=True)
            data[i][1] = data[i][1].to(device, non_blocking=True)

        with torch.cuda.amp.autocast():

            # information pass: attention rollout runs in eval mode with no_grad
            for i in range(len(model)):
                attention_rollout = VITAttentionRollout(model[i], head_fusion='max', discard_ratio=0.9)
                mask.append(attention_rollout(data[i][0]))
                del attention_rollout

            # training pass on the same data
            for i in range(len(model)):
                model[i].train(mode=set_training_mode)
                output = model[i](data[i][0])
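As a quick sanity check (my own sketch, assuming the loop above), the mode of each sub-model can be printed right before the training forward pass:

    # sanity check: confirm every sub-model is back in train mode
    # before the real forward pass
    for i, m in enumerate(model):
        print(i, 'train' if m.training else 'eval')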

VITAttentionRollout code

class VITAttentionRollout:
    def __init__(self, model, attention_layer_name='attn_drop', head_fusion="mean",
                 discard_ratio=0.9):
        self.model = model
        self.head_fusion = head_fusion
        self.discard_ratio = discard_ratio
        self.attentions = []  # attention maps collected during the forward pass
        self.handles = []     # hook handles, kept so they can be removed later
        self.attention_layer_name = attention_layer_name

    def __call__(self, input_tensor):
        self.attentions = []
        # switches the whole model to eval mode; nothing in this
        # method switches it back to train mode afterwards
        self.model.eval()
        with torch.no_grad():
            output = self.model(input_tensor)
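For completeness, here is a sketch of a __call__ variant that restores whatever mode the model was in before the rollout (was_training is a name I made up, not from the original code):

    def __call__(self, input_tensor):
        self.attentions = []
        was_training = self.model.training  # remember the current mode
        self.model.eval()
        with torch.no_grad():
            output = self.model(input_tensor)
        self.model.train(mode=was_training)  # restore the previous mode
        # ... rest of the method unchanged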