Hi, I ran into a problem during training. For each batch the dataloader produces, I first run the data through the model in eval mode under torch.no_grad() to extract some information (attention rollout masks), and then feed the same data to the model again for the actual training step. But the training behavior is quite different between the two setups: when I skip the extra eval pass and only run the training pass, the model trains properly. When I add the eval pass, the model no longer trains well, as if some gradient issue happened. What is the problem?
Main code:
for data in dataloader:
    # Move each model's (input, target) pair to the GPU.
    for i in range(len(model)):
        data[i][0] = data[i][0].to(device, non_blocking=True)
        data[i][1] = data[i][1].to(device, non_blocking=True)

    with torch.cuda.amp.autocast():
        # Extra eval pass: compute an attention rollout mask per model
        # (VITAttentionRollout runs the model under torch.no_grad()).
        mask = []
        for i in range(len(model)):
            attention_rollout = VITAttentionRollout(model[i], head_fusion='max', discard_ratio=0.9)
            mask.append(attention_rollout(data[i][0]))
            del attention_rollout

        # Actual training pass on the same data.
        for i in range(len(model)):
            model[i].train(mode=set_training_mode)  # set_training_mode is defined elsewhere in my script
            output = model[i](data[i][0])
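In case it helps narrow things down, here is a small diagnostic sketch of how the mode and grad flags could be checked right before the training forward pass (the prints are only for illustration, not part of my actual code):

# Diagnostic sketch: inspect training mode and gradient state per model
# just before the training forward pass.
for i in range(len(model)):
    p = next(model[i].parameters())
    print(i,
          model[i].training,         # should be True after model[i].train(...)
          p.requires_grad,           # parameter still tracks gradients?
          torch.is_grad_enabled())   # no stray no_grad context active?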
VITAttentionRollout code:
class VITAttentionRollout:
    def __init__(self, model, attention_layer_name='attn_drop', head_fusion="mean",
                 discard_ratio=0.9):
        self.model = model
        self.head_fusion = head_fusion
        self.discard_ratio = discard_ratio
        self.attentions = []  # attention maps collected during the forward pass
        self.handles = []     # forward-hook handles, kept so they can be removed
        self.attention_layer_name = attention_layer_name

    def __call__(self, input_tensor):
        self.attentions = []
        # Note: this switches the shared model to eval mode for the rollout pass.
        self.model.eval()
        with torch.no_grad():
            output = self.model(input_tensor)
        # (hook registration and the rollout computation are omitted here)
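The snippet above stops before the hook logic. In the standard attention-rollout implementations this pattern comes from, the attentions list is filled by forward hooks registered on every module whose name contains attn_drop; a sketch under that assumption (get_attention and _register_hooks are illustrative names, and _register_hooks is a hypothetical helper, not my exact code):

    # Sketch of the omitted hook plumbing, as extra methods on the class.
    def get_attention(self, module, input, output):
        # Forward hook: record the attention map each attn_drop module produces.
        self.attentions.append(output.cpu())

    def _register_hooks(self):
        for name, module in self.model.named_modules():
            if self.attention_layer_name in name:
                # Keep the handles so the hooks can later be removed via handle.remove().
                self.handles.append(module.register_forward_hook(self.get_attention))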