Register_forward_hook function usage

Hello everyone.

I want to extract an intermediate output of my network and calculate a gradient from it.
So I found the layer.register_forward_hook function.
My code is below:

global glb_feature_teacher
glb_feature_teacher = torch.tensor(torch.zeros(train_batch, num_emb), requires_grad=True, device=torch.device(device))

def Get_features4teacher(self, input, output):
    global glb_feature_teacher
    glb_feature_teacher = output.data
    return None

t_emb_layer = teacher_net.module.linear1

In the training phase,

output = net(inputs)
t_emb_layer.register_forward_hook(Get_features4teacher)
emb_teacher = torch.tensor(glb_feature_teacher, requires_grad=True, device=torch.device(device))

mse_loss = nn.MSELoss()
loss = mse_loss(emb_teacher, some_vector)
loss.backward()

The code runs well, and I've checked that all the intermediate outputs are extracted correctly (the value of emb_teacher).

But the problem is that the gradient is not calculated.

When I print out the gradient of the hooked layer, it prints None.
That means the gradients are not computed.
The code is below:

grad_of_params_teacher = {}

for name, parameter in teacher_net.named_parameters():
    grad_of_params_teacher[name] = parameter.grad

print('teacher: ', grad_of_params_teacher['module.linear1.weight'])

output: teacher: [None]

Can you tell me what the problem is?
I really don't know how to solve it.

Please help :slight_smile:

Please indent your code for better readability.
PyTorch does not expose the gradients w.r.t. intermediate layers by default; you have to use register_backward_hook to get access to them.
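For example, a minimal sketch (the layer and variable names are just illustrative):

grads = {}

def save_grad(module, grad_input, grad_output):
    # grad_output is a tuple; grad_output[0] is the gradient w.r.t. the layer's output
    grads['some_layer'] = grad_output[0]

handle = model.some_layer.register_backward_hook(save_grad)
# run the forward pass, compute the loss, call loss.backward(),
# then read grads['some_layer']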

Here is my full code.

# Register hooking function
global glb_feature_teacher
global glb_feature_student
def Get_features4teacher(self, input, output):
   global glb_feature_teacher
   glb_feature_teacher = output.data
   return None
# end
def Get_features4student(self, input, output):
   global glb_feature_student
   glb_feature_student = output.data
   return None
# end


# Parsers
parser = argparse.ArgumentParser(description='PyTorch CIFAR10 Training')
parser.add_argument('--lr', default=0.1, type=float, help='learning rate')
parser.add_argument('--resume', '-r', action='store_true', help='resume from checkpoint')
args = parser.parse_args()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
best_acc = 0  # best test accuracy
start_epoch = 0  # start from epoch 0 or last checkpoint epoch

# Data
print('==> Preparing data..')
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

train_batch = 64
num_emb = 128

trainset = torchvision.datasets.CIFAR10(root='../../data', train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=train_batch, shuffle=True, num_workers=0)

testset = torchvision.datasets.CIFAR10(root='../../data', train=False, download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=train_batch, shuffle=False, num_workers=0)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Model
print('==> Building model..')

teacher_net = ResNet50()
student_net = StudentNet()

teacher_net = teacher_net.to(device)
student_net = student_net.to(device)

if device == 'cuda':
    teacher_net = torch.nn.DataParallel(teacher_net)
    student_net = torch.nn.DataParallel(student_net)
    cudnn.benchmark = True

print('Loading teacher, student network weight file')
try:
    checkpoint_teacher = torch.load('./resnet50.t7')
    teacher_net.load_state_dict(checkpoint_teacher['net'])

except FileNotFoundError:
    print('ERROR::No pretrained teacher network file found!')
    sys.exit(1)

t_emb_layer = teacher_net.module.linear1
s_emb_layer = student_net.module.classifier1

'''=============================parameter settings=========================='''
for param in student_net.parameters():
    param.requires_grad=True
for param in teacher_net.parameters():
    param.requires_grad=False
'''==============================LOSS FUNCTION LOCATION=============================='''
mse_loss = nn.MSELoss()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(student_net.parameters(), lr=args.lr, momentum=0.9 ,weight_decay=5e-4)

glb_feature_teacher = torch.tensor(torch.zeros(train_batch, num_emb), requires_grad=False, device=torch.device(device))
glb_feature_student = torch.tensor(torch.zeros(train_batch, num_emb), requires_grad=True, device=torch.device(device))

def train(epoch):
    global glb_feature_teacher
    global glb_feature_student

    print('\nEpoch: %d' % epoch)
    student_net.train()
    teacher_net.eval()

    train_loss = 0
    correct = 0
    total = 0

    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()

        outputs_teacher = teacher_net(inputs)
        outputs_student = student_net(inputs)
        
        '''============================================================================'''
        
        t_emb_layer.register_forward_hook(Get_features4teacher)
        s_emb_layer.register_forward_hook(Get_features4student)
        
        emb_teacher = torch.tensor(glb_feature_teacher, requires_grad=False, device=torch.device(device))
        emb_student = torch.tensor(glb_feature_student, requires_grad=True, device=torch.device(device))
        
        loss_c = criterion(outputs_student, targets)
        loss_v = mse_loss(emb_student, emb_teacher)
        loss = loss_c + 0.1*loss_v

        loss.backward()

        optimizer.step()
        torch.cuda.synchronize()

        '''==========================GRADIENT CHECKING================================='''
        grad_of_params_student = {}
        for name, parameter in student_net.named_parameters():
            grad_of_params_student[name] = parameter.grad
            #print(name, parameter.grad)
            #print('checking student: ', parameter.size())

        grad_of_params_teacher = {}
        for name, parameter in teacher_net.named_parameters():
            grad_of_params_teacher[name] = parameter.grad
            #print('checking teacher: ', parameter.size())
        
        print('student: ', grad_of_params_student['module.classifier1.weight']) # for student net
        print('teacher: ', grad_of_params_teacher['module.linear1.weight']) # for teacher net
        '''============================================================================'''

        train_loss += loss.item()
        _, predicted = outputs_student.max(1) #max(1): second value returns argmax
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

And the main training loop is below:

for epoch in range(start_epoch, start_epoch+100):
    train(epoch)

The code does not compute gradients when only loss_v is applied.

How can I fix this bug?

Like the forward hooks you have created, you should create a backward hook like:

global glb_grad_teacher
def Get_grad4teacher(self, ingrad, outgrad):
   global glb_grad_teacher
   glb_grad_teacher = outgrad
   return None
t_emb_layer.register_backward_hook(Get_grad4teacher)

Then, after you have called backward(), you can look at glb_grad_teacher to see the gradient for t_emb_layer under that loss function.
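A rough self-contained sketch of the ordering (a toy model, just to show where the hook goes relative to backward()):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))

captured = {}
def save_grad(module, grad_input, grad_output):
    # grad_output[0] is the gradient w.r.t. this layer's output
    captured['grad'] = grad_output[0]

net[0].register_backward_hook(save_grad)   # register before backward()

x = torch.randn(4, 10)
loss = net(x).sum()
loss.backward()

print(captured['grad'].shape)   # gradient w.r.t. the first layer's output: (4, 5)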

Thank you @tumble-weed.

Is my usage of layer.register_forward_hook correct?
I want to calculate a loss value from the values hooked out of the middle of the network with register_forward_hook.

I've also used your register_backward_hook function, but all the values are zero.
Implemented below:

global glb_grad_student
def Get_grad4student(self, ingrad, outgrad):
   global glb_grad_student
   glb_grad_student = outgrad
   return None
glb_grad_student = torch.tensor(torch.zeros(train_batch, num_emb), requires_grad=True, device=torch.device(device))

in the train function,

        loss.backward()
        s_emb_layer.register_backward_hook(Get_grad4student)
        print(glb_grad_student)

and the output is

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', requires_grad=True)

What’s wrong in my code?

I think you used the code directly. Looking at your previous post, it seems that emb_teacher has requires_grad=False; maybe you want to use the backward hook on the student instead. I hope you are getting the technique behind register_backward_hook, so you can adapt it to your use cases.

emb_teacher is not needed for the gradient calculation, so I've set its requires_grad to False.
Only emb_student needs gradients.

for param in student_net.parameters():
    param.requires_grad=True
for param in teacher_net.parameters():
    param.requires_grad=False

hmm, do you see the loss changing when you train?

Nope, sadly… Nothing happened.

OK, that's in line with the student grads being all zero; if it wasn't, I'd be confused. Now all we have to figure out is why the student grad is 0.

Try making a fake loss like loss_fake = emb_student.sum(). Doing backward with this simple loss, you should expect non-zero gradients to propagate to emb_student. If they don't, then there is a snag somewhere.
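Something like this (a quick sketch, using the variables from your training loop):

# Sanity check: backprop a trivial loss through the hooked embedding
loss_fake = emb_student.sum()
loss_fake.backward()
print(emb_student.grad)   # should be non-zero (all ones for .sum()) if gradients reach emb_student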

Also, I found out from here that you don't need register_backward_hook; you can just do emb_student.retain_grad() to see if it is getting a gradient. Try this method as well, to cross-check.
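Roughly:

emb_student.retain_grad()          # keep the gradient on this (possibly non-leaf) tensor
loss = loss_c + 0.1 * loss_v
loss.backward()
print(emb_student.grad)            # populated only if emb_student is actually part of the graph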

The same thing happened.
All the gradients are zero. If I use the output of the network directly, not from register_forward_hook, the gradients are not zero (checked).
Could you look at my new question here?

It has the same content, but maybe it's easier to understand what I want to say.

Thank you, @tumble-weed