All gradients are zero except for the last layer

Hi,

I am training a network for a video classification task using cross-entropy loss. The issue is that after many epochs the network's accuracy stays the same (1%) and the loss does not come down. I inspected the problem and noticed that the gradients are non-zero only for the layer right before the loss calculation. I also made sure that the requires_grad flag is True for all network parameters.

Here is my code:

import torch.nn as nn
import torch.optim as optim

optimizer = optim.Adam(net.parameters(), lr=args.lr)
criterion = nn.CrossEntropyLoss()

for epoch in range(args.start_epoch, args.epochs):
    for i, data in enumerate(train_loader):
        frames, labels = data
        frames, labels = frames.cuda(), labels.cuda()
        inputs = frames
        optimizer.zero_grad()              # clear gradients from the previous step
        outputs = net(inputs)              # forward pass
        loss = criterion(outputs, labels)
        loss.backward()                    # backward pass
        optimizer.step()                   # update parameters

I am pretty sure the problem is not with the optimizer, since the issue appears right after starting the training session, even before the first optimizer step is taken.

After the first backpropagation, list(net.parameters())[-1] has non-zero gradients, which corresponds to the bias of the last fully connected layer, but the gradients of all the other parameters are zero.
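
For example, the per-parameter gradients can be inspected right after the first backward pass like this (a rough sketch, reusing net, criterion, frames and labels from the loop above):

outputs = net(frames)
loss = criterion(outputs, labels)
loss.backward()

# print the gradient norm of every parameter by name
for name, param in net.named_parameters():
    grad_norm = param.grad.norm().item() if param.grad is not None else float("nan")
    print(f"{name:40s} grad norm: {grad_norm:.6f}")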

I would appreciate any suggestions about why I am having this issue.

Thanks in advance.

Hi,

This may or may not apply to your case, but I did face the same issue, and here is the cause I found. I had requires_grad set to True for all layers except one (it can be the last layer or a middle one), and the rest of my gradient values were zero, just like yours. Since the gradient is computed with respect to all the network parameters, the layer with requires_grad=False acts like a wall and nothing passes through it, so in theory all the layers before it end up with zero gradients, while the layers after it behave normally.
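
For what it's worth, here is a quick way to check whether any parameter was accidentally left frozen (a minimal sketch, assuming the same net as in the code above):

# list every parameter whose requires_grad flag is False
frozen = [name for name, p in net.named_parameters() if not p.requires_grad]
print("Frozen parameters:", frozen if frozen else "none")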

Hi,

Thanks for your response. In my case requires_grad was set to True for all the layers. I just noticed that I had confused the dropout probability with the keep probability: it was set to 1 for the layer before the last one, and that is why I got zero gradients for the rest of the parameters.
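
To illustrate what went wrong, here is a minimal sketch with a toy model (not my actual network): with nn.Dropout(p=1.0) placed before the final Linear layer, its output is all zeros, so every earlier parameter gets a zero gradient, and only the final layer's bias receives a non-zero gradient, because the bias gradient does not depend on that layer's (zeroed) input.

import torch
import torch.nn as nn

# hypothetical toy model reproducing the mistake: dropout with p=1.0 drops everything
toy = nn.Sequential(
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Dropout(p=1.0),   # "keep probability" confused with drop probability
    nn.Linear(10, 2),
)

out = toy(torch.randn(4, 10))
nn.CrossEntropyLoss()(out, torch.tensor([0, 1, 0, 1])).backward()

for name, p in toy.named_parameters():
    print(name, "zero grad" if p.grad.abs().sum().item() == 0 else "non-zero grad")
# Only 3.bias reports a non-zero gradient; every other parameter is zero.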

Thanks
