Gradcheck checks a single function (or a composition of functions) for correctness, e.g. when you are implementing new functions and their derivatives.
For your application, which sounds more like “I have a network, where does the funny business occur?”, Adam Paszke’s script for finding bad gradients in the computational graph might be a better starting point. Check out the thread below.
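For reference, gradcheck compares autograd’s analytical gradients against numerical finite differences. A minimal sketch (the function here, a squared sum, is just a stand-in for whatever custom op you are testing):

```python
import torch
from torch.autograd import gradcheck

# gradcheck needs double-precision inputs with requires_grad=True;
# it compares autograd's analytical gradients to finite differences.
def my_op(x):  # stand-in for the custom function you are testing
    return (x * x).sum()

x = torch.randn(4, dtype=torch.double, requires_grad=True)
print(gradcheck(my_op, (x,)))  # True when the gradients agree
```

If the analytical and numerical gradients disagree, gradcheck raises an error describing the mismatch instead of returning False.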

def is_bad_grad(grad_output):
    if grad_output.requires_grad == True:
        print('grad_output has grad')
    grad_output = grad_output.data

Error:

if grad_output.requires_grad == True:
AttributeError: 'NoneType' object has no attribute 'requires_grad'

Try 2:

def is_bad_grad(grad_output):
    if grad_output.requires_grad == False:
        print('grad_output does not have grad')
    grad_output = grad_output.data

Error:

grad_output does not have grad
grad_output does not have grad
grad_output does not have grad
... (message repeated for every hook call) ...
grad_output does not have grad
Traceback (most recent call last):
    if grad_output.requires_grad == False:
AttributeError: 'NoneType' object has no attribute 'requires_grad'

Please give more details, so that I can debug this issue.

It seems that something in your setup makes your output not require gradients as much as one would expect. This could happen because the network is in .eval() instead of .train() mode, because requires_grad = False was set manually, because of volatile, or for an entirely different reason…
If you had a minimal demo of how it happens, it would be easier to find out why it is not working.
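As a toy illustration of one such case (a sketch, not the poster’s code): detaching a tensor, like setting requires_grad = False, cuts the graph, so nothing upstream of that point can receive a gradient:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).detach()  # detach() (or requires_grad=False) cuts the graph here
z = y.sum()

print(z.requires_grad)  # False: backward() from z would reach no parameters
```

A backward hook registered upstream of such a cut would then see a None grad_output, which matches the AttributeError above.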

I use a simple trick: I record the average gradients per layer in every training iteration and then plot them at the end. If the average gradients are close to zero in the initial layers of the network, then your network is probably too deep for the gradient to flow.
So this is how I do it -

This is the function that plots:

import matplotlib.pyplot as plt

def plot_grad_flow(named_parameters):
    ave_grads = []
    layers = []
    for n, p in named_parameters:
        if (p.requires_grad) and ("bias" not in n):
            layers.append(n)
            ave_grads.append(p.grad.abs().mean())
    plt.plot(ave_grads, alpha=0.3, color="b")
    plt.hlines(0, 0, len(ave_grads) + 1, linewidth=1, color="k")
    plt.xticks(range(0, len(ave_grads), 1), layers, rotation="vertical")
    plt.xlim(xmin=0, xmax=len(ave_grads))
    plt.xlabel("Layers")
    plt.ylabel("average gradient")
    plt.title("Gradient flow")
    plt.grid(True)

Plug this function in after loss.backward() during training, as follows -

loss = self.criterion(outputs, labels)
loss.backward()
plot_grad_flow(model.named_parameters())

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

def plot_grad_flow(named_parameters):
    '''Plots the gradients flowing through different layers in the net during training.
    Can be used for checking for possible gradient vanishing / exploding problems.
    Usage: Plug this function in the Trainer class after loss.backward() as
    "plot_grad_flow(self.model.named_parameters())" to visualize the gradient flow'''
    ave_grads = []
    max_grads = []
    layers = []
    for n, p in named_parameters:
        if (p.requires_grad) and ("bias" not in n):
            layers.append(n)
            ave_grads.append(p.grad.abs().mean())
            max_grads.append(p.grad.abs().max())
    plt.bar(np.arange(len(max_grads)), max_grads, alpha=0.1, lw=1, color="c")
    plt.bar(np.arange(len(max_grads)), ave_grads, alpha=0.1, lw=1, color="b")
    plt.hlines(0, 0, len(ave_grads) + 1, lw=2, color="k")
    plt.xticks(range(0, len(ave_grads), 1), layers, rotation="vertical")
    plt.xlim(left=0, right=len(ave_grads))
    plt.ylim(bottom=-0.001, top=0.02)  # zoom in on the lower gradient regions
    plt.xlabel("Layers")
    plt.ylabel("average gradient")
    plt.title("Gradient flow")
    plt.grid(True)
    plt.legend([Line2D([0], [0], color="c", lw=4),
                Line2D([0], [0], color="b", lw=4),
                Line2D([0], [0], color="k", lw=4)],
               ['max-gradient', 'mean-gradient', 'zero-gradient'])

I have a peculiar problem. Thanks to the function provided above I was able to see the gradient flow, but to my dismay, the graphs normally show the gradient decreasing from the right side to the left side, which is as one would expect. In my case, however, the graphs show the gradient decreasing from the left side to the right side, which is clearly wrong. I would be highly grateful if somebody could tell me what’s going on with the network.

It has a convolutional block followed by an encoder and decoder. The network is fully convolutional.

I have a VGG16 class, and I wonder whether named_parameters in your function refers to model.parameters()? (model is an instance of the VGG16 class, by the way.) If your answer is ‘yes’, then I receive the error ‘too many values to unpack (expected 2)’ for the line ‘for n, p in model.parameters():’. Do you see the reason?

For any nn.Module instance, m.named_parameters() returns an iterator over (name, parameter) pairs, while m.parameters() returns an iterator over the parameters alone.
You should be able to use m.named_parameters().
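For example, with a small module (nn.Linear here, just as a stand-in for a layer of your VGG16):

```python
import torch.nn as nn

m = nn.Linear(4, 2)

# named_parameters() yields (name, parameter) pairs
print([n for n, p in m.named_parameters()])  # ['weight', 'bias']

# parameters() yields the parameter tensors alone; unpacking them as
# "for n, p in m.parameters():" tries to split each tensor along its
# first dimension, which is what raises "too many values to unpack"
print(len(list(m.parameters())))  # 2
```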

This is for a single layer GRU. I was surprised to see the gradient of the hidden state stay so small. The only thing I can think of as to why this would be the case is because the hidden state is re-initialized with each training example (and thus stays small), while the other gradients accumulate as a result of being connected to learned parameters. Does that seem correct? Is this what a plot of the gradient flow in a single layer GRU should typically look like?

Alternatively, for a 4 layer LSTM, I get the following output:

Does that seem correct? Is this what a plot of the gradient flow in a multi-layer LSTM should typically look like? The larger gradient values are from the initial epochs. I am not sure why they are so much larger to start with. Thoughts?

Just a comment. I’m trying to measure the behaviour of generative nets, specifically a UNet, so I looked at this and implemented something in TensorBoard. I realized that while one does get gradient flow graphs similar to what you show, it isn’t quite as straightforward: the larger the number of weight parameters, the lower the gradient has to be so that it does not explode. So you may want to look at the gradients on a log scale. Here are two representations. The first is similar to the code above, where x: layer number (0 through 28), y: abs mean gradient (or signed max), z: iteration; it uses SummaryWriter.add_histogram_raw(). The second has x: iteration, y: abs mean gradient; it uses .add_scalars().

Even though the middle gradients are close to zero, they still flow, and the net learns. The higher values are the first and last layers, which have ~5e2 and ~1e3 parameters, whereas the center layers have ~5e6 to ~1e7 parameters.
I tried different inits and the shape of the gradients always ends up similar, and the UNet learns well. It seems to me that this U shape is a consequence of the network architecture and shouldn’t necessarily be interpreted as wrong.
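A minimal sketch of the log-scale view suggested above; the per-layer gradient magnitudes here are made up to mimic the U shape described (large at the first/last layers, tiny in the center):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up per-layer abs mean gradients spanning several orders of magnitude
grads = np.array([1e-1, 1e-3, 1e-5, 1e-5, 1e-3, 1e-1])

plt.plot(grads, marker="o")
plt.yscale("log")  # log scale keeps the near-zero center values visible
plt.xlabel("layer")
plt.ylabel("abs mean gradient")
plt.savefig("grad_logscale.png")
```

On a linear axis the four center values would be indistinguishable from zero; on the log axis their relative sizes remain visible.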

For completeness: I don’t recommend TensorBoard, for many reasons; sure, it is easy, but it is a bit limited. But if you want these kinds of graphs…

with SummaryWriter(log_dir=log_dir, comment="GradTest", flush_secs=30) as writer:
    # ... your learning loop
    # where gradmean is np.abs(p.grad.clone().detach().cpu().numpy()).mean()
    # and _limits is the x axis, i.e. the layers
    _limits = np.array([float(i) for i in range(len(gradmean))])
    _num = len(gradmean)
    writer.add_histogram_raw(tag=netname + "/abs_mean", min=0.0, max=0.3, num=_num,
                             sum=gradmean.sum(), sum_squares=np.power(gradmean, 2).sum(),
                             bucket_limits=_limits, bucket_counts=gradmean,
                             global_step=global_step)
    # and, for the scalar version:
    _mean = {}
    for i, name in enumerate(layers):
        _mean[name] = gradmean[i]
    writer.add_scalars(netname + "/abs_mean", _mean, global_step=global_step)