Check gradient flow in network

For completeness: I don't recommend TensorBoard, for several reasons. Sure, it is easy, but it is a bit limited. Still, if you want graphs in this style…

import numpy as np
from torch.utils.tensorboard import SummaryWriter

with SummaryWriter(log_dir=log_dir, comment="GradTest", flush_secs=30) as writer:
    # ... your learning loop
    _limits = np.array([float(i) for i in range(len(gradmean))])  # x axis: one bucket per layer
    _num = len(gradmean)
    writer.add_histogram_raw(tag=netname + "/abs_mean", min=0.0, max=0.3, num=_num,
                             sum=gradmean.sum(), sum_squares=np.power(gradmean, 2).sum(),
                             bucket_limits=_limits, bucket_counts=gradmean, global_step=global_step)
    # where gradmean is an array with one entry per layer,
    # each entry being np.abs(p.grad.clone().detach().cpu().numpy()).mean()
    # _limits is the x axis, i.e. the layers
    # and, to also get a per-layer scalar plot:
    _mean = {}
    for i, name in enumerate(layers):
        _mean[name] = gradmean[i]
    writer.add_scalars(netname + "/abs_mean", _mean, global_step=global_step)
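In case it helps, here is a minimal sketch of how layers and gradmean could be assembled before the snippet above; the helper name collect_abs_grad_means and the choice to skip bias parameters are my own assumptions, not part of the original code.

    import numpy as np

    def collect_abs_grad_means(model):
        # hypothetical helper: gather one mean-absolute-gradient value per weight parameter
        layers, means = [], []
        for name, p in model.named_parameters():
            if p.requires_grad and p.grad is not None and "bias" not in name:
                layers.append(name)
                means.append(p.grad.abs().mean().item())
        return layers, np.array(means)

    # after loss.backward():
    # layers, gradmean = collect_abs_grad_means(model)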

[image: gradients_3 (gradient flow plot)]
@RoshanRane @alwynmathew
I got this graph by using the code above and I am not sure exactly how to interpret it. Does it mean that most of the layers have an average gradient equal to the max gradient, and hence that there is an exploding gradient issue? Any thoughts?


May I ask what the proper value range for the mean of the gradient is? In my experiments, the mean gradient is sometimes larger than 0.05.

Hello RoshanRane, when I run your code, the following error occurs:

ave_grads.append(p.grad.abs().mean())
AttributeError: 'NoneType' object has no attribute 'abs'

Do you know how to solve it?


To get rid of the error, check p.grad for None before logging. You might either log 0 for these cases or just skip it (depending on what you deem appropriate).
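For example, here is a minimal sketch of that check, reusing the ave_grads/max_grads names from the plotting loop discussed earlier in the thread (logging 0 is just one of the two options mentioned above):

    ave_grads, max_grads, layers = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        layers.append(name)
        if p.grad is None:
            # no gradient was produced for this parameter; log 0 (or skip it instead)
            ave_grads.append(0.0)
            max_grads.append(0.0)
        else:
            ave_grads.append(p.grad.abs().mean().item())
            max_grads.append(p.grad.abs().max().item())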

Now, the reason this happens is likely that one of your parameters has not been used in the computation in a way that requires a gradient. This could be by design of your network and benign, or it could be a problem in your implementation.
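A tiny self-contained example of the benign case (the unused layer here is made up purely for illustration):

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.used = nn.Linear(4, 1)
            self.unused = nn.Linear(4, 1)  # defined but never called in forward

        def forward(self, x):
            return self.used(x)

    net = Net()
    net(torch.randn(2, 4)).sum().backward()
    print(net.used.weight.grad is None)    # False: a gradient was computed
    print(net.unused.weight.grad is None)  # True: calling .abs() on it raises the error above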

Hello. Why is it that "the larger the number of weight parameters, the lower the gradient has to be so it does not explode"?

Handwavy answer: latents propagate forwards, gradients propagate backwards. There are adverse effects in either direction, and they depend on what the differentiable function is. In the forward direction, a convolution, for example, is a weighted sum of projections of the weights onto the latents. The wider the layer, the higher the absolute values of the following latents; within a few steps you can reach inf.

A formal discussion of this topic can be found in Glorot & Bengio (2010), Understanding the difficulty of training deep feedforward neural networks, and Ioffe & Szegedy (2015), Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

If you want to see this explode on your own, take a model with and without batch norm and train it for a few steps at a high learning rate. Batch norm keeps the latents from exploding.
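A rough sketch of such an experiment (the architecture, learning rate, and random data below are arbitrary choices of mine; depending on seed and depth you may need to push them further to actually hit inf/nan):

    import torch
    import torch.nn as nn

    def make_mlp(use_bn, width=64, depth=8):
        layers = []
        for _ in range(depth):
            layers.append(nn.Linear(width, width))
            if use_bn:
                layers.append(nn.BatchNorm1d(width))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(width, 1))
        return nn.Sequential(*layers)

    torch.manual_seed(0)
    x, y = torch.randn(128, 64), torch.randn(128, 1)

    for use_bn in (False, True):
        model = make_mlp(use_bn)
        opt = torch.optim.SGD(model.parameters(), lr=1.0)  # deliberately too high
        for _ in range(20):
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(f"batch norm: {use_bn}, final loss: {loss.item():.3e}")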



Hi, I’m getting this error while running this on my script.

Can you please give me some suggestions on how to convert the tensor to a NumPy array?

The error message gives the suggestion to push the tensor to the CPU first via tensor.cpu().
Did you try to use this suggestion?

I’m pretty new to PyTorch and don’t have much of an idea about this. I changed this part of the code:

            ave_grads.append(p.grad.abs().mean())
            max_grads.append(p.grad.abs().max())

to

            ave_grads.append(p.cpu().grad.abs().mean())
            max_grads.append(p.cpu().grad.abs().max())

But it’s giving the 'NoneType' object has no attribute 'abs' error.
Am I doing something wrong here?

It seems this particular parameter doesn’t have a valid gradient stored in its .grad attribute.
This would be the case if e.g. this parameter was never used in the forward method or if you haven’t yet calculated the gradients via a backward call.

[tensor(0.0003, device='cuda:0'), tensor(0.0002, device='cuda:0'), tensor(0.0003, device='cuda:0'), tensor(0.0002, device='cuda:0'), tensor(0.0002, device='cuda:0'), tensor(0.0002, device='cuda:0'), tensor(0.0006, device='cuda:0'), tensor(0., device='cuda:0'), tensor(4.3869, device='cuda:0'), tensor(0., device='cuda:0'), tensor(3.9132, device='cuda:0'), tensor(7.3027e-05, device='cuda:0'), tensor(0.0018, device='cuda:0'), tensor(0.1991, device='cuda:0'), tensor(2.9446e-05, device='cuda:0'), tensor(0.0005, device='cuda:0'), tensor(0.0356, device='cuda:0'), tensor(4.6311e-05, device='cuda:0'), tensor(0.0013, device='cuda:0'), tensor(0.0667, device='cuda:0'), tensor(1.4949e-05, device='cuda:0'), tensor(0.0002, device='cuda:0'), tensor(0.0109, device='cuda:0'), tensor(0.0187, device='cuda:0'), tensor(0.0050, device='cuda:0'), tensor(0.0207, device='cuda:0'), tensor(0.0016, device='cuda:0'), tensor(0.0165, device='cuda:0')]

This is what ave_grads returns when I try to print it. Does it help in any way?

Unfortunately, this doesn’t really help as it doesn’t show which parameter raises the None gradients.

I actually got it fixed by adding .cpu() at the end.

            ave_grads.append(p.grad.abs().mean().cpu())
            max_grads.append(p.grad.abs().max().cpu())

Add .cpu().detach().numpy() to your code when you append the value to the list.

This issue is caused because you are trying to hand data that is in GPU memory to NumPy.
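Continuing with the names from the earlier snippet, that would look something like this:

    value = p.grad.abs().mean().cpu().detach().numpy()  # move to CPU, detach from the graph, convert
    ave_grads.append(value)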

Hi Hao,
Were you able to resolve that error? If yes, could you please tell me how?