Back-propagated gradients vs Weight gradients?

Hi all,

I am trying to reproduce Glorot and Bengio’s paper “Understanding the difficulty of training deep feedforward neural networks” in PyTorch and to extend the same analysis to more scenarios and metrics.

I have a few questions about the two kinds of gradients that are analyzed and how to extract them in PyTorch:

  • The weight gradient is dL/dWi.
    - Is the back-propagated error dL/dXi?

I think the text mentions them in the opposite order from the one in which the equations (13 and 14) are shown, which may be the source of my confusion.

Is this the right way to extract the weight gradients?

How can I extract the back-propagated error per layer, so that I can then plot the histograms?
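
To make the question concrete, here is a minimal sketch of one possible way to collect both quantities. It is not taken from the paper. It assumes that the "back-propagated gradient" means the gradient of the loss with respect to each linear layer's output (the pre-activation), and it uses a small hypothetical tanh MLP with random data. The layer sizes, the names `backprop_grads`, `weight_grads` and `save_output_grad`, and the choice of 20 histogram bins are all made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small tanh MLP; the sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(20, 16), nn.Tanh(),
    nn.Linear(16, 16), nn.Tanh(),
    nn.Linear(16, 4),
)

backprop_grads = {}  # layer name -> dL/d(output of that Linear layer)

def save_output_grad(name):
    # Forward hook that attaches a tensor hook to the layer's output,
    # so the gradient flowing back into that output is stored.
    def forward_hook(module, inputs, output):
        def tensor_hook(grad):
            backprop_grads[name] = grad.detach().clone()
        output.register_hook(tensor_hook)
    return forward_hook

linear_layers = {
    name: m for name, m in model.named_modules() if isinstance(m, nn.Linear)
}
for name, layer in linear_layers.items():
    layer.register_forward_hook(save_output_grad(name))

# Random batch standing in for real data (assumption).
x = torch.randn(32, 20)
y = torch.randint(0, 4, (32,))

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Weight gradients dL/dW_i are available on the parameters after backward().
weight_grads = {
    name: layer.weight.grad.detach().clone()
    for name, layer in linear_layers.items()
}

# One histogram per layer for each kind of gradient.
backprop_hists = {n: torch.histc(g, bins=20) for n, g in backprop_grads.items()}
weight_hists = {n: torch.histc(g, bins=20) for n, g in weight_grads.items()}
```

In this sketch `weight_grads[name]` has the same shape as the layer's weight matrix, while `backprop_grads[name]` has shape (batch size, layer width), so each histogram of back-propagated gradients pools over both the batch and the units of one layer.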