How can I simply get gradient statistics (for example minibatch variance) during training, to monitor them?
For intra-batch “per-sample contributions”: no, there isn’t a way in general, because autograd accumulates the gradients over the batch before you ever see them, though there are some tricks you could try.
For inter-batch statistics, you can do something similar to what the Adam optimizer (and related optimizers like LAMB) does: keep running estimates of the first and second moments of the gradient. That is essentially one of the Welford-style online algorithms for the variance, with more or less sophistication around subtracting the mean.
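As a minimal sketch of the inter-batch idea (the class name and the choice of tracking a scalar per step are my own, not from any library):

```python
class RunningStats:
    """Welford's online algorithm for the mean and variance of a stream
    of values -- here, one gradient statistic logged per minibatch."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for g in [1.0, 2.0, 3.0, 4.0]:  # pretend these are per-step gradient norms
    stats.update(g)
print(stats.mean, stats.variance)  # → 2.5 1.6666666666666667
```

In a training loop you would call something like `stats.update(p.grad.norm().item())` after each backward pass (per parameter, or summed over parameters, whatever you want to monitor). Adam itself uses exponential moving averages instead of the exact running mean, which weights recent batches more heavily but is the same idea.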
Best regards
Thomas