I have trained a CNN with many feature maps. When inspecting the kernel weights of the individual conv2d layers, I noticed that in the first layer, right after the raw input, two of the 16 kernels are all zero. The kernels in all the other layers are populated with non-zero values. Is this a common thing to happen? The model seems to fit my training data so far, so I guess this is not a big problem?
I have never seen this happen, and I don’t believe that it is common,
so you should definitely check your code for a bug or design error.
I think it could plausibly happen, though. Here’s a not-too-far-fetched
scenario that I could imagine:
Weight decay, if you are using it, would tend to drive your weights toward
zero. If the gradients for some of your weights end up being zero, there
will be nothing to push the weights back away from zero.
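
To make the weight-decay part of this concrete, here is a minimal sketch in plain Python. It assumes plain SGD with L2-style weight decay (the update `w <- w - lr * (grad + weight_decay * w)`) and a loss gradient that is exactly zero. The function name and the numbers are made up for illustration.

```python
def decay_only_updates(w0, lr, weight_decay, steps):
    """Track one weight under SGD + L2 weight decay when the loss gradient is zero."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 0.0  # assumption: nothing in the loss pushes back on this weight
        w = w - lr * (grad + weight_decay * w)
        history.append(w)
    return history


# Hypothetical hyperparameters, chosen only to make the trend visible.
history = decay_only_updates(w0=0.5, lr=0.1, weight_decay=0.01, steps=5000)
print(history[0], history[1000], history[-1])
```

Each step multiplies the weight by `1 - lr * weight_decay`, so under these assumptions it shrinks geometrically toward zero and nothing brings it back.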
If your forward pass goes through the flat part (x < 0) of a ReLU or the
saturated part (large positive or negative value) of a Sigmoid, you will
get zero (ReLU) or vanishingly small (Sigmoid) gradients on the backward
pass.
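
A small autograd check of that claim, as a sketch only (the input values are arbitrary). Note that for the Sigmoid the gradient in the saturated region is vanishingly small rather than guaranteed to be exactly zero.

```python
import torch

# ReLU: the gradient is 0 on the flat part (x < 0) and 1 where x > 0.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
relu_grad = x.grad.clone()

# Sigmoid: the gradient is s * (1 - s), which collapses for large |z|.
z = torch.tensor([-30.0, 0.0, 30.0], requires_grad=True)
torch.sigmoid(z).sum().backward()
sigmoid_grad = z.grad.clone()

print(relu_grad)
print(sigmoid_grad)
```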
Some things to check:
Do your weights start off zero immediately after (random) initialization?
Do your gradients become zero at some point? If so, can you identify
Do your weights decrease toward zero before becoming zero? Are
gradients small or zero while this is happening? This would be consistent
with weight decay and zero gradients driving the weights to zero.
If you retrain the network from scratch (with differing random initialization),
do the same kernels become zero every time, or do different sets of
kernels end up zero?
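
One possible way to run these checks is a small helper that lists which output kernels of a conv layer are all zero. You could call it right after initialization, periodically during training, and across runs with different seeds. This is only a sketch: `zero_kernel_indices` is a hypothetical helper, and the 1-channel, 16-kernel layer is a stand-in for your first layer, not your actual model.

```python
import torch
import torch.nn as nn


def zero_kernel_indices(conv, tol=0.0):
    """Return indices of output channels whose kernel entries are all <= tol in magnitude."""
    w = conv.weight.detach()
    # Largest absolute entry per output kernel (flatten in_channels x kH x kW).
    per_kernel_max = w.abs().flatten(start_dim=1).max(dim=1).values
    return torch.nonzero(per_kernel_max <= tol).flatten().tolist()


torch.manual_seed(0)
conv = nn.Conv2d(1, 16, kernel_size=3)  # stand-in for the first layer
nn.init.xavier_uniform_(conv.weight)
after_init = zero_kernel_indices(conv)

# Simulate two kernels having collapsed to zero, just to show the output.
with torch.no_grad():
    conv.weight[3].zero_()
    conv.weight[7].zero_()
after_collapse = zero_kernel_indices(conv)

print(after_init, after_collapse)
```

Passing a small positive `tol` would also catch kernels that are shrinking toward zero but have not reached it yet.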
Thanks for the reply!
Yes, I was in fact using weight decay in that particular instance. I am currently training several models with and without weight decay, at different levels, and will see whether the weight decay is the cause as soon as training is done.
Also, yes, I am using ReLU between the conv layers, followed by Batch Normalization and Dropout. But isn't it common to use ReLU plus weight decay in conv networks?
I will check the initial weights and their values within the first few iterations, as well as the gradients. Do you have a good procedure for monitoring the gradients within the network? I guess it's done via a backward hook, right? I would then collect the gradients every time a batch has been processed and compute the average of their elementwise squared entries, or something like that?
Also, I could probably inspect the ReLU activation output directly, without inspecting the gradients, right? If the ReLU outputs all zeros repeatedly, could this be the cause?
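
A sketch of both ideas, under stated assumptions: the model below is a hypothetical stand-in, not the network from this thread, and the data is random. Instead of a backward hook, it simply reads `.grad` on the first conv layer after `loss.backward()`, which is often enough for per-batch monitoring. A forward hook on the ReLU records how often each channel outputs exactly zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model (conv -> ReLU -> BatchNorm -> classifier).
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(16),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 2),
)

dead_fraction = {}


def record_dead_fraction(name):
    def hook(module, inputs, output):
        # Fraction of exactly-zero activations per channel,
        # averaged over the batch and spatial dimensions.
        dead_fraction[name] = (output == 0).float().mean(dim=(0, 2, 3)).detach()
    return hook


handle = model[1].register_forward_hook(record_dead_fraction("relu1"))

x = torch.randn(32, 1, 8, 8)    # random stand-in batch
y = torch.randint(0, 2, (32,))  # random stand-in labels
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# Mean squared gradient per output kernel of the first conv layer.
grad_sq = model[0].weight.grad.pow(2).mean(dim=(1, 2, 3))
handle.remove()

print(grad_sq)
print(dead_fraction["relu1"])
```

A channel whose dead fraction stays at 1.0 over many batches, together with a gradient statistic of zero for the matching kernel, would be consistent with the scenario described above.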
Have you tried initializing the weights with a Kaiming or Xavier initializer? Could it be that the weights were zero from the start and stayed that way?
Thanks for the reply!
Yes, I am initializing all the conv layers with Xavier. For the batch normalization layers, I leave the weights at all 1 and the biases at all 0. Is that OK, or should they also be randomized?
Not sure whether BatchNorm has weights and biases
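
For what it's worth, this is quick to check. In PyTorch, `nn.BatchNorm2d` has learnable `weight` and `bias` parameters when `affine=True` (the default), and they start at all ones and all zeros respectively. A minimal check, assuming 16 channels to match the first layer:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)  # affine=True by default

print(bn.weight.shape, bn.bias.shape)
print(torch.all(bn.weight == 1).item(), torch.all(bn.bias == 0).item())
```

So leaving them at 1 and 0 matches the library's default initialization.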