GradCAM zero attributions: GAP vs no-GAP Neural Network

I am studying the well-known Kaggle medical dataset, and I have trained two variations of an EfficientNet-B2 network:

  • one structured as “conv layers - GAP layer - class scores” (Network 1)
    (i.e. the stock model as downloaded from PyTorch torchvision)
  • one structured as “conv layers - Flatten - class scores” (Network 2)
    (i.e. I removed the GAP layer and flattened the output of the last conv layer; see the sketch after this list)
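
Roughly, this is how the two variants can be built (a minimal sketch, not my exact training code; the 2-class head and the 260×260 input size are illustrative placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2     # placeholder: the actual number of classes may differ
INPUT_SIZE = 260    # placeholder: EfficientNet-B2's nominal resolution

# Network 1: stock torchvision EfficientNet-B2 (conv layers -> GAP -> class scores)
net1 = models.efficientnet_b2(weights=models.EfficientNet_B2_Weights.DEFAULT)
net1.classifier[1] = nn.Linear(net1.classifier[1].in_features, NUM_CLASSES)

# Network 2: GAP removed (conv layers -> Flatten -> class scores).
# torchvision's forward() already flattens after avgpool, so replacing the
# pooling layer with Identity leaves the full (C * h * w) feature vector.
net2 = models.efficientnet_b2(weights=models.EfficientNet_B2_Weights.DEFAULT)
net2.avgpool = nn.Identity()
with torch.no_grad():
    n_feats = net2.features(torch.zeros(1, 3, INPUT_SIZE, INPUT_SIZE)).flatten(1).shape[1]
net2.classifier = nn.Sequential(nn.Dropout(p=0.3), nn.Linear(n_feats, NUM_CLASSES))
```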

The networks are comparable in terms of generalization ability, as they yield similar classification reports on the test set. The test set consists of 10% of the data per class, 2119 images in total.

  • Network 1 test accuracy: 1997/2119 (≈ 94.2%)
  • Network 2 test accuracy: 1971/2119 (≈ 93.0%)

I then applied GradCAM (the ReLU version; a minimal sketch of the computation follows the list below) to the correctly classified test images of both networks, and observed the following odd result regarding the number of all-zero GradCAM maps produced by the two structures:

  • Network 1: 8 out of the 1997 attribution maps are all-zero (the sum of all elements is 0)
  • Network 2: 1037 out of the 1971 attribution maps are all-zero
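
The GradCAM computation is the standard ReLU formulation, sketched below for either network (`model.features[-1]` as the target layer is my assumption for torchvision's EfficientNet, and `correct_test_samples` is a placeholder for the correctly classified (image, label) pairs):

```python
import torch
import torch.nn.functional as F

def gradcam(model, target_layer, x, class_idx):
    """Standard GradCAM: cam = ReLU(sum_k alpha_k * A_k), where
    alpha_k is the spatial mean of d(score)/d(A_k). Minimal sketch."""
    acts, grads = {}, {}
    fh = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    model.zero_grad()
    model(x)[0, class_idx].backward()   # gradient of the class score
    fh.remove(); bh.remove()

    A, dA = acts["a"][0], grads["g"][0]          # both of shape (K, h, w)
    alpha = dA.mean(dim=(1, 2))                  # per-channel weights
    return F.relu((alpha[:, None, None] * A).sum(dim=0))

model.eval()
n_zero = 0
for x, y in correct_test_samples:                # placeholder iterable
    cam = gradcam(model, model.features[-1], x.unsqueeze(0), y)
    if cam.sum() == 0:                           # all-zero map (values are >= 0 after ReLU)
        n_zero += 1
```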

In other words, replacing GAP with Flatten seems to have a strong influence on the quality of the produced maps, as Network 2 produces over 1000 more all-zero maps.

Is there an intuition/explanation for this? What could be the cause?