Weight channels of a ConvNeXt model

I am trying to get the weights of a Conv2d layer of the ConvNeXt model from torchvision.

When I access the weights like the following:

model.features[m][n].block[b].weight

For example, for this layer:

Conv2d(192, 192, kernel_size=(7, 7), stride=(1, 1), padding=(3, 3), groups=192)

the weight shape is

(192, 1, 7, 7)

Normally we would expect (192, 192, 7, 7). Is it because the output channels are grouped with groups=192, so that each group of output channels shares the same weights?

Yes. The groups=192 argument means that each filter uses a single input channel to create its activation map (output channel), instead of using all input channels as in the default setup.
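
You can check this directly; for a grouped Conv2d the weight shape is (out_channels, in_channels / groups, kernel_h, kernel_w):

import torch.nn as nn

# Depthwise conv: groups == in_channels, so each filter sees exactly one input channel
dw = nn.Conv2d(192, 192, kernel_size=7, padding=3, groups=192)
print(dw.weight.shape)      # torch.Size([192, 1, 7, 7])

# Default (dense) conv for comparison: each filter sees all 192 input channels
dense = nn.Conv2d(192, 192, kernel_size=7, padding=3)
print(dense.weight.shape)   # torch.Size([192, 192, 7, 7])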

Thank you very much for your lightning response!

I am trying to visualize the weights of the ConvNeXt model.

The following is a function from lucid (TensorFlow) that computes the linearized connection ("expanded weights") between two layers.
I was wondering if you know what the PyTorch equivalents of the following elements are:

tf.placeholder_with_default(tf.zeros([1,224,224,3]), [None,None, None, 3]),
tf.placeholder("int32", []),
tf.gradients(t_center[n_chan2], [T(layer1)])[0],
grad.eval(...)

I have tried torch.autograd.grad(x, y, torch.ones_like(y)), but it does not seem to work.
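
For reference, torch.autograd.grad takes the outputs first and the inputs second, and grad_outputs has to match the shape of the outputs, not the inputs. A minimal example of the signature:

import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum() * torch.ones(4)   # y (shape [4]) depends on x (shape [3])

# grad(outputs, inputs, grad_outputs): grad_outputs must be shaped like y, not x
g = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))[0]
print(g)                             # equals 8 * x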


import functools

import numpy as np
import tensorflow as tf

# lucid helpers (import paths may differ slightly between lucid versions);
# MaxAsAvgPoolGrad is a gradient-override helper used with gradient_override_map
# in the lucid / Circuits weight-visualization notebook
import lucid.optvis.render as render
from lucid.misc.gradient_override import gradient_override_map


@functools.lru_cache(128)
def get_expanded_weights(model, layer1, layer2, W=5):

  """Get the "expanded weights" between two layers.

  Arguments:
    model: model to get expanded weights from
    layer1: earlier layer to expand weights between
    layer2: later layer to expand weights between
    W: spatial width of expanded weights

  Returns:
    Expanded weights as numpy array of shape 
    [W, W, layer1 channels, layer2 channels]


  Discussion:

  Sometimes the meaningful weight interactions are between neurons which aren’t 
  literally adjacent in a neural network, or where the weights aren’t directly 
  represented in a single weight tensor. A few examples:

  * In a residual network, the output of one neuron can pass through the 
    additive residual stream and linearly interact with a neuron much later 
    in the network.
  * In a separable convolution, weights are stored as two or more factors, 
    and need to be expanded to link neurons.
  * In a bottleneck architecture, neurons in the bottleneck may primarily be 
    a low-rank projection of neurons from the previous layer.
  * The behavior of an intermediate layer simply doesn’t introduce much 
    non-linear behavior, leaving two neurons in non-adjacent layers with a 
    significant linear interaction.

  As a result, we often work with “expanded weights” -- that is, the result 
  of multiplying adjacent weight matrices, potentially ignoring non-linearities. 
  We generally implement expanded weights by taking gradients through our model, 
  ignoring or replacing all non-linear operations with the closest linear one.

  These expanded weights have the following properties:

  * If two layers interact linearly, the expanded weights will give the true 
    linear map, even if the model doesn’t explicitly represent the weights in a 
    single weight matrix.
  * If two layers interact non-linearly, the expanded weights can be seen as 
    the expected value of the gradient up to a constant factor, under the 
    assumption that all neurons have an equal (and independent) probability of 
    firing.

  They also have one additional benefit, which is more of an implementation 
  detail: because they’re implemented in terms of gradients, you don’t need to 
  know how the weights are represented. For example, in TensorFlow, you don’t 
  need to know which variable object represents the weights. This can be a 
  significant convenience when you’re working with unfamiliar models!
  
  """

  # Set up a graph for doing attribution...
  with tf.Graph().as_default(), tf.Session(), gradient_override_map({"Relu": lambda op, grad: grad, "MaxPool": MaxAsAvgPoolGrad}):
    t_input = tf.placeholder_with_default(tf.zeros([1,224,224,3]), [None,None, None, 3])
    T = render.import_model(model, t_input, t_input)

    # Compute activations; this gives us numpy arrays with the right number of channels
    acts1 = T(layer1).eval()
    acts2 = T(layer2).eval()

    # Compute gradient from center; due to overrides this just multiplies out the weights
    t_offset = (tf.shape(T(layer2))[1]-1)//2
    t_center = T(layer2)[0, t_offset, t_offset]
    n_chan2 = tf.placeholder("int32", [])
    t_grad = tf.gradients(t_center[n_chan2], [T(layer1)])[0]
    arr = np.stack([t_grad.eval({n_chan2: i, T(layer1): acts1[:,0:W,0:W]})[0] for i in range(acts2.shape[-1])], -1)

    return arr
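
In case it helps, here is a rough PyTorch sketch of the same idea using forward hooks and torch.autograd.grad. It is only a sketch, not a faithful port: the layers (model.features[3] / model.features[5]) are arbitrary examples, the W x W window is cropped around the spatial center of layer1 instead of being fed back through a placeholder, and replacing GELU with Identity is a crude stand-in for lucid's gradient overrides.

import torch
import torch.nn as nn
import torchvision

def get_expanded_weights_pt(model, layer1, layer2, W=5):
  """Sketch of a PyTorch analogue: gradient of the center pixel of every
  layer2 channel w.r.t. the layer1 activations, cropped to a W x W window.
  Returns a tensor of shape [layer2 channels, layer1 channels, W, W]."""
  acts = {}

  def save(name):
    def hook(module, inputs, output):
      acts[name] = output
    return hook

  h1 = layer1.register_forward_hook(save("l1"))
  h2 = layer2.register_forward_hook(save("l2"))

  x = torch.zeros(1, 3, 224, 224)   # analogue of placeholder_with_default(tf.zeros(...))
  _ = model(x)
  h1.remove()
  h2.remove()

  a1, a2 = acts["l1"], acts["l2"]
  cy2, cx2 = (a2.shape[-2] - 1) // 2, (a2.shape[-1] - 1) // 2
  cy1, cx1 = (a1.shape[-2] - 1) // 2, (a1.shape[-1] - 1) // 2
  y0, x0 = cy1 - W // 2, cx1 - W // 2

  rows = []
  for i in range(a2.shape[1]):      # plays the role of the n_chan2 placeholder
    # analogue of tf.gradients(t_center[n_chan2], [T(layer1)])[0]; the result is
    # already a concrete tensor, so there is no separate grad.eval(...) step
    g = torch.autograd.grad(a2[0, i, cy2, cx2], a1, retain_graph=True)[0]
    rows.append(g[0, :, y0:y0 + W, x0:x0 + W])
  return torch.stack(rows, 0).detach()

model = torchvision.models.convnext_tiny(weights=None).eval()

# Crude stand-in for lucid's gradient_override_map({"Relu": ...}): swap every GELU
# for Identity so the nonlinearity is ignored when the weights are multiplied out
for parent in model.modules():
  for name, child in parent.named_children():
    if isinstance(child, nn.GELU):
      setattr(parent, name, nn.Identity())

# Example layers (assumptions): stage-2 and stage-3 outputs of convnext_tiny
w = get_expanded_weights_pt(model, model.features[3], model.features[5], W=5)
print(w.shape)   # torch.Size([384, 192, 5, 5])

Note that the LayerNorm layers are still applied as-is here, so the result is a linearization around the zero input rather than an exact product of weight matrices.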