Assume that a convolutional layer consists of a single 3x3 kernel and the input depth is 5. In this case, the kernel is in fact a 3x3x5 filter, and the dot product of the inputs and the filter is added to a bias to generate a single output. How can I modify this in PyTorch to implement something like the following?

1. Find the dot product of the filter with each input channel separately.

2. Connect the output of each channel to another neuron with trainable weights and a bias.

In other words, instead of summing the output of different input channels, I would like to find a weighted sum of these dot products.

Using groups was one of the solutions I thought of, but I think the outputs are concatenated automatically, so I will not have access to outputs in order to feed them to learnable weights in the next layer.

Is it possible to implement this structure using basic operations in PyTorch?
If yes, how could the answer be generalized to k kernels of size n x n x d?

As far as I understand, you would like to use a channel-wise convolution with a “per-channel” weighting?
Using groups=5 (with 5 output channels), each input channel will have its own conv kernel, and the per-channel outputs stay separate as 5 output channels.
In a standard convolution, the dot products of all input channels are instead summed to create a single scalar output.
Now, instead of summing the dot products, you would like to multiply each one with a scalar (and add a bias), right?
Well, I’m not sure you will get any benefit from this, since it’s a linear operation.
Assuming each kernel performs the following operation:

out(N, C) = bias(C) + \sum[ weight(C, k) x input(N, k) ]

Now you could add your weighting like this (let’s name the scalar a and the bias b):

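out(N, C) = bias(C) + \sum[ a * (weight(C, k) x input(N, k)) + b ]
out(N, C) = bias(C) + \sum[ (a * weight(C, k)) x input(N, k) + b ]
out(N, C) = [bias(C) + K * b] + \sum[ (a * weight(C, k)) x input(N, k) ]

Here K is the number of summed terms, so the scalar a only rescales the weights and the b terms only shift the bias.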

In my opinion, both the scalar a and the bias term b can be absorbed into the conv weights and the conv bias.
Assume the model starts with kernel weights of ones and the ideal weighting of the per-channel outputs is [1, 1, 1, 1, 5]. Using your approach, the model could learn these weights, but on the other hand, the model could just as well scale up the kernel weights directly and compensate with the bias.
It’s similar to stacking some linear layers on top of each other without a non-linearity. f = w1 * w2 * x can be written as f = w3 * x.
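
The collapse described above can be checked numerically. The following is a minimal sketch, assuming the two-step structure is built from a grouped `nn.Conv2d` followed by a 1x1 `nn.Conv2d`; the variable names (`depthwise`, `pointwise`, `merged_weight`, `merged_bias`) and the input size are illustrative assumptions, not something taken from the question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
in_channels = 5

# Step 1: one 3x3 kernel per input channel; groups=in_channels keeps the channels separate.
depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3, groups=in_channels)
# Step 2: a trainable weighted sum over the 5 per-channel outputs, plus a bias.
pointwise = nn.Conv2d(in_channels, 1, kernel_size=1)

x = torch.randn(2, in_channels, 8, 8)
per_channel = depthwise(x)         # (2, 5, 6, 6): one feature map per input channel
two_step = pointwise(per_channel)  # (2, 1, 6, 6)

# Fold both linear steps into a single ordinary 3x3x5 filter.
with torch.no_grad():
    a = pointwise.weight.view(in_channels)  # the per-channel scalars
    # Scale each per-channel 3x3 kernel by its scalar: (5, 3, 3) -> (1, 5, 3, 3).
    merged_weight = (depthwise.weight.squeeze(1) * a.view(-1, 1, 1)).unsqueeze(0)
    # The depthwise biases pass through the weighted sum and join the second bias.
    merged_bias = (a * depthwise.bias).sum().view(1) + pointwise.bias
    one_step = F.conv2d(x, merged_weight, merged_bias)

same = torch.allclose(two_step.detach(), one_step, atol=1e-5)
```

If `same` is true, the extra weighting layer adds no expressive power as long as nothing non-linear sits between the two steps. Note also that `per_channel` is directly accessible here, so the grouped outputs can be fed to a following layer.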

Could you explain a bit more about your use case?
Maybe I’m missing something.

I agree with what you have explained. However, what I had in mind was to apply a non-linearity after the first layer. Therefore, we cannot combine the two steps because the transformations are not linear anymore.
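
With a non-linearity in between, one way to build this from basic PyTorch modules might look like the sketch below. It assumes a ReLU as the non-linearity and generalizes to k kernels of size n x n over d input channels; the class name `PerChannelWeightedConv` and its argument names are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn as nn


class PerChannelWeightedConv(nn.Module):  # hypothetical name
    def __init__(self, in_channels, num_kernels, kernel_size):
        super().__init__()
        self.d, self.k = in_channels, num_kernels
        # k separate n x n kernels for every input channel; nothing is summed across channels.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels * num_kernels, kernel_size, groups=in_channels
        )
        self.act = nn.ReLU()  # assumed non-linearity between the two steps
        # For each of the k kernels: a weighted sum over its d channel responses, plus a bias.
        self.mix = nn.Conv2d(
            in_channels * num_kernels, num_kernels, kernel_size=1, groups=num_kernels
        )

    def forward(self, x):
        y = self.act(self.depthwise(x))  # (N, d*k, H', W'), channel index = c*k + j
        n, _, h, w = y.shape
        # Reorder channels to j*d + c so that each group of the 1x1 conv
        # sees the d responses belonging to one kernel index j.
        y = y.view(n, self.d, self.k, h, w).transpose(1, 2)
        y = y.reshape(n, self.d * self.k, h, w)
        return self.mix(y)  # (N, k, H', W')


layer = PerChannelWeightedConv(in_channels=5, num_kernels=4, kernel_size=3)
out = layer(torch.randn(2, 5, 8, 8))
single = PerChannelWeightedConv(in_channels=5, num_kernels=1, kernel_size=3)
out_single = single(torch.randn(2, 5, 8, 8))
```

Setting `num_kernels=1` gives the single-kernel case from the question: five per-channel 3x3 dot products, a ReLU, and then one neuron with five trainable weights and a bias.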