I need to capture the correlation between the channels for my task. I was wondering if depth-wise separable convolution is able to do so. As far as I know, it just takes the number of input and output channels and I am not sure if it is able to capture dependencies between the channels at different receptive fields with different kernel sizes. Do you guys suggest any other forms of convolution to do so? Why don’t people simply use another convolution for this task in well-known architectures as they do for temporal or spatial subspaces?
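For context, here is a minimal sketch of what a depthwise separable convolution actually computes (all sizes hypothetical): the depthwise part convolves each channel independently, and only the 1x1 pointwise part mixes information across channels.

```python
import torch
from torch import nn

# Depthwise separable convolution = depthwise conv + 1x1 pointwise conv.
# The depthwise stage (groups=C_in) convolves each channel on its own;
# cross-channel mixing happens only in the pointwise 1x1 stage.
C_in, C_out = 8, 16
depthwise = nn.Conv2d(C_in, C_in, kernel_size=3, padding=1, groups=C_in)
pointwise = nn.Conv2d(C_in, C_out, kernel_size=1)

x = torch.randn(1, C_in, 32, 32)
out = pointwise(depthwise(x))  # shape: (1, 16, 32, 32)
```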
Would “correlation between channels” mean that each channel of size
[height, width] would be correlated with its neighboring channels?
If so, you could probably slice a single channel from the input activation and use the functional API to perform the convolution (which is in fact performing a cross-correlation) with each other channel.
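A minimal sketch of that idea (all shapes hypothetical): use one channel as a full-size kernel and cross-correlate it against every other channel via `F.conv2d`.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 16, 16)      # hypothetical activation: (N, C, H, W)

# Reuse channel 0 as a (1, 1, 16, 16) "kernel"; a kernel that spans the
# whole spatial extent yields the cross-correlation at zero lag.
weight = x[:, 0:1]
corrs = []
for c in range(1, x.size(1)):
    other = x[:, c:c + 1]          # (1, 1, 16, 16)
    corrs.append(F.conv2d(other, weight))  # single scalar per channel pair
out = torch.cat(corrs, dim=1)      # (1, 7, 1, 1)
```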
Thanks for the response. Yeah, I meant that. Why should it be a 2D convolution? Isn’t it possible to deploy a separate 1D convolution over the channels? To make it clearer, consider a tensor (video) of size T * C * H * W, where T is the number of time steps, C the number of channels, and H, W the spatial height and width. I have captured the spatial information using ResNet and the temporal information using a 1D temporal convolution. However, there is still some useful information in the semantic subspace that can be extracted from the correlations between the channels. Could I use a separate 1D channel-wise convolution for that, with a kernel size of 3? Why don’t people care about the correlation between the channels? (Most of the popular networks, like ResNet, don’t have that.)
I don’t understand this approach.
This would calculate the cross-correlation between (randomly initialized) filters and each channel in the activation.
I thought you would like to calculate the correlation between channels, i.e.:
- corr(channel0, channel1)
- corr(channel0, channel2)
I’m not sure what your actual use case is or how this would be used, so you would need to explain a bit more. Pseudocode might be helpful, too.
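For reference, pairwise channel correlations in that sense could be sketched like this (shapes hypothetical; each channel is flattened and a Pearson correlation is taken per sample):

```python
import torch

x = torch.randn(4, 3, 8, 8)   # hypothetical activation: (N, C, H, W)
flat = x.flatten(2)           # (N, C, H*W)

# Pearson correlation between channel 0 and channel 1, per sample
a = flat[:, 0] - flat[:, 0].mean(dim=1, keepdim=True)
b = flat[:, 1] - flat[:, 1].mean(dim=1, keepdim=True)
corr01 = (a * b).sum(dim=1) / (a.norm(dim=1) * b.norm(dim=1))
# corr01 has shape (4,): one correlation value in [-1, 1] per sample
```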
I want to take convolutions over the channels instead of over pixels. The input shape is “Batch size * Num_Channels * Num_Timesteps * H * W”. There are three subspaces in this task (action recognition): spatial, temporal, and semantic (channels). What should the number of input and output channels be in this regard? In the code below I multiplied the other subspaces’ dimensions (T * H * W) to get the number of channels. What should be the shape of the tensor that is going to be passed to “channel_conv1d”?
import torch
from torch import nn

class ChannelConv(nn.Module):
    def __init__(self, input_shape, kernel_size):
        super().__init__()
        self.kernel_size = kernel_size
        # input_shape is (C, T, H, W) for a single clip
        num_channels, num_timesteps, spatial_window_size_1, spatial_window_size_2 = input_shape
        # treat every (h, w, t) position as one "virtual channel"
        n_virtual_channels = num_timesteps * spatial_window_size_1 * spatial_window_size_2
        self.channel_conv1d = nn.Conv1d(n_virtual_channels, n_virtual_channels, kernel_size)

    def forward(self, input):
        BN, C, T, H, W = input.size()
        # move the channel dim to the end so the conv slides along it
        tensor = input.permute(0, 3, 4, 2, 1).contiguous()
        # Conv1d expects (batch, in_channels, length) -> (BN, H*W*T, C)
        tensor = tensor.view(BN, H * W * T, C)
        tensor = self.channel_conv1d(tensor)
        return tensor
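To make the shapes concrete, here is a self-contained sketch of the idea with hypothetical sizes (batch of 2 clips, C=16 channels, T=4 frames, 8x8 spatial, kernel size 3): the tensor reaching the 1D conv is (batch, T*H*W, C), so the convolution slides along the channel axis.

```python
import torch
from torch import nn

# Hypothetical sizes: batch of 2 clips, 16 channels, 4 frames, 8x8 spatial
B, C, T, H, W = 2, 16, 4, 8, 8
x = torch.randn(B, C, T, H, W)

# Every (h, w, t) position becomes one "virtual channel"; the 1D conv
# then slides along the channel axis (length = C).
n_virtual_channels = T * H * W
channel_conv1d = nn.Conv1d(n_virtual_channels, n_virtual_channels, kernel_size=3)

t = x.permute(0, 3, 4, 2, 1).contiguous()  # (B, H, W, T, C)
t = t.view(B, H * W * T, C)                # Conv1d input: (batch, in_channels, length)
out = channel_conv1d(t)                    # (B, H*W*T, C - 2) without padding
```

Note that without padding the channel dimension shrinks by kernel_size - 1, so you may want padding=1 if the output should keep all C positions.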